Specific Implementation of Hash Tables in Python (dict)
Description
The dictionary (dict) in Python is a typical implementation of a hash table, used to store key-value pairs. It supports efficient insertion, lookup, and deletion operations with an average time complexity of O(1). Unlike Java's HashMap, Python's dict uses open addressing to resolve hash collisions and ensures performance through dynamic resizing. This topic will detail the internal structure of dict, hash collision resolution strategies, resizing mechanisms, and memory management.
1. Internal Structure: Index Table and Entry Array
Python's dict consists of two core arrays:
- Index Table (Indices): A sparse array storing indices (integers) of entries in the entry array. Initial size is 8, with each position storing -1 (indicating an empty slot).
- Entry Array (Entries): Stores the actual key-value pair data, each entry containing three fields: hash value, key, and value. Entries are arranged in insertion order.
Example: Inserting key "a" (hash value 97)
- Calculate index:
97 % 8 = 1(index table size is 8). - Check index table position 1: If it's -1, append a new entry to the entry array (hash=97, key="a", value=...), and update index table position 1 to this entry's index (e.g., 0).
- If position 1 is already occupied (hash collision), enter the collision resolution process.
2. Hash Collision Resolution: Pseudo-Random Probing
Python uses Pseudo-Random Probing from the open addressing method to resolve collisions:
- When the target index is occupied, calculate the next probe position using the formula:
new_index = (5 * current_index + 1) % 2^i(where i is the exponent of the index table size, e.g., i=3 when size is 8). - The probe sequence covers all indices, avoiding clustering.
Example: Assuming index 1 is occupied, continuing to insert key "b" (hash value 98):
- Calculate initial index:
98 % 8 = 2. - If index 2 is free, insert directly; if occupied, calculate the next index using the formula:
- First probe:
(5*2 + 1) % 8 = 11 % 8 = 3 - Second probe:
(5*3 + 1) % 8 = 16 % 8 = 0, and so on until an empty slot is found.
- First probe:
3. Dynamic Resizing (Expansion and Shrinking)
Resizing is triggered when the number of entries reaches 2/3 of the index table size to reduce collision probability:
- Expansion Rule: New index table size is 4 times the current size (e.g., 8→32), until exceeding 50,000 entries, then it becomes 2 times.
- Rehashing: All entries are remapped to the new index table (
hash % new_size). - Shrinking Condition: After deleting a large number of entries, if the entry count is less than 1/2 of the index table size, it may shrink to a minimum of 8.
Example: A dict with initial size 8 triggers expansion when 6 entries are inserted (6 ≥ 8*2/3 ≈ 5.33):
- New index table size is 32.
- Traverse the entry array, recalculating the index for each key (e.g.,
97 % 32 = 1). - Initialize the new index table with -1 and refill with entry indices.
4. Memory Optimization: Compact Layout
Python 3.6+ dict uses a compact structure to improve cache efficiency:
- The entry array stores entries in insertion order; during iteration, the entry array is traversed directly (not the index table), ensuring order preservation.
- The index table only stores mapping relationships, while the entry array stores key-value pairs contiguously, reducing memory fragmentation.
5. Special Mechanism: Deletion Operation and Tombstone Marking
When deleting an entry, to avoid breaking the probe chain, Tombstone Marking is used:
- Mark the corresponding position in the index table with a special value (e.g., -2), instead of directly setting it to -1.
- Tombstone positions can be reused when inserting new entries; tombstones are cleared during resizing.
Example: Deleting key "a" (original index 1):
- Mark position 1 in the index table as a tombstone (-2).
- Retain the entry in the entry array but mark it as invalid.
- When subsequently inserting key "c" (hash value 99), if the calculated index is 1 and it's found to be a tombstone, overwrite it directly.
Summary
Python's dict achieves an efficient and memory-friendly hash table through mechanisms such as separating the index table and entry array, pseudo-random probing, dynamic resizing, and tombstone marking. Its design offers unique advantages in collision resolution, ordered iteration, and cache optimization, serving as an exemplary combination of engineering practice and theory.