Hash Table Collision Resolution Strategies
A hash table is an efficient data structure that maps keys to specific positions in an array via a hash function, enabling average O(1) time complexity for insertion, deletion, and lookup operations. However, a core challenge arises: different keys may be mapped to the same array position, a situation known as a "hash collision". The primary strategies for resolving collisions fall into two main categories: Open Addressing and Separate Chaining.
一、 Problem Background: Why Do Collisions Occur?
Consider a simple hash function: hash(key) = key % 7. We take the remainder of the key value divided by 7 and use the result as the array index (array length is 7).
- Insert key
14:14 % 7 = 0, placed at index 0. - Insert key
21:21 % 7 = 0, also maps to index 0.
At this point, index 0 is already occupied. This is a collision. We must have a strategy to decide where the new key 21 should go.
二、 Solution 1: Open Addressing
Core Idea: When a collision occurs, search for the next available slot in the hash table according to a predetermined "probe sequence" until an empty slot is found. The hash table itself (the underlying array) is the sole storage location for all elements.
Key Point: The probe sequence must be able to traverse the entire hash table; otherwise, insertion efficiency plummets as the table nears capacity. The load factor (number of stored elements / total hash table capacity) must be strictly controlled, typically requiring resizing when it exceeds 0.7-0.8.
Common probing methods include:
-
Linear Probing
- Description: When position
iis occupied, we sequentially checki+1,i+2,i+3, ... until the end of the array, then wrap around to the beginning0, 1, 2..., until an empty slot is found. - Example: Using
hash(key) = key % 7.- Insert
14-> Position 0. - Insert
21-> Position 0 collision -> Check position 1 -> Free -> Place at position 1. - Insert
8->8 % 7 = 1-> Position 1 occupied by21-> Collision -> Check position 2 -> Free -> Place at position 2.
- Insert
- Advantages: Simple implementation.
- Disadvantages: Prone to "primary clustering", where contiguous occupied blocks form. Subsequent keys hashing into this block require multiple probes to find an empty slot, degrading performance.
- Description: When position
-
Quadratic Probing
- Description: To mitigate primary clustering, the probe step size is the square of the probe count. The probe sequence is:
i + 1²,i + 2²,i + 3², ... (usually modulo the table size). - Example: Position
icollision.- First probe:
(i + 1) % table_size - Second probe:
(i + 4) % table_size - Third probe:
(i + 9) % table_size
- First probe:
- Advantages: Effectively avoids primary clustering of linear probing.
- Disadvantages: May suffer from "secondary clustering", and may not probe all positions (depends on table size and probe function design).
- Description: To mitigate primary clustering, the probe step size is the square of the probe count. The probe sequence is:
-
Double Hashing
- Description: Uses two hash functions. The first hash function
hash1(key)determines the initial position. If a collision occurs, the probe step size is determined by the second hash functionhash2(key). The probe sequence is:i + hash2(key),i + 2*hash2(key),i + 3*hash2(key), ... . - Example:
hash1(key) = key % 7,hash2(key) = 5 - (key % 5).- Insert
21, initial positioni = 21 % 7 = 0(collision). - Calculate step size
d = 5 - (21 % 5) = 5 - 1 = 4. - First probe:
(0 + 4) % 7 = 4, if free, place it there.
- Insert
- Advantages: One of the best methods in open addressing, as different keys have different probe sequences, significantly reducing clustering.
- Description: Uses two hash functions. The first hash function
Lookup Operation: In open addressing, to find key k, one must follow the exact same probe sequence used during insertion. If an empty slot is encountered, it indicates k is not in the table (because insertion would have placed it there upon finding an empty slot). If a deletion marker (often a special "tombstone" marker) is encountered, probing must continue.
Deletion Operation: Cannot simply empty the slot, as this would break the probe path for subsequent elements. The correct approach is to mark the slot as "deleted" (tombstone). During insertion, tombstone slots can be reused; during lookup, probing must continue upon encountering a tombstone.
三、 Solution 2: Separate Chaining
Core Idea: Instead of looking for the next empty slot, store all elements mapped to the same position in a linked list. Each position in the hash table (often called a "bucket" or "slot") no longer directly stores an element, but stores a pointer to the head of a linked list.
Key Point: Intuitive implementation; load factor can exceed 1.
-
Description:
- The hash table is an array of pointers, each pointing to a linked list.
- During insertion, compute the hash value to find the corresponding bucket (linked list), then insert the new element at the head (or tail) of that list.
- During lookup and deletion, first find the corresponding bucket, then perform linear search and operations within the list.
-
Example: Again using
hash(key) = key % 7.- Insert
14-> Place in bucket 0's list. - Insert
21-> Also place in bucket 0's list. Now bucket 0's list has two nodes:[21] -> [14](assuming head insertion). - Insert
8-> Place in bucket 1's list.
- Insert
-
Advantages:
- Very simple implementation.
- Effectively resolves collisions with good average performance.
- Load factor can be high, as long as lists don't become too long.
- Deletion is straightforward, simply removing the node from the list without special markers.
-
Disadvantages:
- Requires extra space for pointers.
- If the hash function is poorly designed, causing many elements to cluster in a few buckets, the lists become very long, degrading performance to O(n). Thus, a good hash function is crucial.
- Cache-unfriendly, as nodes are not stored contiguously in memory.
四、 Summary and Comparison
| Feature | Open Addressing | Separate Chaining |
|---|---|---|
| Implementation Difficulty | Slightly more complex, requires handling probing and deletion markers | Simple, directly uses linked lists |
| Space Overhead | Smaller (uses only one array) | Larger (requires extra pointer space) |
| Load Factor | Must be less than 1 (typically <0.8), otherwise performance degrades sharply | Can be greater than 1, but lists should not be too long |
| Performance Impact | Susceptible to "clustering" phenomena | Performance depends on how well the hash function distributes data |
| Deletion Operation | Requires special handling (tombstone marker) | Direct list deletion, simple |
| Cache Performance | Good (data stored contiguously) | Poorer (data scattered) |
In practice, Separate Chaining is more widely adopted (e.g., Java's HashMap) because it is more intuitive, less demanding on the hash function, and more stable when handling high load factors. Open Addressing holds advantages in scenarios requiring extreme memory savings or where cache performance is particularly critical.