Hash Table Collision Resolution Strategies

Hash Table Collision Resolution Strategies

A hash table is an efficient data structure that maps keys to specific positions in an array via a hash function, enabling average O(1) time complexity for insertion, deletion, and lookup operations. However, a core challenge arises: different keys may be mapped to the same array position, a situation known as a "hash collision". The primary strategies for resolving collisions fall into two main categories: Open Addressing and Separate Chaining.

一、 Problem Background: Why Do Collisions Occur?

Consider a simple hash function: hash(key) = key % 7. We take the remainder of the key value divided by 7 and use the result as the array index (array length is 7).

Insert key 14: 14 % 7 = 0, placed at index 0.
Insert key 21: 21 % 7 = 0, also maps to index 0.

At this point, index 0 is already occupied. This is a collision. We must have a strategy to decide where the new key 21 should go.

二、 Solution 1: Open Addressing

Core Idea: When a collision occurs, search for the next available slot in the hash table according to a predetermined "probe sequence" until an empty slot is found. The hash table itself (the underlying array) is the sole storage location for all elements.

Key Point: The probe sequence must be able to traverse the entire hash table; otherwise, insertion efficiency plummets as the table nears capacity. The load factor (number of stored elements / total hash table capacity) must be strictly controlled, typically requiring resizing when it exceeds 0.7-0.8.

Common probing methods include:

Linear Probing
- Description: When position i is occupied, we sequentially check i+1, i+2, i+3, ... until the end of the array, then wrap around to the beginning 0, 1, 2..., until an empty slot is found.
- Example: Using hash(key) = key % 7.
  - Insert 14 -> Position 0.
  - Insert 21 -> Position 0 collision -> Check position 1 -> Free -> Place at position 1.
  - Insert 8 -> 8 % 7 = 1 -> Position 1 occupied by 21 -> Collision -> Check position 2 -> Free -> Place at position 2.
- Advantages: Simple implementation.
- Disadvantages: Prone to "primary clustering", where contiguous occupied blocks form. Subsequent keys hashing into this block require multiple probes to find an empty slot, degrading performance.
Quadratic Probing
- Description: To mitigate primary clustering, the probe step size is the square of the probe count. The probe sequence is: i + 1², i + 2², i + 3², ... (usually modulo the table size).
- Example: Position i collision.
  - First probe: (i + 1) % table_size
  - Second probe: (i + 4) % table_size
  - Third probe: (i + 9) % table_size
- Advantages: Effectively avoids primary clustering of linear probing.
- Disadvantages: May suffer from "secondary clustering", and may not probe all positions (depends on table size and probe function design).
Double Hashing
- Description: Uses two hash functions. The first hash function hash1(key) determines the initial position. If a collision occurs, the probe step size is determined by the second hash function hash2(key). The probe sequence is: i + hash2(key), i + 2*hash2(key), i + 3*hash2(key), ... .
- Example: hash1(key) = key % 7, hash2(key) = 5 - (key % 5).
  - Insert 21, initial position i = 21 % 7 = 0 (collision).
  - Calculate step size d = 5 - (21 % 5) = 5 - 1 = 4.
  - First probe: (0 + 4) % 7 = 4, if free, place it there.
- Advantages: One of the best methods in open addressing, as different keys have different probe sequences, significantly reducing clustering.

Lookup Operation: In open addressing, to find key k, one must follow the exact same probe sequence used during insertion. If an empty slot is encountered, it indicates k is not in the table (because insertion would have placed it there upon finding an empty slot). If a deletion marker (often a special "tombstone" marker) is encountered, probing must continue.

Deletion Operation: Cannot simply empty the slot, as this would break the probe path for subsequent elements. The correct approach is to mark the slot as "deleted" (tombstone). During insertion, tombstone slots can be reused; during lookup, probing must continue upon encountering a tombstone.

三、 Solution 2: Separate Chaining

Core Idea: Instead of looking for the next empty slot, store all elements mapped to the same position in a linked list. Each position in the hash table (often called a "bucket" or "slot") no longer directly stores an element, but stores a pointer to the head of a linked list.

Key Point: Intuitive implementation; load factor can exceed 1.

Description:
1. The hash table is an array of pointers, each pointing to a linked list.
2. During insertion, compute the hash value to find the corresponding bucket (linked list), then insert the new element at the head (or tail) of that list.
3. During lookup and deletion, first find the corresponding bucket, then perform linear search and operations within the list.
Example: Again using hash(key) = key % 7.
- Insert 14 -> Place in bucket 0's list.
- Insert 21 -> Also place in bucket 0's list. Now bucket 0's list has two nodes: [21] -> [14] (assuming head insertion).
- Insert 8 -> Place in bucket 1's list.
Advantages:
- Very simple implementation.
- Effectively resolves collisions with good average performance.
- Load factor can be high, as long as lists don't become too long.
- Deletion is straightforward, simply removing the node from the list without special markers.
Disadvantages:
- Requires extra space for pointers.
- If the hash function is poorly designed, causing many elements to cluster in a few buckets, the lists become very long, degrading performance to O(n). Thus, a good hash function is crucial.
- Cache-unfriendly, as nodes are not stored contiguously in memory.

四、 Summary and Comparison

Feature	Open Addressing	Separate Chaining
Implementation Difficulty	Slightly more complex, requires handling probing and deletion markers	Simple, directly uses linked lists
Space Overhead	Smaller (uses only one array)	Larger (requires extra pointer space)
Load Factor	Must be less than 1 (typically <0.8), otherwise performance degrades sharply	Can be greater than 1, but lists should not be too long
Performance Impact	Susceptible to "clustering" phenomena	Performance depends on how well the hash function distributes data
Deletion Operation	Requires special handling (tombstone marker)	Direct list deletion, simple
Cache Performance	Good (data stored contiguously)	Poorer (data scattered)

In practice, Separate Chaining is more widely adopted (e.g., Java's HashMap) because it is more intuitive, less demanding on the hash function, and more stable when handling high load factors. Open Addressing holds advantages in scenarios requiring extreme memory savings or where cache performance is particularly critical.