Principles of Hash Tables and Collision Resolution
A hash table is an efficient data structure for storing and retrieving key-value pairs. It maps each key to a position in an array via a hash function, enabling average-case O(1) insertion, lookup, and deletion. However, a hash function may map different keys to the same position, causing a collision, so collision resolution strategies are necessary. Below, we explain the core principles of hash tables and the common collision resolution methods step by step.
1. Basic Structure of Hash Tables
The core of a hash table is an array (called the bucket array), where each position is called a "bucket." The hash function converts a key into an array index using the formula:
\[\text{index} = h(\text{key}) \bmod \text{bucket array size}\]
For example, for the key "apple," the hash function computes its hash value, takes the modulo, and obtains the index position to store the corresponding value (e.g., price).
Key Points:
- The hash function should distribute keys uniformly to avoid clustering.
- The bucket array size is usually a prime number to improve distribution uniformity.
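The index computation above can be sketched in a few lines of Python. The sum-of-character-codes hash below is a deliberately simple stand-in for illustration (real tables use stronger hashes, e.g. Python's built-in `hash()`); `toy_hash` and `bucket_index` are illustrative names, not library functions:

```python
def toy_hash(key: str) -> int:
    """Deliberately simple hash for illustration: sum of character codes."""
    return sum(ord(c) for c in key)

BUCKET_COUNT = 5  # a small prime-sized bucket array

def bucket_index(key: str) -> int:
    """index = h(key) mod bucket array size."""
    return toy_hash(key) % BUCKET_COUNT

print(bucket_index("apple"))  # → 0 (530 mod 5)
```

The same two steps, hash then modulo, appear in every hash table; only the hash function changes.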
2. Occurrence of Hash Collisions
A collision occurs when two different keys map to the same index via the hash function. For example:
- Keys "apple" and "banana" both map to index 2 after modulo operation.
- Without handling collisions, the later stored key would overwrite the previous value.
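A collision is easy to provoke with a toy hash. Under a sum-of-character-codes hash (an assumption for illustration, not a real implementation), any two anagrams produce the same hash value and therefore the same bucket:

```python
def toy_hash(key: str) -> int:
    # Anagrams collide because addition is order-independent.
    return sum(ord(c) for c in key)

BUCKET_COUNT = 5

i1 = toy_hash("listen") % BUCKET_COUNT
i2 = toy_hash("silent") % BUCKET_COUNT
assert i1 == i2  # two different keys, one bucket: a collision
```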
3. Collision Resolution Strategies
Collision resolution methods fall into two main categories: Separate Chaining and Open Addressing.
3.1 Separate Chaining
Principle: Each bucket does not directly store a key-value pair but stores a linked list (or another data structure, such as a red-black tree). All key-value pairs mapping to the same index are added to this list.
Operation Process:
- Insert: Calculate the key's index. If the bucket is empty, create a new linked list node; if not, traverse the list:
- If the key already exists (based on key comparison), update its value.
- If the key does not exist, add the new key-value pair to the end (or head) of the list.
- Search: Calculate the index, traverse the list, and check if any key matches.
- Delete: Calculate the index, traverse the list, find the key, and delete the node.
Advantages:
- Simple to implement; the linked list can expand dynamically.
- Suitable for scenarios with frequent insertions and deletions.
Disadvantages:
- Performance degrades to O(n) when the linked list becomes too long. Optimization: When the list length exceeds a threshold, it can be converted to a balanced tree (e.g., treeification strategy in Java's HashMap).
Example:
Assume the bucket array size is 5, and keys "apple" and "banana" both map to index 2:
- Linked list at index 2:
["apple":1.2] → ["banana":0.8]
3.2 Open Addressing
Principle: All key-value pairs are stored directly in the bucket array. When a collision occurs, the next empty bucket is searched according to a probing sequence.
Common Probing Methods:
- Linear Probing:
  - On collision, check subsequent indices sequentially: \(h(key), h(key)+1, h(key)+2, \dots\) (all taken modulo the table size).
  - Issue: prone to primary clustering, which reduces efficiency.
- Quadratic Probing:
  - On collision, check indices with squared offsets: \(h(key), h(key)+1^2, h(key)+2^2, \dots\)
  - Reduces clustering, but may not visit every bucket.
- Double Hashing:
  - Use a second hash function to determine the step size: \(h_1(key), h_1(key) + i \cdot h_2(key)\)
  - More uniform distribution, but at a higher computational cost.
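The three probing sequences can be compared side by side. The table size and step values below are illustrative choices, not prescribed constants:

```python
TABLE_SIZE = 11  # prime, as recommended for better distribution

def linear_probe(h, i):
    return (h + i) % TABLE_SIZE

def quadratic_probe(h, i):
    return (h + i * i) % TABLE_SIZE

def double_hash_probe(h1, h2, i):
    # h2 must never be 0, or the sequence would not advance
    return (h1 + i * h2) % TABLE_SIZE

h = 3
print([linear_probe(h, i) for i in range(4)])          # → [3, 4, 5, 6]
print([quadratic_probe(h, i) for i in range(4)])       # → [3, 4, 7, 1]
print([double_hash_probe(h, 5, i) for i in range(4)])  # → [3, 8, 2, 7]
```

Note how linear probing walks adjacent slots (the source of clustering), while quadratic probing and double hashing spread the attempts across the table.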
Operation Process:
- Insert: Calculate the initial index. If the bucket is empty, insert; if occupied, search for an empty bucket using the probing sequence.
- Search: Follow the same probing sequence. If a never-used empty bucket is encountered, the key does not exist (a "deleted" marker does not stop the search).
- Delete: Cannot simply clear the bucket, as it would break the probing sequence. Typically, mark it as "deleted" (tombstone) for reuse during insertion.
Advantages:
- No extra data structures needed; good memory locality.
- Suitable for fixed data size and low load factor scenarios.
Disadvantages:
- The load factor (number of stored key-value pairs / total buckets) must be controlled (typically <0.7), or performance degrades.
- Deletion is complex.
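A linear-probing table with tombstone deletion can be sketched as follows (`OpenAddressHashMap` and the `_TOMBSTONE` sentinel are illustrative names; a real implementation would also resize when the load factor grows):

```python
_TOMBSTONE = object()  # sentinel marking a deleted slot

class OpenAddressHashMap:
    def __init__(self, size=11):
        self.slots = [None] * size  # None = never used; otherwise (key, value)

    def _probe(self, key):
        """Yield the linear probing sequence for key."""
        h = hash(key) % len(self.slots)
        for i in range(len(self.slots)):
            yield (h + i) % len(self.slots)

    def put(self, key, value):
        first_free = None  # first tombstone seen, reusable if key is absent
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is _TOMBSTONE:
                if first_free is None:
                    first_free = idx
            elif slot is None:  # end of chain: key is definitely absent
                target = first_free if first_free is not None else idx
                self.slots[target] = (key, value)
                return
            elif slot[0] == key:  # key already present: update in place
                self.slots[idx] = (key, value)
                return
        if first_free is not None:
            self.slots[first_free] = (key, value)
            return
        raise RuntimeError("table full")

    def get(self, key, default=None):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is None:  # never-used slot: key cannot be further along
                return default
            if slot is not _TOMBSTONE and slot[0] == key:
                return slot[1]
        return default

    def delete(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is None:
                return False
            if slot is not _TOMBSTONE and slot[0] == key:
                self.slots[idx] = _TOMBSTONE  # keep the probe chain intact
                return True
        return False
```

The tombstone is the crux: clearing a slot to `None` would make `get` stop early and "lose" any key stored further along the same probe chain.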
4. Load Factor and Resizing
Load Factor: Measures how full the hash table is. When the load factor exceeds a threshold (e.g., 0.75), collision probability increases, requiring dynamic resizing (rehashing):
- Create a larger new bucket array (usually doubled).
- Recalculate hash values for all keys and insert them into the new array.
- Resizing itself is costly, but its cost amortizes to O(1) per insertion.
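The resize-and-rehash steps above can be sketched for a chained table. The 0.75 threshold and doubling strategy follow the text; `ResizableHashMap` and `_resize` are illustrative names:

```python
class ResizableHashMap:
    """Separate-chaining table that doubles its bucket array when the
    load factor exceeds 0.75 (sketch for illustration)."""

    LOAD_FACTOR_LIMIT = 0.75

    def __init__(self, size=4):
        self.buckets = [[] for _ in range(size)]
        self.count = 0  # number of stored key-value pairs

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for pair in bucket:
            if pair[0] == key:
                pair[1] = value
                return
        bucket.append([key, value])
        self.count += 1
        if self.count / len(self.buckets) > self.LOAD_FACTOR_LIMIT:
            self._resize()

    def _resize(self):
        old = self.buckets
        self.buckets = [[] for _ in range(len(old) * 2)]  # double capacity
        for bucket in old:
            for key, value in bucket:  # rehash every key into the new array
                self.buckets[hash(key) % len(self.buckets)].append([key, value])

    def get(self, key, default=None):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return default
```

Note that every key must be rehashed, not just copied: the modulo changes with the new array size, so most keys move to different buckets.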
5. Summary
- Core of Hash Tables: Hash function + collision resolution strategy.
- Strategy Selection:
- Separate chaining is more general and suitable for scenarios with unknown data volume.
- Open addressing is memory-efficient and suitable for memory-constrained environments.
- Practical Applications: Java's HashMap uses separate chaining; Python's dict uses open addressing.
With a well-designed hash function and controlled load factor, hash tables can provide efficient operations in most scenarios.