Specific Implementation of Hash Tables in C++ (unordered_map)

The primary implementation of hash tables in the C++ Standard Library is std::unordered_map. It is an associative container that stores key-value pairs and provides average constant-time complexity for lookup, insertion, and deletion operations.

1. Basic Structure and Template Parameters

std::unordered_map is a class template, declared roughly as follows:

template<
    class Key,
    class T,
    class Hash = std::hash<Key>,
    class KeyEqual = std::equal_to<Key>,
    class Allocator = std::allocator<std::pair<const Key, T>>
> class unordered_map;
  • Key: The type of the key.
  • T: The type of the value.
  • Hash: The type of the hash function object used to compute the hash value of a key. Defaults to std::hash<Key>, for which the standard library provides specializations for fundamental types (like int) and common library types (like std::string).
  • KeyEqual: The type of the key equality comparison function object, used to compare keys for equality when hash collisions occur. Defaults to std::equal_to<Key> (i.e., using operator==).
  • Allocator: The memory allocator.

3. Internal Data Structure: Bucket Array and Linked List

The typical implementation of unordered_map combines a dynamic array (bucket array) and singly linked lists.

  • Bucket Array: A dynamically allocated array of pointers. Each array element is called a "bucket". In some implementations (such as GCC's libstdc++) the bucket count is kept prime, which helps distribute hash values more uniformly; others use different growth policies, such as powers of two.
  • Linked List Nodes: Each bucket does not directly store key-value pairs but points to the head node of a linked list. This list stores all key-value pairs that hash to the same bucket. In common C++ standard library implementations (like GCC's libstdc++ and LLVM's libc++), this list is a singly linked list.

A simplified node structure might look like this:

struct node {
    std::pair<const Key, T> data; // Stores the key-value pair, the key is const
    node* next;                   // Pointer to the next node
    // ... Possibly other implementation-specific metadata
};

The overall structure can be visualized as an "array + linked list" form:

Bucket Array: [0] -> nullptr
              [1] -> node{K1,V1} -> node{K2,V2} -> nullptr
              [2] -> nullptr
              ...
              [n] -> node{K3,V3} -> nullptr

4. Core Operation Processes

Insertion Operation (insert/emplace/operator[])

  1. Calculate Hash Value: For a given key key, use the hash function object Hash to compute its hash value: hash_value = Hash{}(key).
  2. Map to Bucket Index: Map the hash value to an index in the bucket array via modulo operation (or other compression function): bucket_index = hash_value % bucket_count. This index determines which bucket's linked list the key-value pair should be placed into.
  3. Handle Collisions and Check for Duplicate Keys:
    • Traverse the linked list corresponding to the calculated bucket index.
    • For each node in the list, use the KeyEqual comparison function to check if the node's key is equal to the key to be inserted.
    • If an equal key is found: The operation concludes according to the member used (insert and emplace leave the existing element unchanged and report failure, while operator[] overwrites its mapped value).
    • If no equal key is found: Create a new node to store the key-value pair, then insert this new node into the head (or tail) of the linked list (head insertion is more common for speed). This is a constant-time operation.

Lookup Operation (find)

  1. Calculate Hash Value and Bucket Index: Exactly the same as the first two steps of the insertion operation.
  2. Traverse the Linked List: Traverse the linked list corresponding to the calculated bucket index, comparing keys for equality node by node.
  3. Return Result: If an equal key is found, return an iterator pointing to that key-value pair; if the entire list is traversed without finding it, return the end() iterator.

Deletion Operation (erase)

  1. Locate the Node: First perform steps similar to lookup to find the node to be deleted and its predecessor node (because it's a singly linked list, the predecessor is needed to modify the next pointer).
  2. Modify the Linked List: Set the predecessor node's next pointer to point to the node after the one to be deleted.
  3. Free Memory: Delete the node and free the memory it occupied.

5. Dynamic Resizing (Rehashing)

To maintain operational efficiency, unordered_map needs to resize (rehash) when the number of elements becomes too large, causing linked lists to become too long.

  • Trigger Condition: Automatic resizing is triggered when the load factor (size() / bucket_count(), the number of elements divided by the number of buckets) would exceed max_load_factor() (1.0 by default). Resizing can also be requested manually via rehash() or reserve().
  • Resizing Process:
    1. Allocate a new, larger bucket array (the new size is typically a prime number larger than the current bucket count).
    2. Traverse all buckets in the old bucket array and all nodes on the linked lists attached to each bucket.
    3. For each key-value pair in each node, recalculate its bucket index based on the new bucket array size (new_index = hash_value % new_bucket_count).
    4. Remove each node from the old list and insert it into the new position in the linked list of the corresponding bucket in the new bucket array.
  • Performance Impact: Resizing is an O(n) operation (where n is the number of elements), so a single insertion operation can be O(n) in the worst case. However, using amortized analysis, the average time complexity over many insertions remains O(1).

6. Iterators

The iterator of unordered_map must be able to traverse all elements. Conceptually, it maintains two pieces of state:

  • A pointer to the current node (within some linked list).
  • The current bucket's position in the bucket array.

When iteration exhausts the current linked list, the iterator moves on to the head of the next non-empty bucket's list. The traversal order of an unordered_map is therefore unspecified: it depends on the distribution of key hash values and on the bucket array size.

Summary
std::unordered_map in C++ resolves hash collisions via a "bucket array + singly linked list" structure, providing efficient key-value pair storage. Its core principle is quickly locating a bucket via the hash function and then performing precise operations within a short list. The dynamic resizing mechanism ensures stable performance. Understanding its internal structure helps in using this container more effectively in C++.