Consistent Hashing Algorithm: Principles and Applications

Consistent Hashing Algorithm: Principles and Applications

Problem Description
Consistent hashing is a special hashing technique primarily used to address the issue in distributed caching systems where, when the number of server nodes changes (increase or decrease), the amount of data that needs to be remapped is minimized, thereby avoiding large-scale data migration problems.

Core Problem
In traditional hash tables (e.g., modulo operation on the number of servers), if the number of server nodes changes from N to N+1, almost all data needs to have its hash recalculated and be migrated to a new server. This leads to significant network overhead and service unavailability in distributed systems.

Solution Process

Step 1: Understand the Hash Ring Space

  1. First, imagine a ring, typically with a range from 0 to 2^32 - 1. This range is large enough to be considered a continuous circle.
  2. The consistent hashing algorithm also uses a hash function (e.g., MD5, SHA-1), but instead of directly mapping to a server, it first maps both server nodes and data onto this ring.

Step 2: Map Nodes to the Ring

  1. For each cache server (identified by, e.g., IP address or name), we use the hash function to calculate its hash value, then take modulo 2^32 to ensure the result falls on the ring [0, 2^32-1].
    • Assume we have three nodes: Node-A, Node-B, Node-C.
    • After calculation, hash(Node-A) % 2^32 -> point A on the ring.
    • Similarly, we get points B and C. These three points divide the ring into three arcs.

Step 3: Map Data to the Ring and Locate the Server

  1. For each piece of data to be cached (identified by a key key), we also calculate its hash value using the same hash function, take modulo 2^32 to determine its position on the ring, say point K.
  2. The storage rule for data is: starting from point K, search clockwise on the ring, and the first server node encountered is the node where this data should be stored.
    • For example, if point K lies between point A and point B (clockwise), then the data is stored on Node-B.
    • If point K lies between point C and point A (i.e., between the end and the beginning of the ring), then the data is stored on Node-A.

Step 4: Analyze Node Addition (Core Advantage)

  1. Now, suppose we add a new server, Node-D. Through hash calculation, it is mapped to point D on the ring. Assume point D is located between point A and point B.
  2. At this point, the ring is repartitioned. Only the small portion of data originally stored on Node-B, whose hash values lie between point A and point D, need to be migrated to the new Node-D.
  3. All other data on Node-B, and all data on Node-A and Node-C, require no changes.
  4. Conclusion: When adding a node, only a small adjacent segment of data on the ring is affected, achieving minimal data migration.

Step 5: Analyze Node Removal

  1. Similarly, if a server (e.g., Node-B) fails and is removed.
  2. Then, the data originally stored on Node-B now needs to be reassigned according to the rule (clockwise search for the first node) to Node-B's next node (assume it's Node-C).
  3. Only the data originally belonging to Node-B is affected and needs to be migrated to Node-C. Data on other nodes remains unchanged.

Step 6: Solving the Balance Problem – Virtual Nodes

  1. Problem Revealed: In the basic consistent hashing described above, if the number of nodes is small or the nodes are distributed unevenly on the ring after hashing, it's easy to cause data skew—where one node handles most of the data while others handle very little.
  2. Solution: Introduce the concept of "virtual nodes."
    • Virtual nodes are replicas of real nodes. One real node corresponds to multiple virtual nodes on the ring.
    • For example, create 1000 virtual nodes for Node-A: A-#1, A-#2, ..., A-#1000. Each virtual node has an independent hash value and is distributed at different positions on the ring. Do the same for Node-B and Node-C.
    • The data mapping rule remains the same: data is still mapped to the first virtual node encountered clockwise. However, this virtual node ultimately points to its corresponding real node (e.g., virtual node A-#550 belongs to real node Node-A).
  3. Benefits of Virtual Nodes:
    • Load Balancing: Since a real node is split into multiple virtual nodes scattered across the ring, data distribution becomes more uniform, effectively avoiding data skew.
    • Flexible Weighting: More virtual nodes can be assigned to servers with better performance, allowing them to handle more data and achieve weighted load distribution.

Summary
Consistent hashing cleverly addresses the problem of large-scale data migration caused by node changes in distributed environments by constructing a hash ring and mapping both data and nodes onto it. By introducing the virtual node mechanism, it further optimizes uniform data distribution and system load balancing capabilities. It is one of the core foundational algorithms for building large-scale distributed systems (such as Redis Cluster, Memcached clusters).