Fault Recovery and Fault Tolerance Mechanisms of Consistent Hashing


1. Problem Background

Consistent hashing is commonly used for load balancing in distributed systems such as cache clusters (e.g., Redis, Memcached) and distributed databases. In these systems, nodes (servers) may fail or go offline due to network or hardware issues, or be added for capacity expansion. When the node set changes, we want the system to migrate as little data as possible while remaining highly available.


2. Basics Review

  • Consistent hashing maps nodes and data onto the same hash ring (positions 0 to 2^m - 1), typically using virtual nodes for load balancing.
  • Data mapping rule: After hashing the data key, the first node found clockwise on the ring is the owning node.
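
As a concrete illustration of these two points, below is a minimal Python sketch of a hash ring with virtual nodes; the MD5-based hash, the virtual-node count, and all class and method names are illustrative choices rather than any particular library's API.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes=3):
        self.vnodes = vnodes      # virtual nodes per physical node
        self.ring = {}            # ring position -> physical node
        self.sorted_keys = []     # sorted ring positions
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        # Map any string to a position on the ring [0, 2^32).
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

    def add_node(self, node):
        # Insert one virtual node per replica index, e.g. "B#0", "B#1", "B#2".
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            self.ring[pos] = node
            bisect.insort(self.sorted_keys, pos)

    def remove_node(self, node):
        # Drop all virtual nodes of a failed or decommissioned physical node.
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            del self.ring[pos]
            self.sorted_keys.remove(pos)

    def get_node(self, key):
        # Owning node = first virtual node clockwise from the key's position.
        if not self.sorted_keys:
            return None
        idx = bisect.bisect_right(self.sorted_keys, self._hash(key))
        if idx == len(self.sorted_keys):   # wrap around past the end of the ring
            idx = 0
        return self.ring[self.sorted_keys[idx]]
```

For example, HashRing(["A", "B", "C"]).get_node("user:42") returns the physical node responsible for that key; the later sections reuse this sketch.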

3. Handling Node Failure

When a node fails, requests originally mapped to it will fail and need to be quickly transferred to other nodes.

Step 1: Failure Detection

  • Usually detected via a heartbeat mechanism (e.g., pinging every 5 seconds).
  • If several consecutive heartbeats time out, the node is considered failed and is removed from the ring.
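
As a rough sketch of this detection loop (the 5-second interval follows the example above, while the three-miss threshold and the ping/on_failure callables are hypothetical; real systems often use gossip- or lease-based membership instead):

```python
import time

HEARTBEAT_INTERVAL = 5   # seconds between pings
FAILURE_THRESHOLD = 3    # consecutive misses before declaring a node failed

def monitor(nodes, ping, on_failure):
    """nodes: mutable list of node ids; ping(node) -> bool says whether the node
    answered; on_failure(node) removes it from the ring (e.g. ring.remove_node)."""
    misses = {n: 0 for n in nodes}
    while True:
        for node in list(nodes):
            if ping(node):
                misses[node] = 0            # any successful heartbeat resets the count
            else:
                misses[node] += 1
                if misses[node] >= FAILURE_THRESHOLD:
                    on_failure(node)        # declare failure and take it off the ring
                    nodes.remove(node)
        time.sleep(HEARTBEAT_INTERVAL)
```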

Step 2: Data Remapping

  • Data originally owned by the failed node is automatically taken over by the next node clockwise.
  • However, note: The next node may become overloaded (especially without virtual nodes).

Step 3: Load Balancing Recovery

  • With virtual nodes, the failed node's virtual nodes are scattered around the ring, so its load is shared by multiple other physical nodes rather than falling entirely on a single successor.

Example

Suppose the ring has three physical nodes A, B, C, each with 3 virtual nodes (A1, A2, A3, ...), and data is evenly distributed.
If node B fails, its virtual nodes B1, B2, B3 are removed from the ring. Data originally mapping to B1 will now find C1 (a virtual node of C) clockwise, data for B2 may find A3, etc.
Thus, B's load is shared by A and C.
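
This behavior can be reproduced with the HashRing sketch from Section 2; the node and key names below are made up for illustration:

```python
ring = HashRing(["A", "B", "C"], vnodes=3)
keys = [f"user:{i}" for i in range(100)]
before = {k: ring.get_node(k) for k in keys}

ring.remove_node("B")                       # B's virtual nodes leave the ring
after = {k: ring.get_node(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Only keys that B owned change hands; keys owned by A or C keep their owner,
# and B's keys are split among the surviving virtual nodes that follow its
# old positions clockwise.
assert all(before[k] == "B" for k in moved)
```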


4. Handling New Node Addition

When adding a new node, only a portion of data needs to be migrated from adjacent nodes, leaving most data unaffected.

Step 1: Node Joining

  • The new node registers with a management service, generates a batch of virtual nodes, and inserts them at corresponding positions on the ring.

Step 2: Data Migration

  • The node immediately clockwise of each of the new node's virtual nodes holds data that previously mapped to it but now maps to the new node.
  • This portion of data needs to be migrated from the successor node to the new node. The migration process is usually performed online to avoid service interruption.
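
A sketch of how that subset of keys can be identified, assuming the HashRing class from Section 2 (the function name and arguments are illustrative):

```python
def keys_to_migrate(ring, stored_keys, new_node):
    """Return the keys whose owner becomes new_node once its virtual nodes are
    inserted into the ring; all other keys keep their current owner."""
    old_owner = {k: ring.get_node(k) for k in stored_keys}
    ring.add_node(new_node)
    return [k for k in stored_keys
            if ring.get_node(k) == new_node and old_owner[k] != new_node]
```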

Example

Originally, the ring has only nodes A and C. Under the clockwise rule, keys hashing into the arc (A, C] belong to C, and keys in (C, A] belong to A.
Now node B is added between A and C. Keys in (B, C] still belong to C, while keys in (A, B], which originally belonged to C, now map to B and must be migrated from C to B.


5. Fault Tolerance Mechanism: Replication

Consistent hashing is often combined with data replication to improve fault tolerance.

Replication Strategy

  • Each data object is stored on N consecutive nodes clockwise (including the primary node), where N is the replication factor (e.g., 3).
  • During writes, the data is written to all N nodes (synchronously or asynchronously, depending on the consistency requirements); during reads, it can be served by the primary or any replica.
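
A sketch of building this replica set (often called a preference list) by continuing clockwise from the key's position and skipping extra virtual nodes of already-chosen physical nodes, assuming the HashRing class from Section 2:

```python
import bisect

def preference_list(ring, key, n=3):
    """Return up to n distinct physical nodes for a key: its primary owner plus
    the next physical nodes found walking clockwise around the ring."""
    if not ring.sorted_keys:
        return []
    nodes = []
    start = bisect.bisect_right(ring.sorted_keys, ring._hash(key))
    for step in range(len(ring.sorted_keys)):
        pos = ring.sorted_keys[(start + step) % len(ring.sorted_keys)]
        node = ring.ring[pos]
        if node not in nodes:               # skip further virtual nodes of chosen hosts
            nodes.append(node)
            if len(nodes) == n:
                break
    return nodes
```

The first entry is the primary node from the data mapping rule; writes go to the whole list, and reads can be served by any of its members.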

Replication Mechanism During Node Failure

  • When the primary node fails, the next replica node can take over to continue service. Meanwhile, the system automatically rebuilds the replica on another node to maintain the replication factor N.

Example

Assume N=3. A data item X maps to node B (its primary replica) and is also stored on the next two nodes clockwise, C and D (replicas).
If B fails, C takes over as the primary replica to serve requests, and a third replica is rebuilt on another node E to maintain the replication count.
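
The takeover-and-rebuild step can be sketched as follows; the function, its arguments, and the node names are illustrative, and a real system would drive this from cluster metadata rather than a local helper:

```python
def handle_primary_failure(replicas, healthy_nodes, n=3):
    """replicas: ordered replica list for one data item, e.g. ["B", "C", "D"];
    healthy_nodes: nodes available for rebuilding a lost replica."""
    failed = replicas[0]
    survivors = [r for r in replicas if r != failed]   # next replica becomes primary
    if len(survivors) < n:
        candidates = [x for x in healthy_nodes
                      if x not in survivors and x != failed]
        if candidates:
            survivors.append(candidates[0])            # rebuild the missing copy elsewhere
    return survivors
```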


6. Fault Recovery and Data Rebalancing

When a failed node comes back online, or a new replacement node is added, data recovery needs to be considered.

Scenario 1: Failed Node Rejoins

  • The node rejoins the ring, but its data may have been taken over by other nodes during the failure period.
  • Data belonging to its positions on the ring must be migrated back from the nodes that currently hold it; only that portion should move (identified from recorded metadata or by recomputing ownership), not the holders' entire data sets.
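
A sketch of this scenario, assuming the HashRing class from Section 2 and a simple per-node key store (the stores mapping and the function name are illustrative):

```python
def rejoin(ring, stores, node):
    """stores: dict mapping node id -> set of keys that node currently holds."""
    ring.add_node(node)                        # re-insert the node's virtual nodes
    stores.setdefault(node, set())
    for other, keys in stores.items():
        if other == node:
            continue
        give_back = {k for k in keys if ring.get_node(k) == node}
        stores[node] |= give_back              # hand back only the rejoined node's portion
        keys -= give_back                      # everything else stays where it is
```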

Scenario 2: New Node Replaces Failed Node

  • The system replaces the old node with a new one, which has new virtual node positions.
  • Data originally belonging to this logical position needs to be migrated back from other nodes while maintaining load balance.

7. Implementation Considerations

  • Metadata Management: A central coordinator (e.g., ZooKeeper, etcd) or a P2P-based protocol is needed to maintain the node list and virtual node mappings.
  • Concurrent Migration: Data migration should not interrupt service. Data is commonly migrated in batches, with requests during the transition handled correctly against both the old and the new copies of the data.
  • Consistency Guarantee: In a distributed environment, clients must be able to perceive the new node mappings during node changes (e.g., through version numbers or heartbeat updates to routing tables).
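
One simple form of the version-number idea is a routing table carrying a monotonically increasing version, so clients or proxies can detect stale mappings and refetch; the class and field names here are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class RoutingTable:
    version: int = 0                                 # bumped on every membership change
    vnode_map: dict = field(default_factory=dict)    # ring position -> physical node

    def update(self, vnode_map):
        # Publishing a node addition/removal under a higher version lets clients
        # compare versions and discard any routing state that has gone stale.
        self.vnode_map = dict(vnode_map)
        self.version += 1
```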

8. Summary

  • Consistent hashing handles node failures via automatic takeover by adjacent nodes on the ring, combined with virtual nodes to distribute the load.
  • Replication mechanisms improve availability, allowing failover to replicas.
  • During recovery, data rebalancing is required, but migration only involves a small portion of data related to the changed node and its successor, demonstrating consistent hashing's advantage of low migration cost.

Ultimate Goal: When nodes change dynamically, the system can automatically adjust and recover quickly, maintaining high performance and high availability, transparent to upper-layer services.