Data Sharding and Consistent Hashing in Distributed Systems
Problem Description: In distributed storage systems, when the data volume is too large to be stored on a single node, we need to split the data into multiple fragments and distribute these fragments across different server nodes. This process is called data sharding. A core challenge is: How can we efficiently assign data to nodes and, when the system scale changes (e.g., nodes are added or removed), minimize the amount of data that needs to be moved? The consistent hashing algorithm is a classic solution designed to address this problem.
Knowledge Explanation:
- The Root of the Problem: The Shortcomings of Simple Hashing
- Basic Idea: The most intuitive sharding method is to use a hash function. For example, if we have 3 servers (nodes), we can compute the hash value of each data key (e.g., using hash(key) mod 3) and assign the data to the corresponding node based on the result.
- The Challenge: When the system needs to scale out (e.g., adding a server to make 4) or scale in (e.g., a server fails, leaving 2), the divisor in the modulo operation changes, so the result changes for the vast majority of keys (hash(key) mod 3 ≠ hash(key) mod 4). Consequently, data must be redistributed (migrated) on a large scale, which places significant stress on the network and servers and may cause service unavailability during the migration. The reallocation cost for simple hashing is O(n): almost all data needs to be moved. The sketch below illustrates this.
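A minimal Python sketch of modulo sharding (the key names, hash choice, and node counts are illustrative), showing roughly what fraction of keys change owner when a fourth node is added:

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Simple modulo sharding: hash the key and take it mod the node count."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = [f"user_{i}" for i in range(10_000)]

# Count how many keys land on a different node when we grow from 3 to 4 nodes.
moved = sum(node_for(k, 3) != node_for(k, 4) for k in keys)
print(f"{moved / len(keys):.0%} of keys must move")   # typically around 75%
```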
- The Core Idea of Consistent Hashing
- The design goal of consistent hashing is precisely to reduce the amount of data migration when the number of nodes changes. Its core innovation lies in mapping both data and nodes onto the same abstract hash ring.
- Hash Ring: Imagine a circular ring with a value range from 0 to a maximum value (e.g., 2^32 - 1), with the ends connected.
- Node Mapping: Compute the hash of each node's identifier (e.g., IP address or hostname) and map it to a position on the ring.
- Data Mapping: Compute the hash of each data key and also map it to a position on the ring.
- Data Ownership Rule: A piece of data belongs to the first node found by moving clockwise from its position on the ring.
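This ownership rule maps naturally onto a sorted list of node positions plus a binary search. Below is a minimal sketch; the hash function (MD5 truncated to 32 bits) and the class name are illustrative assumptions, not a prescribed implementation:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string onto the 0 .. 2**32 - 1 ring (MD5 truncated to 32 bits)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    def __init__(self, nodes):
        # Position -> node name, plus the sorted node positions on the ring.
        self.owner = {ring_hash(n): n for n in nodes}
        self.positions = sorted(self.owner)

    def node_for(self, key: str) -> str:
        """Return the first node at or clockwise of the key's position."""
        i = bisect.bisect_left(self.positions, ring_hash(key))
        return self.owner[self.positions[i % len(self.positions)]]

ring = HashRing(["Node-A", "Node-B", "Node-C"])
print(ring.node_for("user_123"))   # one of the three nodes, per the clockwise rule
```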
- How Consistent Hashing Works (Step-by-Step)
- Step 1: Construct the Hash Ring
Assume our hash space is from 0 to 2^32 - 1. We bend this interval to form a ring.
- Step 2: Place Nodes on the Ring
Assume we have three nodes: Node-A, Node-B, Node-C.
We calculate hash(Node-A); suppose it yields 100.
Calculating hash(Node-B) yields 1000.
Calculating hash(Node-C) yields 10000.
Now, these three nodes are placed on the ring at their respective hash values.
- Step 3: Determine the Storage Node for Data
Now there is a piece of data with the key "user_123". We calculate hash("user_123"); suppose it yields 5000.
Starting from position 5000, we move clockwise along the ring to find the first node encountered. On the ring, the node positions in increasing order are: ... -> 100 (Node-A) -> 1000 (Node-B) -> 10000 (Node-C) -> ... -> (max value) -> 0 -> ... -> 100 (Node-A) ...
Moving clockwise from 5000 (5000 -> 5001 -> ... -> 10000), the first node we reach is Node-C at 10000; Node-A (100) and Node-B (1000) lie in the counterclockwise direction from 5000, so we never pass them. Therefore, data "user_123" is stored on Node-C.
Another example: hash("user_456") = 50. Starting from 50, the first node found clockwise is Node-A at 100, so the data is stored on Node-A. A small code sketch of this lookup follows.
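A runnable sketch of this lookup using the example positions from above (100, 1000, and 10000 are the illustrative node hashes; bisect stands in for "walking clockwise"):

```python
import bisect

# Node positions on the ring, taken from the example above (illustrative values).
ring = {100: "Node-A", 1000: "Node-B", 10000: "Node-C"}
positions = sorted(ring)

def lookup(key_hash: int) -> str:
    """Return the first node at or clockwise of key_hash, wrapping past the end."""
    i = bisect.bisect_left(positions, key_hash)
    return ring[positions[i % len(positions)]]

print(lookup(5000))   # Node-C, the "user_123" example
print(lookup(50))     # Node-A, the "user_456" example
```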
- Step 4: Adding and Removing Nodes (Where the Advantage of Consistent Hashing Shines)
- Adding a Node: Suppose we add a new node Node-D with hash(Node-D) = 3000.
Now the node order on the ring becomes: Node-A (100), Node-B (1000), Node-D (3000), Node-C (10000).
Which data is affected? Only data whose hash values fall between 1000 and 3000. This data originally belonged to the next clockwise node, Node-C; it now needs to be reassigned to Node-D.
Key Point: Only the data in the small segment between the new node's position on the ring and its preceding (counterclockwise) node is affected. All other data remains untouched. The amount of data migration is reduced from O(n) to roughly O(n/k), where k is the total number of nodes, significantly reducing network overhead; see the sketch at the end of this step.
- Node Failure: Suppose Node-B fails. The data originally stored on Node-B (i.e., data with hash values between 100 and 1000) needs to find a new home. According to the rule, it is taken over by the next clockwise node, which is Node-D (3000).
Similarly, only the segment of data for which the failed node was responsible needs to be reassigned.
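A small sketch, reusing the illustrative positions above, that checks which key hashes change owner when Node-D joins at position 3000 (the sampling range of hash values is arbitrary):

```python
import bisect

def lookup(owner: dict, key_hash: int) -> str:
    """First node at or clockwise of key_hash on the ring, wrapping around."""
    positions = sorted(owner)
    i = bisect.bisect_left(positions, key_hash)
    return owner[positions[i % len(positions)]]

before = {100: "Node-A", 1000: "Node-B", 10000: "Node-C"}
after = {**before, 3000: "Node-D"}              # Node-D joins at position 3000

# Sample key hashes and see which ones are assigned to a different node.
moved = [h for h in range(16384) if lookup(before, h) != lookup(after, h)]
print(min(moved), max(moved))                   # 1001 3000: only (1000, 3000] moves
```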
- Optimization of Consistent Hashing: Virtual Nodes
- Existing Problem: Basic consistent hashing can lead to unbalanced data distribution. On one hand, if the number of nodes is small, their hashed positions on the ring might be very uneven, causing some nodes to be responsible for very large ring segments (storing more data) and others for very small ones. On the other hand, nodes themselves may have different hardware capabilities; we want more powerful nodes to handle more load.
- Solution: Introduce the concept of virtual nodes. A physical node no longer corresponds to a single point on the ring but to multiple virtual nodes. For example, Node-A corresponds to virtual nodes A-1, A-2, A-3; Node-B to B-1, B-2, B-3; Node-C to C-1, C-2, C-3. We map these virtual nodes onto the ring.
- How it Works: Data is first mapped to the ring via hashing, then assigned to a virtual node, and finally, through the mapping between virtual nodes and physical nodes, its final physical storage location is determined.
- Advantages:
- Load Balancing: Because each physical node owns many virtual nodes scattered around the ring, the segments it is responsible for average out, so data and requests are spread much more evenly across physical nodes. Likewise, more powerful physical nodes can be assigned more virtual nodes to take on a proportionally larger share of the load.
- Smooth Handling of Node Failures: When a physical node goes offline, the data for which its corresponding virtual nodes are responsible is evenly distributed among the remaining multiple physical nodes on the ring, avoiding concentrating all pressure onto the next node. When a new node comes online, it also receives data from multiple existing nodes, making the process smoother.
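A minimal sketch of a ring with virtual nodes (the "#i" suffix labeling, the replica count of 100, and the MD5-based hash are illustrative assumptions; real systems choose their own scheme):

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string onto the 0 .. 2**32 - 1 ring (MD5 truncated to 32 bits)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

class VirtualNodeRing:
    def __init__(self, nodes, replicas=100):
        # Each physical node appears on the ring `replicas` times, under
        # labels like "Node-A#0", "Node-A#1", ... (labeling is illustrative).
        self.owner = {}
        for node in nodes:
            for i in range(replicas):
                self.owner[ring_hash(f"{node}#{i}")] = node
        self.positions = sorted(self.owner)

    def node_for(self, key: str) -> str:
        """Hash the key, then take the first virtual node clockwise of it."""
        i = bisect.bisect_left(self.positions, ring_hash(key))
        return self.owner[self.positions[i % len(self.positions)]]

ring = VirtualNodeRing(["Node-A", "Node-B", "Node-C"])
print(ring.node_for("user_123"))   # resolves to whichever physical node owns that arc
```

With enough virtual nodes per physical node, each node's share of the ring tends toward an even split, and assigning a node more virtual nodes gives it a proportionally larger share.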
Summary:
Consistent hashing elegantly solves the problem of large-scale data migration during dynamic node changes in distributed systems by mapping both data and nodes onto the same hash ring. Its core advantage is that node changes only affect data in the adjacent segment of the ring, minimizing data migration. Furthermore, the virtual-node optimization addresses data skew and load balancing, making consistent hashing one of the core techniques for data sharding in modern distributed systems (e.g., Amazon Dynamo and Apache Cassandra; Redis Cluster uses a related fixed-hash-slot scheme).