Data Sharding Strategies in Distributed Systems

Problem Description
Data sharding is a key technique in distributed systems for horizontally partitioning large-scale datasets into smaller subsets (called shards), aiming to address single-node storage and performance bottlenecks. Interviews often require you to design a sharding scheme and solve core issues such as shard key selection, data balancing, and cross-shard queries. The challenges include how to avoid data skew, efficiently route requests, and handle dynamic shard scaling.

Solution Process

  1. Understanding the Essence and Goals of Sharding

    • Core Goal: Distribute data across multiple independent nodes to improve the system's storage capacity, read/write throughput, and scalability.
    • Key Trade-off: The finer the sharding granularity, the more balanced the load, but the higher the complexity of cross-shard operations. A balance must be struck between uniform distribution and minimizing cross-shard interactions.
    • Example: If a user table has 1 billion records, sharding by user ID across 100 nodes means each node stores only about 10 million records, significantly reducing single-point pressure.
  2. Shard Key Selection Strategies

    • Principles: The shard key should have high cardinality (diverse values), uniform access frequency, and avoid hotspots. Common fields include user ID, order ID, etc.
    • Pitfall: Sharding by "gender" (only 2-3 values) would severely limit the number of shards and cause significant data skew.
    • Composite Shard Key: For example, combining (user ID, order time)—sharding first by hashing user ID, then sorting by time within the same shard—balances distribution and range query efficiency.
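The composite-key idea above can be sketched as follows. The shard count, the use of MD5, and the sample data are illustrative assumptions, not a prescribed design:

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count for illustration

def shard_for(user_id: int) -> int:
    """Route by hashing the first component of the composite key (user ID)."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All of one user's orders land on the same shard; within that shard,
# rows can be kept sorted by order time to serve range queries locally.
orders = [(42, "2024-03-05"), (42, "2024-01-17"), (42, "2024-03-22")]
same_shard = {shard_for(uid) for uid, _ in orders}
assert len(same_shard) == 1
sorted_orders = sorted(orders, key=lambda row: row[1])
```

Because the user ID alone decides the shard, a query like "user 42's orders in March" touches exactly one shard and reads a contiguous, time-sorted range there.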
  3. Detailed Sharding Algorithms

    • Range Sharding
      • Principle: Partition by the value range of the shard key (e.g., user IDs 1-10 million to shard 1, 10-20 million to shard 2).
      • Advantages: Supports range queries (e.g., querying orders within a time period), with adjacent data physically clustered.
      • Disadvantages: Prone to hotspots (new data concentrated in the last shard), requiring dynamic adjustment of shard boundaries.
    • Hash Sharding
      • Principle: Compute a hash of the shard key (e.g., with MD5 or MurmurHash) and assign the record to a shard by taking the hash modulo the shard count.
      • Advantages: Even data distribution, avoiding hotspots.
      • Disadvantages: Cross-shard range queries require merging results from all shards, which is inefficient.
    • Consistent Hashing Optimization
      • Solves the massive data migration that plain hash-modulo sharding suffers when the node count changes: hash values are arranged on a ring, so adding or removing a node affects only the data adjacent to it on the ring.
      • Introduces virtual nodes: Maps each physical node to multiple virtual nodes for further load balancing.
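A minimal consistent-hash ring with virtual nodes might look like the sketch below. The vnode count and MD5 are illustrative choices; production systems typically use faster hashes and replicated metadata:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Illustrative consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (hash, physical node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        # Each physical node is mapped to `vnodes` points on the ring,
        # which smooths out the load across physical nodes.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key: str) -> str:
        # A key is owned by the first virtual node clockwise from its hash.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```

The key property: when a node is added, the only keys that move are those remapped to the new node; everything else stays put, unlike hash-modulo where most keys shift.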
  4. Shard Routing Mechanisms

    • Client-Side Routing: The client embeds a routing table (e.g., mapping shard keys to nodes) and directly accesses the target shard.
    • Proxy Layer Routing: Uses an independent proxy (e.g., ShardingSphere, Vitess) to parse SQL and forward requests.
    • Centralized Routing Table: Maintains shard metadata in a database, requiring consistency guarantees during node changes.
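Client-side routing can be sketched as a simple lookup against an embedded routing table. The table contents and node addresses below are hypothetical:

```python
# Hypothetical in-client routing table: shard-key ranges -> node addresses.
ROUTING_TABLE = [
    (0, 9_999_999, "db-node-1:5432"),
    (10_000_000, 19_999_999, "db-node-2:5432"),
    (20_000_000, 29_999_999, "db-node-3:5432"),
]

def route(user_id: int) -> str:
    """Pick the node whose key range covers the shard key."""
    for low, high, node in ROUTING_TABLE:
        if low <= user_id <= high:
            return node
    raise KeyError(f"no shard covers key {user_id}")
```

The trade-off versus proxy-layer routing: the client avoids an extra network hop, but every client must be updated (or re-fetch metadata) whenever the table changes.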
  5. Handling Cross-Shard Operations

    • Cross-Shard Queries:
      • Aggregate queries (e.g., SUM, AVG) are broadcast by a coordinator node to all shards, with results merged.
      • Paginated queries require local sorting on each shard followed by a global merge sort; note that deep pagination is expensive, because each shard must return all rows preceding the requested page before the coordinator can slice it.
    • Cross-Shard Transactions: Use distributed transaction protocols (e.g., 2PC, Saga) to ensure atomicity, but at the cost of performance.
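The scatter-gather pagination pattern above can be illustrated as follows; the per-shard result sets are invented for the example:

```python
import heapq

# Hypothetical per-shard results, each already sorted locally by order time.
shard_results = [
    [("2024-01-02", "o1"), ("2024-01-05", "o4")],  # shard 0
    [("2024-01-01", "o2"), ("2024-01-06", "o5")],  # shard 1
    [("2024-01-03", "o3")],                        # shard 2
]

def paginate(page: int, size: int):
    """Global merge sort over locally sorted shard results, then slice the page.
    For correctness each shard must supply at least page*size rows, which is
    exactly why deep pagination across shards gets expensive."""
    merged = heapq.merge(*shard_results)  # lazy k-way merge of sorted streams
    start = (page - 1) * size
    return list(merged)[start:start + size]
```

`heapq.merge` is a k-way merge that exploits the fact that each shard's results are already sorted, so the coordinator never re-sorts the full result set.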
  6. Dynamic Shard Management

    • Scaling Steps:
      1. Prepare a new node and add it to the cluster.
      2. Migrate some data to the new node (e.g., in consistent hashing, only adjacent data is migrated).
      3. Update routing metadata and switch traffic.
    • Rebalancing Strategies: Monitor shard size and QPS, automatically trigger data migration to avoid manual intervention.
    • Tool Support: E.g., MongoDB's Balancer, Elasticsearch's Shard Allocation.
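A toy version of the automated rebalancing trigger described above. The 1.5x-of-average threshold is an assumption for illustration; real balancers (e.g., MongoDB's) use their own heuristics:

```python
# Assumed policy: a shard should shed data when it exceeds 1.5x the mean size.
REBALANCE_THRESHOLD = 1.5

def shards_to_split(shard_sizes: dict) -> list:
    """Return the shards an automated monitor would flag for migration."""
    avg = sum(shard_sizes.values()) / len(shard_sizes)
    return [name for name, size in shard_sizes.items()
            if size > avg * REBALANCE_THRESHOLD]
```

In practice the monitor would combine size with QPS and trigger the three scaling steps above (prepare node, migrate, switch routing) without manual intervention.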
  7. Practical Case: E-commerce Order Table Sharding Design

    • Shard Key: Choose order ID (high cardinality) and user ID (frequently queried) as composite keys.
    • Sharding Algorithm: Hash by order ID (e.g., via consistent hashing) to keep writes evenly distributed; additionally maintain a redundant copy sharded by user ID to serve user-dimension queries.
    • Cross-Shard Queries: Queries for a user's order list are routed to a specific shard by user ID, avoiding full-table scans.

Summary
Data sharding requires systematic consideration of key selection, algorithms, routing, and operations. In interviews, demonstrate an understanding of the trade-offs between data distribution, query patterns, and scalability, and emphasize the importance of monitoring and automation in production environments.