Database Data Sharding and Routing Strategies

Description
Data sharding is a core technique in distributed databases: a large dataset is split into subsets (shards) according to a defined rule, and the shards are stored on different nodes. A routing strategy locates the shard that holds the requested data at query time. A well-chosen design improves both scalability and performance. The key steps are explained below.

1. Shard Key Selection

  • Purpose: The shard key is the basis for data partitioning (e.g., user ID, order time). Its selection must consider data distribution uniformity and query patterns.
  • Principles:
    • High Cardinality: The key should have many distinct values (e.g., a primary key) so that data can be spread evenly and skew is avoided.
    • Business Relevance: A field that appears in most query conditions (e.g., user ID) lets those queries be routed to a single shard, which reduces cross-shard queries.
    • Example: An e-commerce order table is sharded by user_id because queries are mostly user-based.
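
The cardinality principle can be checked on sample data before committing to a key. The following is a minimal sketch, assuming hash-modulo placement over 4 shards; the sample orders, the field names, and the helpers shard_of and skew_ratio are hypothetical and only serve the illustration.

```python
import hashlib
from collections import Counter

def shard_of(key, num_shards):
    """Hash-modulo placement with a stable hash (built-in hash() of str is salted per process)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def skew_ratio(keys, num_shards=4):
    """Rows on the fullest shard divided by the mean rows per shard; 1.0 is perfectly even."""
    counts = Counter(shard_of(k, num_shards) for k in keys)
    total = sum(counts.values())
    return max(counts.values()) / (total / num_shards)

# Hypothetical sample: 10,000 orders spread over 2,000 users and 3 status values.
STATUSES = ("pending", "paid", "shipped")
orders = [{"user_id": i % 2000, "status": STATUSES[i % 3]} for i in range(10_000)]

user_skew = skew_ratio(o["user_id"] for o in orders)
status_skew = skew_ratio(o["status"] for o in orders)
print(f"user_id skew: {user_skew:.2f}, status skew: {status_skew:.2f}")
```

With only three distinct status values, at least one of the four shards stays empty, so the status key is visibly more skewed than user_id.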

2. Sharding Algorithms

  • Range-Based Sharding:
    • Partition based on key value ranges (e.g., [1-1000] to shard A, [1001-2000] to shard B).
    • Advantage: Efficient for range queries (e.g., WHERE id BETWEEN 100 AND 500).
    • Disadvantage: May cause hotspots due to uneven data distribution (e.g., with a monotonically increasing key, new data concentrates on the last shard).
  • Hash-Based Sharding:
    • Distribute data evenly by applying a hash function and modulo operation on the shard key (e.g., hash(user_id) % 4).
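    • Advantage: Balanced data distribution, which reduces hotspots (a single very hot key can still overload its shard).
    • Disadvantage: Inefficient for range queries, as all shards may need to be scanned. Changing the shard count also remaps most keys.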
  • Consistent Hashing:
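    • Maps both nodes and keys onto a hash ring; each key belongs to the first node found clockwise from its position. Adding or removing a node only moves the keys in the ring segments adjacent to that node, minimizing migration.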
    • Application Scenario: Suitable for systems with frequent node changes (e.g., cloud databases).
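
The three algorithms above can be sketched side by side. This is an illustrative sketch, not a production implementation: the shard names, the node names, the choice of MD5 as a stable hash, and the 100 virtual nodes per physical node are all assumptions made for the example.

```python
import bisect
import hashlib

def stable_hash(key) -> int:
    """Stable across processes, unlike the built-in hash() for strings."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# Range-based: inclusive upper bound of each range, in ascending order.
RANGE_BOUNDS = [1000, 2000]
RANGE_SHARDS = ["A", "B"]

def range_shard(key: int) -> str:
    if key < 1 or key > RANGE_BOUNDS[-1]:
        raise KeyError(f"no range covers key {key}")
    return RANGE_SHARDS[bisect.bisect_left(RANGE_BOUNDS, key)]

# Hash-based: hash the key, then take it modulo the shard count.
def hash_shard(key, num_shards: int = 4) -> int:
    return stable_hash(key) % num_shards

class ConsistentHashRing:
    """Hash ring with virtual nodes (an assumed refinement that smooths the distribution)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (stable_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get_node(self, key):
        if not self._ring:
            raise LookupError("ring has no nodes")
        # (h,) sorts before every (h, node), so this finds the first position >= h.
        idx = bisect.bisect_left(self._ring, (stable_hash(key),))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring

# Compare how many keys move when a fourth node or shard is added.
keys = [f"user-{i}" for i in range(1000)]
ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
before = {k: ring.get_node(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.get_node(k) for k in keys}
ring_moved = [k for k in keys if before[k] != after[k]]
modulo_moved = [k for k in keys if hash_shard(k, 3) != hash_shard(k, 4)]
print(len(ring_moved), len(modulo_moved))
```

On the ring, every key that moves ends up on node-d and the rest stay put, whereas changing the modulus from 3 to 4 relocates most keys. That difference is the reason consistent hashing suits clusters whose membership changes often.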

3. Implementation of Routing Strategies

  • Client-Side Routing:
    • Calculates the shard location at the application layer (e.g., using a routing table or hash function).
    • Process: Application parses SQL → Extracts shard key → Calculates target shard → Sends query.
    • Advantage: Lightweight, avoids single-point bottlenecks.
  • Proxy Middleware Routing:
    • Forwards requests through an independent proxy (e.g., MyCat, ShardingSphere).
    • Process: Application sends query to proxy → Proxy parses and routes → Aggregates results and returns.
    • Advantage: Transparent to the application, supports advanced features (e.g., read/write separation).
  • Centralized Routing Table:
    • Stores shard mappings (e.g., shard key range → shard node) in a metadata server.
    • Process: The query first looks up the shard mapping in the metadata server, then is sent to the shard that the mapping identifies.
    • Note: The metadata server requires high availability to avoid single points of failure.
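
A client-side router backed by a routing table might look like the sketch below. It assumes a range-based table fetched from a metadata source; RouteEntry, ShardRouter, fetch_table_from_metadata, and the node addresses are hypothetical names, and the shard key is passed in directly rather than extracted by parsing SQL.

```python
import bisect
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass(frozen=True)
class RouteEntry:
    upper_bound: int  # inclusive upper bound of the shard key range
    node: str         # hypothetical node address

class ShardRouter:
    """Keeps a local copy of the routing table and resolves shard keys to nodes."""

    def __init__(self, fetch_table: Callable[[], List[RouteEntry]]):
        self._fetch_table = fetch_table
        self._entries: List[RouteEntry] = []
        self._bounds: List[int] = []
        self.refresh()

    def refresh(self) -> None:
        """Reload the table, e.g. after the metadata server reports a change."""
        self._entries = sorted(self._fetch_table(), key=lambda e: e.upper_bound)
        self._bounds = [e.upper_bound for e in self._entries]

    def route(self, shard_key: Optional[int]) -> List[str]:
        """Return the target nodes: one node for a known key, all nodes otherwise."""
        if shard_key is None:
            return [e.node for e in self._entries]  # no shard key: broadcast
        idx = bisect.bisect_left(self._bounds, shard_key)
        if idx == len(self._entries):
            raise KeyError(f"no shard covers key {shard_key}")
        return [self._entries[idx].node]

def fetch_table_from_metadata() -> List[RouteEntry]:
    # Stand-in for a call to the metadata server.
    return [RouteEntry(1000, "db-a:3306"), RouteEntry(2000, "db-b:3306")]

router = ShardRouter(fetch_table_from_metadata)
print(router.route(42))    # single-shard query
print(router.route(None))  # query without a shard key
```

Caching the table in the client keeps the metadata server off the hot path, at the cost of needing a refresh whenever the mapping changes.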

4. Cross-Shard Query Handling

  • Scenario: Queries that cannot be routed to a single shard by the shard key (e.g., filtering on a non-shard-key field).
  • Solutions:
    • Broadcast Query: Sends the request to all shards and merges the results (e.g., SELECT * FROM orders WHERE status='pending').
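    • Merge Sort: The proxy node collects sorted results from each shard, then merges them and applies pagination (e.g., a sorted query with LIMIT 20 needs the top 20 rows from every shard before the global top 20 can be chosen; with an offset, each shard must return offset + limit rows).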
    • Optimization: Create secondary index tables for frequent cross-shard queries (e.g., mapping status to the shard key).
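
The broadcast and merge steps can be sketched as a scatter-gather function. This is a simplified illustration: shards are modeled as in-memory lists, query_shard stands in for the SQL a real shard would execute, and the created_at sort column is an assumption.

```python
import heapq
from itertools import islice

def query_shard(rows, status, limit):
    """Stand-in for a filtered, sorted, limited query executed on one shard."""
    matching = sorted(
        (r for r in rows if r["status"] == status),
        key=lambda r: r["created_at"],
    )
    return matching[:limit]

def scatter_gather(shards, status, limit, offset=0):
    """Broadcast the query, then merge the sorted partial results globally."""
    # Each shard must return offset + limit rows, because any of them could
    # hold all of the rows that precede the requested page.
    per_shard_limit = offset + limit
    partials = [query_shard(rows, status, per_shard_limit) for rows in shards]
    merged = heapq.merge(*partials, key=lambda r: r["created_at"])
    return list(islice(merged, offset, offset + limit))

# Hypothetical data: three shards holding orders.
shards = [
    [{"id": 1, "status": "pending", "created_at": 5},
     {"id": 2, "status": "paid", "created_at": 1},
     {"id": 3, "status": "pending", "created_at": 9}],
    [{"id": 4, "status": "pending", "created_at": 2},
     {"id": 5, "status": "pending", "created_at": 7}],
    [{"id": 6, "status": "pending", "created_at": 3},
     {"id": 7, "status": "shipped", "created_at": 4}],
]

first_page = scatter_gather(shards, "pending", limit=2)
second_page = scatter_gather(shards, "pending", limit=2, offset=2)
print([r["id"] for r in first_page], [r["id"] for r in second_page])
```

Because every shard returns offset + limit rows, deep pagination grows expensive quickly, which is one reason a secondary index table is attractive for frequent cross-shard queries.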

5. Shard Expansion and Rebalancing

  • Trigger Conditions: Data growth or node failures necessitate data redistribution.
  • Steps:
    1. Create new shards and update routing rules.
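    2. Migrate data: Copy existing data to the new shards while writing new changes to both old and new shards (dual-write), so that incremental changes stay synchronized.
    3. Switch traffic: After verifying that old and new shards are consistent, route reads and writes to the new shards, then decommission the old ones.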
  • Challenge: Minimize service interruption during migration. Tools such as Vitess support online shard migration.
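
The migration steps can be sketched with in-memory dictionaries standing in for shards. This is a single-threaded illustration under stated assumptions: ShardMigration and the moves predicate (which decides whether a key belongs to the new shard) are hypothetical, deletes are not handled, and real systems such as Vitess do considerably more.

```python
class ShardMigration:
    """Moves the keys selected by `moves` from an old shard to a new one."""

    def __init__(self, old, new, moves):
        self.old, self.new, self.moves = old, new, moves
        self.cut_over = False

    def write(self, key, value):
        if self.moves(key):
            self.new[key] = value
            if not self.cut_over:
                self.old[key] = value  # dual-write while the migration is running
        else:
            self.old[key] = value

    def read(self, key):
        # Reads stay on the old shard until traffic has been switched.
        if self.moves(key) and self.cut_over:
            return self.new[key]
        return self.old[key]

    def backfill(self):
        """Copy existing rows; setdefault keeps values already placed by dual-writes."""
        for key, value in list(self.old.items()):
            if self.moves(key):
                self.new.setdefault(key, value)

    def is_consistent(self):
        return all(
            key in self.new and self.new[key] == value
            for key, value in self.old.items()
            if self.moves(key)
        )

    def switch(self):
        """Switch traffic after verification, then drop migrated rows from the old shard."""
        if not self.is_consistent():
            raise RuntimeError("old and new shards differ; refusing to switch")
        self.cut_over = True
        for key in [k for k in self.old if self.moves(k)]:
            del self.old[key]

old_shard = {1: "a", 2: "b", 3: "c", 4: "d"}
new_shard = {}
migration = ShardMigration(old_shard, new_shard, moves=lambda k: k % 2 == 0)

migration.write(2, "b2")  # arrives during the migration: written to both shards
migration.backfill()
migration.switch()
migration.write(6, "f")   # after the switch: written to the new shard only
print(old_shard, new_shard)
```

Verifying consistency before the switch is what allows the old data to be dropped safely; if the check fails, traffic simply stays on the old shard.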

Summary
Data sharding and routing are foundational to distributed systems. Shard keys and algorithms must be chosen based on business characteristics, and routing strategies should balance query efficiency with scalability. In practice, it is also essential to monitor data skew and design smooth expansion plans.