Database Sharding Strategy and Implementation

1. What is Database Sharding?

Database sharding is a technique that horizontally splits a large database into smaller, more manageable parts (called "shards"). Each shard independently stores a portion of the data, and all shards together constitute the complete logical database. The goal of sharding is to address the performance bottlenecks of a single-machine database when facing massive data volumes or high concurrent requests.


2. Why Is Sharding Needed?

  • Single Machine Capacity Limit: One machine's disk space, memory, or CPU cannot support massive data volumes.
  • Performance Pressure: High-concurrency reads and writes increase I/O load and lock contention.
  • Availability Requirements: The risk of a single point of failure must be mitigated through a distributed architecture.

Comparison with Database and Table Partitioning:

  • Splitting data across multiple databases and tables is one specific way of implementing sharding; the term "sharding" places more emphasis on how data is partitioned and how queries are routed.

3. Key Problems and Solutions in Sharding

Problem 1: How to Partition Data?

Common Sharding Strategies:

  1. Range-based Sharding

    • Partition data based on a continuous range of a specific field (e.g., user IDs 1-1,000,000 go to shard 1, 1,000,001-2,000,000 to shard 2).
    • Advantages: Efficient for range queries (e.g., BETWEEN 1 AND 50 only accesses one shard).
    • Disadvantages: May create data hotspots (e.g., new users concentrated in the last shard).
  2. Hash-based Sharding

    • Calculate a hash value for the shard key (e.g., user ID) and assign based on modulo operation (e.g., hash(user_id) % 4 determines the shard).
    • Advantages: Even data distribution, avoiding hotspots.
    • Disadvantages: Range queries must scan all shards, which is inefficient, and changing the number of shards remaps most keys.
  3. Consistent Hashing

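    • Addresses the large-scale data migration that modulo-based hashing causes when shards are added or removed. Keys and shards are mapped onto a hash ring, so a topology change only affects the keys adjacent to the added or removed shard on the ring.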
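
To make the trade-offs between the three strategies concrete, here is a minimal Python sketch. It is not taken from any particular library: stable_hash, range_shard, modulo_shard, ConsistentHashRing, moved_fraction, and the vnodes parameter are all illustrative names, and MD5 is used only as a convenient process-independent hash, not for security.

```python
import bisect
import hashlib


def stable_hash(key) -> int:
    """Process-independent hash: the first 8 bytes of MD5 as an integer."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def range_shard(user_id: int, upper_bounds: list) -> int:
    """Range-based: upper_bounds[i] is the largest ID stored on shard index i."""
    idx = bisect.bisect_left(upper_bounds, user_id)
    if idx == len(upper_bounds):
        raise ValueError(f"user_id {user_id} is beyond the last range")
    return idx


def modulo_shard(key, num_shards: int) -> int:
    """Hash-based: hash the key, then take it modulo the shard count."""
    return stable_hash(key) % num_shards


class ConsistentHashRing:
    """Consistent hashing with virtual nodes (vnodes is an assumed tuning knob)."""

    def __init__(self, shards, vnodes: int = 100):
        # Each shard is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (stable_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(before, after, keys) -> float:
    """Share of keys whose shard assignment differs between two routers."""
    keys = list(keys)
    return sum(before(k) != after(k) for k in keys) / len(keys)


if __name__ == "__main__":
    keys = range(10_000)
    # IDs 1,000,001-2,000,000 live on shard index 1 (the second shard).
    print("range:", range_shard(1_500_000, [1_000_000, 2_000_000]))
    print("modulo 4 -> 5 moved:",
          moved_fraction(lambda k: modulo_shard(k, 4),
                         lambda k: modulo_shard(k, 5), keys))
    old = ConsistentHashRing(["s0", "s1", "s2", "s3"])
    new = ConsistentHashRing(["s0", "s1", "s2", "s3", "s4"])
    print("ring 4 -> 5 moved:", moved_fraction(old.shard_for, new.shard_for, keys))
```

Going from 4 to 5 shards with plain modulo should relocate most keys, whereas the ring should relocate roughly the new shard's share, and every relocated key lands on the new shard.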

Problem 2: How to Route Queries?

Sharding Architecture Types:

  1. Client-side Sharding: The application layer directly calculates the target shard and connects to the corresponding database (e.g., ShardingSphere-JDBC).
  2. Proxy-based Sharding: Middleware (e.g., MyCat, ProxySQL) parses the SQL and routes each query to the appropriate shards, transparently to the application.
  3. Database-native Sharding: Databases like MongoDB and CockroachDB have built-in automatic sharding capabilities.
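
As a rough illustration of the client-side approach, the sketch below has the application compute the target shard and talk to that database directly. It does not use ShardingSphere-JDBC or any real middleware API: ShardedClient and its methods are hypothetical, and in-memory SQLite databases stand in for separate database servers.

```python
import hashlib
import sqlite3


class ShardedClient:
    """Client-side routing: the application picks the shard itself."""

    def __init__(self, num_shards: int):
        # Assumption: one in-memory SQLite database per shard, for demonstration only.
        self.shards = [sqlite3.connect(":memory:") for _ in range(num_shards)]
        for conn in self.shards:
            conn.execute(
                "CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT)"
            )

    def shard_index(self, user_id: int) -> int:
        # Stable hash of the shard key, modulo the number of shards.
        digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def insert_user(self, user_id: int, username: str) -> None:
        conn = self.shards[self.shard_index(user_id)]
        conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, username))
        conn.commit()

    def get_user(self, user_id: int):
        # A lookup by shard key touches exactly one shard.
        conn = self.shards[self.shard_index(user_id)]
        return conn.execute(
            "SELECT user_id, username FROM users WHERE user_id = ?", (user_id,)
        ).fetchone()


if __name__ == "__main__":
    client = ShardedClient(num_shards=4)
    client.insert_user(42, "alice")
    print(client.get_user(42))
```

A proxy-based setup would move the same routing decision out of the application and into the middleware, so the application would see a single connection.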

4. Implementation Steps (Using Hash-based Sharding as an Example)

Scenario: Split the users table into 4 shards based on user_id.

Step 1: Design the Shard Key

  • Choose a field that appears in most queries (e.g., user_id) as the shard key, so that those queries hit a single shard instead of fanning out across shards.

Step 2: Create Sharding Rules

-- Shard 1: hash(user_id) % 4 = 0  
-- Shard 2: hash(user_id) % 4 = 1  
-- ...  
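
The rule above could be expressed in application code roughly as follows. This is a sketch under assumptions: shard_number and distribution are illustrative names, and MD5 stands in for the unspecified hash function. A stable hash matters in Python because the built-in hash() is randomized per process for strings and bytes, which would route the same key differently after a restart.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # from the scenario: the users table is split into 4 shards


def shard_number(user_id: int) -> int:
    """Return the 1-based shard number: remainder 0 maps to Shard 1, and so on."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    remainder = int.from_bytes(digest[:8], "big") % NUM_SHARDS
    return remainder + 1


def distribution(user_ids) -> Counter:
    """Count how many of the given IDs land on each shard."""
    return Counter(shard_number(uid) for uid in user_ids)


if __name__ == "__main__":
    # With a well-mixed hash, each shard should receive roughly a quarter of the IDs.
    for shard, count in sorted(distribution(range(1, 10_001)).items()):
        print(f"Shard {shard}: {count} users")
```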

Step 3: Data Migration Plan

  • Dual Writes: Write every change to both the old database and the new shards, migrate historical data gradually in the background, and switch reads over once both sides are verified to match.
  • Downtime Migration: Suspend service, bulk export/import data, then switch.
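
The dual-write approach can be sketched in three phases: write to both sides, backfill history, verify before switching. The example below is a simplified model, not a production migration tool: DualWriteMigrator is a hypothetical class, plain dicts stand in for the legacy database and the new shards, and concurrency and failure handling are deliberately left out.

```python
import hashlib

NUM_SHARDS = 4


def shard_index(user_id: int) -> int:
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class DualWriteMigrator:
    """Dicts stand in for the legacy database and the new shards (assumption)."""

    def __init__(self, legacy: dict):
        self.legacy = legacy
        self.shards = [dict() for _ in range(NUM_SHARDS)]

    def write(self, user_id: int, row: dict) -> None:
        # Phase 1: every live write goes to both stores.
        self.legacy[user_id] = row
        self.shards[shard_index(user_id)][user_id] = row

    def backfill(self) -> int:
        # Phase 2: copy historical rows, skipping rows that a dual write
        # already delivered so newer data is never overwritten by older data.
        copied = 0
        for user_id, row in list(self.legacy.items()):
            target = self.shards[shard_index(user_id)]
            if user_id not in target:
                target[user_id] = row
                copied += 1
        return copied

    def verify(self) -> bool:
        # Phase 3: only switch reads once both sides hold identical data.
        merged = {}
        for shard in self.shards:
            merged.update(shard)
        return merged == self.legacy


if __name__ == "__main__":
    migrator = DualWriteMigrator({i: {"name": f"user{i}"} for i in range(100)})
    migrator.write(1000, {"name": "new-user"})
    print("consistent before backfill:", migrator.verify())
    print("rows backfilled:", migrator.backfill())
    print("consistent after backfill:", migrator.verify())
```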

Step 4: Handle Cross-Shard Operations

  • Cross-Shard Queries: Queries on non-shard keys (e.g., username) require aggregating results from all shards (often handled automatically by middleware).
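  • Distributed Transactions: Use the XA protocol (two-phase commit) for atomic commits across shards, or the Saga pattern (a sequence of local transactions with compensating actions) for eventual consistency.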
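
To illustrate the Saga pattern mentioned above, the following sketch runs a sequence of local transactions and, if one fails, undoes the completed ones in reverse order. All names here (run_saga, debit, credit, transfer) are illustrative, and dicts stand in for two shards. A real saga would also persist its progress so it can resume after a crash, which this sketch omits.

```python
from typing import Callable

# (name, action, compensation): an illustrative shape for one saga step
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list) -> bool:
    """Run each local transaction in order; on failure, compensate in reverse."""
    done = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            done.append(step)
        except Exception as exc:
            print(f"step {name!r} failed ({exc!r}); compensating")
            for _, _, compensate in reversed(done):
                compensate()
            return False
    return True


def debit(shard: dict, user: str, amount: int) -> None:
    if shard[user] < amount:
        raise ValueError("insufficient balance")
    shard[user] -= amount


def credit(shard: dict, user: str, amount: int) -> None:
    if user not in shard:
        raise KeyError(user)
    shard[user] += amount


def transfer(src_shard, src, dst_shard, dst, amount) -> bool:
    """Move a balance between accounts that live on different shards."""
    return run_saga([
        ("debit", lambda: debit(src_shard, src, amount),
                  lambda: credit(src_shard, src, amount)),
        ("credit", lambda: credit(dst_shard, dst, amount),
                   lambda: debit(dst_shard, dst, amount)),
    ])


if __name__ == "__main__":
    shard_a, shard_b = {"alice": 100}, {"bob": 50}
    print(transfer(shard_a, "alice", shard_b, "bob", 30), shard_a, shard_b)
    print(transfer(shard_a, "alice", shard_b, "carol", 30), shard_a, shard_b)
```

Unlike XA, intermediate states are visible to other readers while the saga runs (the debit is briefly applied before the failed credit is compensated), which is the price of avoiding a cross-shard lock.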

5. Challenges and Optimizations in Sharding

  1. Cross-Shard Joins
    • Avoid join queries, or solve them via denormalized fields or data assembly at the business layer.
  2. Globally Unique IDs
    • Database auto-increment IDs cannot be relied on in a sharded environment, because each shard would generate overlapping values independently. Use the Snowflake algorithm, UUIDs, or a similar scheme to generate IDs.
  3. Scaling and Rebalancing
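    • Pre-plan the number of shards (e.g., a power of 2) to reduce data migration during scaling. With modulo routing, doubling from N to 2N shards means each key either stays where it is or moves to shard index + N, so each existing shard simply splits in two.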
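
For the globally unique ID challenge, a Snowflake-style generator packs a millisecond timestamp, a worker ID, and a per-millisecond sequence into one 64-bit integer. The sketch below follows the commonly described layout of 41 timestamp bits, 10 worker bits, and 12 sequence bits. The class name, the custom epoch, and the choice to wait out clock regressions are assumptions made for illustration.

```python
import threading
import time


class SnowflakeGenerator:
    EPOCH_MS = 1_577_836_800_000  # 2020-01-01T00:00:00Z, an arbitrary custom epoch
    WORKER_BITS = 10
    SEQUENCE_BITS = 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id: int):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    @staticmethod
    def _now_ms() -> int:
        return time.time_ns() // 1_000_000

    def next_id(self) -> int:
        with self._lock:
            # Never let the logical clock run backwards, even if the wall clock does.
            now = max(self._now_ms(), self._last_ms)
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # All 4096 sequence values used this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - self.EPOCH_MS) << (self.WORKER_BITS + self.SEQUENCE_BITS))
                | (self.worker_id << self.SEQUENCE_BITS)
                | self._sequence
            )


if __name__ == "__main__":
    gen = SnowflakeGenerator(worker_id=3)
    first, second = gen.next_id(), gen.next_id()
    print(first, second, second > first)
```

Each application node or shard would get a distinct worker_id, so IDs never collide across nodes and remain roughly time-ordered, which UUIDv4 values are not.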

6. Practical Case: E-commerce Order Table Sharding

  • Shard Key: order_id (hash-based sharding).
  • Routing Logic: hash(order_id) % 8 maps to 8 shards.
  • Query Optimization: The order details page queries directly by order_id, accessing only one shard. The user order list page queries by user_id, which is not the shard key and would otherwise fan out to all 8 shards; serving it efficiently requires storing user_id as a redundant field or using a user-order mapping table.
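
The mapping-table option in this case can be sketched as follows. This is an illustration only: OrderStore and its methods are hypothetical, dicts stand in for the 8 order shards and for the user-order mapping table, and the two writes in create_order would not be atomic in a real system without extra coordination.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 8


def shard_index(order_id: int) -> int:
    digest = hashlib.md5(str(order_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


class OrderStore:
    def __init__(self):
        # Each shard maps order_id -> order row.
        self.shards = [dict() for _ in range(NUM_SHARDS)]
        # Mapping table: user_id -> list of order_ids (assumed to live in its own store).
        self.user_orders = defaultdict(list)

    def create_order(self, order_id: int, user_id: int, amount: float) -> None:
        row = {"order_id": order_id, "user_id": user_id, "amount": amount}
        self.shards[shard_index(order_id)][order_id] = row
        self.user_orders[user_id].append(order_id)

    def get_order(self, order_id: int):
        # Order details page: a single-shard lookup by shard key.
        return self.shards[shard_index(order_id)].get(order_id)

    def list_user_orders(self, user_id: int) -> list:
        # Order list page: resolve order IDs first, then visit only the
        # shards that actually hold this user's orders.
        by_shard = defaultdict(list)
        for order_id in self.user_orders.get(user_id, []):
            by_shard[shard_index(order_id)].append(order_id)
        orders = []
        for idx, order_ids in by_shard.items():
            orders.extend(self.shards[idx][oid] for oid in order_ids)
        return sorted(orders, key=lambda order: order["order_id"])


if __name__ == "__main__":
    store = OrderStore()
    for oid in range(1, 7):
        store.create_order(oid, user_id=oid % 2, amount=10.0 * oid)
    print(store.get_order(3))
    print([o["order_id"] for o in store.list_user_orders(1)])
```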

Through the above steps, sharding can effectively improve database scalability and performance, but it requires weighing those gains against the added business complexity and the costs of operating a distributed system.