Database Sharding Strategy and Implementation

1. What is Database Sharding?

Database sharding is a technique that horizontally splits a large database into smaller, more manageable parts (called "shards"). Each shard independently stores a portion of the data, and all shards together constitute the complete logical database. The goal of sharding is to address the performance bottlenecks of a single-machine database when facing massive data volumes or high concurrent requests.

2. Why Is Sharding Needed?

Single Machine Capacity Limit: Disk space, memory, or CPU cannot support massive data.
Performance Pressure: High-concurrency reads and writes lead to increased I/O or lock contention.
Availability Requirements: A single point of failure must be mitigated through a distributed architecture, so that losing one node affects only part of the data.

Comparison with Database and Table Partitioning:

Splitting data across multiple databases and tables is one concrete way to implement sharding; the term "sharding" itself places more emphasis on how data is partitioned and how queries are routed.

3. Key Problems and Solutions in Sharding

Problem 1: How to Partition Data?

Common Sharding Strategies:

Range-based Sharding
- Partition data based on a continuous range of a specific field (e.g., user IDs 1-1,000,000 go to shard 1, 1,000,001-2,000,000 to shard 2).
- Advantages: Efficient for range queries (e.g., BETWEEN 1 AND 50 only accesses one shard).
- Disadvantages: May create data hotspots (e.g., new users concentrated in the last shard).
Hash-based Sharding
- Calculate a hash value for the shard key (e.g., user ID) and assign based on modulo operation (e.g., hash(user_id) % 4 determines the shard).
- Advantages: Even data distribution, avoiding hotspots.
- Disadvantages: Range queries require scanning all shards, leading to low efficiency.
Consistent Hashing
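- Addresses the massive data migration that modulo hashing causes during scaling (adding or removing shards). Shards and keys are placed on a hash ring, and each key belongs to the next shard clockwise from its position, so adding or removing a shard only affects the keys adjacent to that shard on the ring.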
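To make the difference between modulo hashing and consistent hashing concrete, the following is a minimal sketch rather than a production implementation. The names HashRing, stable_hash, and moved_fraction are hypothetical, and the use of 100 virtual nodes per shard is an assumption added here to smooth the distribution; the text above does not prescribe it.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted for str)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent-hash ring; each shard is placed at several virtual points."""

    def __init__(self, shards, vnodes=100):
        self.vnodes = vnodes   # assumption: 100 virtual nodes per shard
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> shard name
        for shard in shards:
            self.add_shard(shard)

    def add_shard(self, shard):
        for i in range(self.vnodes):
            point = stable_hash(f"{shard}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = shard

    def get_shard(self, key):
        # First ring point clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._points, stable_hash(str(key)))
        return self._owner[self._points[idx % len(self._points)]]


def moved_fraction(before, after):
    """Fraction of keys whose shard assignment changed."""
    changed = sum(1 for k in before if before[k] != after[k])
    return changed / len(before)


keys = range(10_000)

# Modulo hashing: going from 4 to 5 shards remaps most keys.
mod_before = {k: stable_hash(str(k)) % 4 for k in keys}
mod_after = {k: stable_hash(str(k)) % 5 for k in keys}

# Consistent hashing: only keys claimed by the new shard move.
ring = HashRing(["shard-0", "shard-1", "shard-2", "shard-3"])
ring_before = {k: ring.get_shard(k) for k in keys}
ring.add_shard("shard-4")
ring_after = {k: ring.get_shard(k) for k in keys}

print(f"modulo moved: {moved_fraction(mod_before, mod_after):.0%}")
print(f"ring moved:   {moved_fraction(ring_before, ring_after):.0%}")
```

With modulo hashing, roughly four out of five keys change shards in this experiment, whereas on the ring only about one fifth move, and every moved key lands on the newly added shard.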

Problem 2: How to Route Queries?

Sharding Architecture Types:

Client-side Sharding: The application layer directly calculates the target shard and connects to the corresponding database (e.g., ShardingSphere-JDBC).
Proxy-based Sharding: Middleware (e.g., MyCat, ProxySQL) parses the SQL and routes each query to the appropriate shard, which keeps sharding transparent to the application.
Database-native Sharding: Databases like MongoDB and CockroachDB have built-in automatic sharding capabilities.
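As an illustration of the client-side approach, here is a minimal sketch in which the application computes the target shard itself. It does not use ShardingSphere-JDBC or any real middleware: ShardRouter is a hypothetical class, in-memory SQLite databases stand in for separate database servers, and CRC32 is an assumed choice of hash function.

```python
import sqlite3
import zlib


class ShardRouter:
    """Client-side sharding: the application picks the shard itself."""

    def __init__(self, shard_count):
        # In-memory SQLite databases stand in for separate database servers.
        self.shards = [sqlite3.connect(":memory:") for _ in range(shard_count)]
        for conn in self.shards:
            conn.execute(
                "CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT)"
            )

    def shard_for(self, user_id):
        # CRC32 gives the same result in every process, so all clients agree.
        index = zlib.crc32(str(user_id).encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def insert_user(self, user_id, username):
        conn = self.shard_for(user_id)
        conn.execute(
            "INSERT INTO users (user_id, username) VALUES (?, ?)",
            (user_id, username),
        )
        conn.commit()

    def get_user(self, user_id):
        cursor = self.shard_for(user_id).execute(
            "SELECT user_id, username FROM users WHERE user_id = ?", (user_id,)
        )
        return cursor.fetchone()


router = ShardRouter(shard_count=4)
router.insert_user(42, "alice")
print(router.get_user(42))  # (42, 'alice')
```

A proxy-based setup would move the same shard_for logic out of the application and into the middleware, so the application would see a single connection.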

4. Implementation Steps (Using Hash-based Sharding as an Example)

Scenario: Split the users table into 4 shards based on user_id.

Step 1: Design the Shard Key

Choose a frequently queried field as the shard key (e.g., user_id) so that most queries can be answered by a single shard, avoiding cross-shard queries.

Step 2: Create Sharding Rules

-- Shard 1: hash(user_id) % 4 = 0  
-- Shard 2: hash(user_id) % 4 = 1  
-- ...
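The rules above could be expressed in application code roughly as follows. This is a sketch under assumptions: shard_for is a hypothetical helper, and CRC32 is used as the hash function only because it is deterministic across processes (Python's built-in hash() is salted for strings).

```python
import zlib
from collections import Counter

SHARD_COUNT = 4


def shard_for(user_id: int) -> int:
    """Return the shard number (1..4) for a user_id, following the rules above."""
    bucket = zlib.crc32(str(user_id).encode("utf-8")) % SHARD_COUNT
    return bucket + 1  # bucket 0 -> Shard 1, bucket 1 -> Shard 2, and so on


# Rough check that sequential IDs are spread across all four shards.
distribution = Counter(shard_for(uid) for uid in range(1, 10_001))
print(sorted(distribution.items()))
```

Printing the distribution before a migration is a cheap way to confirm that the chosen hash function does not skew the real key population toward one shard.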

Step 3: Data Migration Plan

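Dual Writes: Write every change to both the old database and the new shards while historical data is backfilled in the background; switch reads to the shards once both sides are verified to be consistent. This avoids downtime at the cost of extra application logic.
Downtime Migration: Suspend the service, bulk export and import the data, then switch over to the shards. This is simpler but requires a maintenance window.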
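The dual-write approach can be sketched as follows. This is an illustrative outline, not a complete migration tool: DualWriteStore is a hypothetical class, plain dictionaries stand in for the old database and the new shards, and user_id % N is used for routing to keep the example short. A real system would also have to handle a write that succeeds on one side and fails on the other.

```python
class DualWriteStore:
    """Writes go to both stores; reads stay on the old store until cutover."""

    def __init__(self, old_store, new_shards):
        self.old_store = old_store    # dict standing in for the unsharded database
        self.new_shards = new_shards  # list of dicts standing in for the shards
        self.read_from_new = False

    def _shard(self, user_id):
        return self.new_shards[user_id % len(self.new_shards)]

    def write(self, user_id, row):
        self.old_store[user_id] = row
        self._shard(user_id)[user_id] = row

    def read(self, user_id):
        if self.read_from_new:
            return self._shard(user_id).get(user_id)
        return self.old_store.get(user_id)

    def backfill(self):
        """Copy rows that predate dual writes without overwriting newer data."""
        for user_id, row in self.old_store.items():
            self._shard(user_id).setdefault(user_id, row)

    def verify(self):
        """True when every old row exists unchanged in its shard."""
        return all(
            self._shard(uid).get(uid) == row for uid, row in self.old_store.items()
        )


old = {1: "alice", 2: "bob"}            # historical data
store = DualWriteStore(old, [{}, {}, {}, {}])
store.write(3, "carol")                 # a new write lands in both places
store.backfill()                        # migrate historical rows
if store.verify():
    store.read_from_new = True          # cut reads over to the shards
print(store.read(1))  # alice
```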

Step 4: Handle Cross-Shard Operations

Cross-Shard Queries: Queries on non-shard keys (e.g., username) require aggregating results from all shards (often handled automatically by middleware).
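Distributed Transactions: Use the XA protocol (two-phase commit) when changes on several shards must commit atomically, or the Saga pattern when eventual consistency through compensating actions is acceptable.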
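A cross-shard query on a non-shard key is essentially a scatter-gather operation. The sketch below shows the idea under simplifying assumptions: lists of dictionaries stand in for the shards, and query_shard and find_by_username are hypothetical helpers that a middleware layer would normally provide.

```python
from concurrent.futures import ThreadPoolExecutor

# Four lists of rows stand in for four shards, partitioned by user_id % 4.
shards = [[] for _ in range(4)]
sample_users = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "alice"), (5, "dave")]
for user_id, username in sample_users:
    shards[user_id % 4].append({"user_id": user_id, "username": username})


def query_shard(shard, username):
    """Per-shard lookup; in a real system this would be a SQL query."""
    return [row for row in shard if row["username"] == username]


def find_by_username(username):
    """Scatter the query to every shard, then gather and sort the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: query_shard(shard, username), shards))
    merged = [row for partial in partials for row in partial]
    return sorted(merged, key=lambda row: row["user_id"])


print(find_by_username("alice"))
```

Because every shard is queried, the cost grows with the number of shards, which is why the shard key should match the most common access pattern.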

5. Challenges and Optimizations in Sharding

Cross-Shard Joins
- Avoid cross-shard joins where possible; otherwise denormalize the required fields or assemble the data in the business layer.
Globally Unique IDs
- Database auto-increment IDs cannot be relied on in a sharded environment, because each shard counts independently and the values collide. Generate IDs with the Snowflake algorithm, UUIDs, or a similar scheme instead.
Scaling and Rebalancing
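- Pre-plan the number of shards (e.g., a power of 2) to reduce data migration during scaling. When a modulo-based shard count doubles from N to 2N, each key either stays on its current shard or moves to exactly one new shard, so every existing shard simply splits in two.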
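For the globally unique ID problem above, a Snowflake-style generator can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes the commonly cited layout of a millisecond timestamp, a 10-bit worker ID, and a 12-bit sequence, and EPOCH_MS is an arbitrary custom epoch (2020-01-01 UTC) chosen for this example. SnowflakeGenerator is a hypothetical class name.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # assumption: custom epoch, 2020-01-01 00:00:00 UTC
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


def _now_ms():
    return time.time_ns() // 1_000_000


class SnowflakeGenerator:
    """Generates 64-bit IDs: timestamp | worker ID | per-millisecond sequence."""

    def __init__(self, worker_id):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = _now_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted in this millisecond; wait for the next.
                    while now <= self._last_ms:
                        now = _now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp_part = (now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS)
            worker_part = self.worker_id << SEQUENCE_BITS
            return timestamp_part | worker_part | self._sequence


generator = SnowflakeGenerator(worker_id=1)
ids = [generator.next_id() for _ in range(5)]
print(ids)
```

Each shard or application node would be given its own worker_id, so IDs generated on different nodes never collide, and IDs from one generator increase over time.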

6. Practical Case: E-commerce Order Table Sharding

Shard Key: order_id (hash-based sharding).
Routing Logic: hash(order_id) % 8 maps to 8 shards.
Query Optimization: The order details page queries directly by order_id, accessing only one shard. The user order list page queries by user_id, which is not the shard key, so without further design it would have to scan all 8 shards; this requires storing user_id as a redundant field or using a user-order mapping table.
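The mapping-table option for this case might look like the sketch below. It is an illustration under assumptions: dictionaries stand in for the 8 order shards and for the user-order mapping table, CRC32 is an assumed hash function, and the function names are hypothetical. Where the mapping table itself is stored (a single database or shards keyed by user_id) is left open here.

```python
import zlib
from collections import defaultdict

SHARD_COUNT = 8
order_shards = [dict() for _ in range(SHARD_COUNT)]  # order_id -> order row
user_order_map = defaultdict(list)                   # user_id -> [order_id, ...]


def shard_index(order_id):
    """hash(order_id) % 8, using CRC32 as a deterministic hash."""
    return zlib.crc32(str(order_id).encode("utf-8")) % SHARD_COUNT


def create_order(order_id, user_id, amount):
    order_shards[shard_index(order_id)][order_id] = {
        "order_id": order_id,
        "user_id": user_id,
        "amount": amount,
    }
    user_order_map[user_id].append(order_id)  # keep the mapping table in sync


def get_order(order_id):
    """Order details page: a single-shard lookup."""
    return order_shards[shard_index(order_id)].get(order_id)


def list_orders(user_id):
    """User order list page: resolve order IDs via the mapping, then fetch each."""
    return [get_order(order_id) for order_id in user_order_map.get(user_id, [])]


create_order(1001, user_id=7, amount=25.0)
create_order(1002, user_id=7, amount=40.0)
create_order(1003, user_id=8, amount=15.0)
print([order["order_id"] for order in list_orders(7)])  # [1001, 1002]
```

The list page now touches only the shards that actually hold the user's orders, at the cost of one extra write per order to maintain the mapping.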

With the steps above, sharding can substantially improve database scalability and performance, but the gains must be weighed against the added business complexity and the operational cost of running a distributed system.