Database Sharding Strategy and Implementation
1. What is Database Sharding?
Database sharding is a technique that horizontally splits a large database into smaller, more manageable parts (called "shards"). Each shard independently stores a portion of the data, and all shards together constitute the complete logical database. The goal of sharding is to address the performance bottlenecks of a single-machine database when facing massive data volumes or high concurrent requests.
2. Why Is Sharding Needed?
- Single Machine Capacity Limit: Disk space, memory, or CPU cannot support massive data.
- Performance Pressure: High-concurrency reads and writes lead to increased I/O or lock contention.
- Availability Requirements: The risk of single point of failure needs to be mitigated through a distributed architecture.
Comparison with Database and Table Partitioning:
- Database and table partitioning are specific implementations of sharding; sharding places more emphasis on "how to partition data and route queries."
3. Key Problems and Solutions in Sharding
Problem 1: How to Partition Data?
Common Sharding Strategies:
- Range-based Sharding
  - Partition data based on a continuous range of a specific field (e.g., user IDs 1-1,000,000 go to shard 1, 1,000,001-2,000,000 to shard 2).
  - Advantages: Efficient for range queries (e.g., BETWEEN 1 AND 50 only accesses one shard).
  - Disadvantages: May create data hotspots (e.g., new users concentrated in the last shard).
- Hash-based Sharding
  - Calculate a hash value for the shard key (e.g., user ID) and assign based on modulo operation (e.g., hash(user_id) % 4 determines the shard).
  - Advantages: Even data distribution, avoiding hotspots.
  - Disadvantages: Range queries require scanning all shards, leading to low efficiency.
- Consistent Hashing
  - Addresses the issue of massive data migration during scaling (adding/removing shards): shards and keys are placed on a hash ring, so adding or removing a shard only affects data in the adjacent shards.
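The three strategies above can be compared in a small sketch. Everything here is illustrative rather than taken from a specific product: the names stable_hash, range_shard, modulo_shard and ConsistentHashRing, the range boundaries, the shard labels s0 to s4, and the choice of MD5 with 100 virtual nodes per shard are all assumptions made for the example.

```python
import bisect
import hashlib


def stable_hash(key) -> int:
    """64-bit hash that is identical across processes (built-in hash() is salted for str)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Range-based: shard i holds every ID up to and including RANGE_UPPER_BOUNDS[i].
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000, 4_000_000]


def range_shard(user_id: int) -> int:
    idx = bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id)
    if idx == len(RANGE_UPPER_BOUNDS):
        raise ValueError(f"user_id {user_id} is above every configured range")
    return idx  # zero-based, so index 0 is "shard 1" in the text


def modulo_shard(key, num_shards: int) -> int:
    return stable_hash(key) % num_shards


class ConsistentHashRing:
    """Each shard owns many points ("virtual nodes") on a ring of hash values."""

    def __init__(self, shards, vnodes: int = 100):
        self._ring = sorted(
            (stable_hash(f"{shard}#{v}"), shard)
            for shard in shards
            for v in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def shard_for(self, key) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._points)
        return self._ring[idx][1]


def moved_fraction(before, after, keys) -> float:
    """Share of keys whose shard differs between two routing functions."""
    return sum(before(k) != after(k) for k in keys) / len(keys)


keys = range(10_000)
old_ring = ConsistentHashRing(["s0", "s1", "s2", "s3"])
new_ring = ConsistentHashRing(["s0", "s1", "s2", "s3", "s4"])

modulo_moved = moved_fraction(lambda k: modulo_shard(k, 4), lambda k: modulo_shard(k, 5), keys)
ring_moved = moved_fraction(old_ring.shard_for, new_ring.shard_for, keys)
print(f"modulo 4 -> 5 shards: {modulo_moved:.0%} of keys move")
print(f"ring   4 -> 5 shards: {ring_moved:.0%} of keys move")
```

Going from 4 to 5 shards, plain modulo routing is expected to relocate roughly four fifths of the keys, while the ring relocates roughly one fifth, and every relocated key lands on the new shard. The exact percentages depend on the hash function and the number of virtual nodes.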
Problem 2: How to Route Queries?
Sharding Architecture Types:
- Client-side Sharding: The application layer directly calculates the target shard and connects to the corresponding database (e.g., ShardingSphere-JDBC).
- Proxy-based Sharding: Routes queries to shards through middleware (e.g., MyCat, ProxySQL) that parses SQL, transparent to the application.
- Database-native Sharding: Databases like MongoDB and CockroachDB have built-in automatic sharding capabilities.
4. Implementation Steps (Using Hash-based Sharding as an Example)
Scenario: Split the users table into 4 shards based on user_id.
Step 1: Design the Shard Key
- Choose a frequently queried field as the shard key (e.g., user_id) to avoid cross-shard queries.
Step 2: Create Sharding Rules
-- Shard 1: hash(user_id) % 4 = 0
-- Shard 2: hash(user_id) % 4 = 1
-- ...
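As a minimal sketch of how an application could apply these rules itself (client-side routing), the example below uses four in-memory SQLite databases as stand-ins for the four physical shards. The helper names shard_index, insert_user and get_user, the table layout, and the use of MD5 as the hash function are assumptions for illustration; a production setup would connect to separate database servers instead.

```python
import hashlib
import sqlite3

NUM_SHARDS = 4


def shard_index(user_id: int) -> int:
    """hash(user_id) % 4, using a hash that is stable across processes."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# One in-memory SQLite database stands in for each physical shard.
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for conn in shards:
    conn.execute(
        "CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT NOT NULL)"
    )


def insert_user(user_id: int, username: str) -> None:
    conn = shards[shard_index(user_id)]  # route the write to exactly one shard
    conn.execute(
        "INSERT INTO users (user_id, username) VALUES (?, ?)", (user_id, username)
    )
    conn.commit()


def get_user(user_id: int):
    conn = shards[shard_index(user_id)]  # a shard-key lookup touches one shard only
    return conn.execute(
        "SELECT user_id, username FROM users WHERE user_id = ?", (user_id,)
    ).fetchone()


for uid in range(1, 101):
    insert_user(uid, f"user{uid}")

print(get_user(42))
```

Because both the write path and the read path derive the shard from user_id, a lookup by user_id never has to consult more than one database.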
Step 3: Data Migration Plan
- Dual Writes: Write to both old and new shards simultaneously, gradually migrating historical data.
- Downtime Migration: Suspend service, bulk export/import data, then switch.
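The dual-write option can be sketched with plain dictionaries standing in for the old database and the new shards. This is a simplified illustration under stated assumptions: the names write_user, backfill and verify are hypothetical, routing uses user_id % 4 directly instead of a hash, and failure handling between the two writes is left out.

```python
NUM_SHARDS = 4

old_db = {}                                  # legacy single database: user_id -> row
new_shards = [{} for _ in range(NUM_SHARDS)]  # target shards


def shard_for(user_id):
    return new_shards[user_id % NUM_SHARDS]


def write_user(user_id, row, dual_write=True):
    old_db[user_id] = dict(row)  # the old database stays the source of truth
    if dual_write:
        shard_for(user_id)[user_id] = dict(row)


def backfill():
    """Copy historical rows that dual writes have not already placed on a shard."""
    copied = 0
    for user_id, row in old_db.items():
        target = shard_for(user_id)
        if user_id not in target:  # never overwrite a fresher dual-written row
            target[user_id] = dict(row)
            copied += 1
    return copied


def verify():
    """True when the shards together hold exactly what the old database holds."""
    migrated = {uid: row for shard in new_shards for uid, row in shard.items()}
    return migrated == old_db


# Historical data written before the migration started.
for uid in range(1, 11):
    write_user(uid, {"name": f"user{uid}"}, dual_write=False)

# Dual writes are switched on: one update and one new user arrive.
write_user(3, {"name": "renamed"})
write_user(11, {"name": "user11"})

copied = backfill()
consistent = verify()
print(copied, consistent)
```

Reads would switch to the new shards only after the verification step passes, at which point writes to the old database can be retired.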
Step 4: Handle Cross-Shard Operations
- Cross-Shard Queries: Queries on non-shard keys (e.g., username) require aggregating results from all shards (often handled automatically by middleware).
- Distributed Transactions: Use the XA protocol (two-phase commit) for atomic commits across shards, or the Saga pattern for eventual consistency through compensating actions.
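A query on a non-shard key has to fan out to every shard and merge the partial results, which is roughly what sharding middleware does on the application's behalf. The sketch below assumes in-memory lists as shards and hypothetical helpers query_shard and find_by_username; routing by user_id % 4 and the sample usernames are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4

# Rows are placed by user_id, so rows sharing a username end up on different shards.
shards = [[] for _ in range(NUM_SHARDS)]
for user_id in range(1, 21):
    shards[user_id % NUM_SHARDS].append(
        {"user_id": user_id, "username": f"user{user_id % 5}"}
    )


def query_shard(shard_rows, username):
    """Stand-in for: SELECT * FROM users WHERE username = ? on a single shard."""
    return [row for row in shard_rows if row["username"] == username]


def find_by_username(username):
    # Scatter: send the same query to every shard in parallel.
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        partials = list(pool.map(lambda rows: query_shard(rows, username), shards))
    # Gather: merge the partial results and apply a global ordering.
    merged = [row for partial in partials for row in partial]
    return sorted(merged, key=lambda row: row["user_id"])


matches = find_by_username("user3")
print([row["user_id"] for row in matches])
```

The cost of such a query grows with the number of shards, which is why the shard key should match the most frequent access pattern.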
5. Challenges and Optimizations in Sharding
- Cross-Shard Joins
  - Avoid join queries, or solve them via denormalized fields or data assembly at the business layer.
- Globally Unique IDs
  - Auto-increment IDs generated independently on each shard would collide, so they cannot be relied on in a sharded environment. Use the Snowflake algorithm, UUIDs, etc., to generate IDs.
- Scaling and Rebalancing
  - Pre-plan the number of shards (e.g., based on powers of 2) to reduce data migration during scaling.
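Two of the points above lend themselves to a short sketch: Snowflake-style ID generation, and why doubling the shard count limits data movement. The generator follows the commonly described Snowflake layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence); the class name, the custom epoch, and the worker ID are assumptions. The doubling check assumes plain key % N routing.

```python
import threading
import time


class SnowflakeLikeGenerator:
    """64-bit IDs: 41-bit millisecond timestamp | 10-bit worker ID | 12-bit sequence."""

    EPOCH_MS = 1_577_836_800_000  # 2020-01-01T00:00:00Z; an arbitrary custom epoch
    WORKER_BITS = 10
    SEQUENCE_BITS = 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id: int):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    @staticmethod
    def _now_ms() -> int:
        return time.time_ns() // 1_000_000

    def next_id(self) -> int:
        with self._lock:
            now = max(self._now_ms(), self._last_ms)  # never step back in time
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # 4096 IDs already issued in this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - self.EPOCH_MS
            return (
                (timestamp << (self.WORKER_BITS + self.SEQUENCE_BITS))
                | (self.worker_id << self.SEQUENCE_BITS)
                | self._sequence
            )


def moved_when_doubling(num_shards: int, keys) -> float:
    """Fraction of keys that change shard when key % N becomes key % 2N."""
    moved = 0
    for key in keys:
        old, new = key % num_shards, key % (2 * num_shards)
        # A key either stays put or moves to the one shard split off from its old shard.
        assert new in (old, old + num_shards)
        moved += new != old
    return moved / len(keys)


generator = SnowflakeLikeGenerator(worker_id=3)
ids = [generator.next_id() for _ in range(10_000)]
fraction_moved = moved_when_doubling(4, range(100_000))
print(len(set(ids)), fraction_moved)
```

Each worker needs its own worker_id for the IDs to be unique across machines. For the doubling case, every old shard splits into exactly two, so about half of its rows move and none of them are exchanged between existing shards.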
6. Practical Case: E-commerce Order Table Sharding
- Shard Key: order_id (hash-based sharding).
- Routing Logic: hash(order_id) % 8 maps to 8 shards.
- Query Optimization: The order details page queries directly by order_id, accessing only one shard. The user order list page queries by user_id, which requires storing user_id as a redundant field or using a user-order mapping table.
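The mapping-table option for this case can be sketched as follows. Dictionaries stand in for the 8 order shards and for the user-order mapping table; the function names create_order, get_order and list_orders, the sample data, and MD5 as the hash are assumptions for illustration.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 8


def shard_index(order_id: int) -> int:
    """hash(order_id) % 8, using a hash that is stable across processes."""
    digest = hashlib.md5(str(order_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


order_shards = [{} for _ in range(NUM_SHARDS)]  # per shard: order_id -> order row
user_order_index = defaultdict(list)            # mapping table: user_id -> [order_id, ...]


def create_order(order_id: int, user_id: int, amount: float) -> None:
    row = {"order_id": order_id, "user_id": user_id, "amount": amount}
    order_shards[shard_index(order_id)][order_id] = row
    user_order_index[user_id].append(order_id)  # keep the mapping table in step


def get_order(order_id: int):
    """Order details page: one lookup on exactly one shard."""
    return order_shards[shard_index(order_id)].get(order_id)


def list_orders(user_id: int):
    """User order list page: resolve order IDs first, then fetch each by shard key."""
    return [get_order(order_id) for order_id in user_order_index.get(user_id, [])]


create_order(1001, user_id=7, amount=25.0)
create_order(1002, user_id=8, amount=40.0)
create_order(1003, user_id=7, amount=12.5)

print([order["order_id"] for order in list_orders(7)])
```

In a real deployment the mapping table would itself be stored in a database, typically sharded by user_id, and the two writes in create_order would need the transactional handling discussed under cross-shard operations.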
Applied through the steps above, sharding can effectively improve database scalability and performance, but the gains must be weighed against the added business complexity and the costs of operating a distributed system.