Database Sharding Strategy and Practice

Description
Database sharding is a core technique for storing massive data sets and serving highly concurrent access. When a single table grows too large (e.g., beyond tens of millions of rows) or a database instance reaches its performance ceiling, the data must be split across multiple databases or tables. This topic covers the common sharding strategies, the scenarios each one suits, the implementation steps, and the main challenges (such as cross-shard queries and transaction consistency).

Knowledge Explanation

Why Sharding?
- Single Database Bottleneck: Limited resources such as disk I/O, CPU, memory, and connection count affect read/write performance.
- Single Table Bottleneck: Excessive data volume increases index depth, slowing queries; maintenance becomes difficult (e.g., ALTER TABLE takes too long).
- Goal: Improve scalability by spreading data and load across multiple databases or tables, so that no single instance or table becomes the bottleneck.

Classification of Sharding Strategies
- Vertical Database Partitioning: Split by business modules (e.g., user database, order database) to reduce pressure on a single database.
- Vertical Table Partitioning: Split a wide table by column into frequently accessed (hot) and rarely accessed (cold) fields (e.g., user basic info table + user details table) to improve query efficiency for high-frequency fields.
- Horizontal Database Sharding: Distribute data from the same table to different database instances based on rules (e.g., hashing by user ID).
- Horizontal Table Sharding: Split single-table data into multiple homogeneous tables based on rules (e.g., order tables partitioned by month).
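
As an illustration of horizontal table sharding by month, the sketch below maps an order's creation date to a monthly table and lists the tables a date-range query would have to touch. The table naming scheme (`orders_YYYYMM`) and both function names are assumptions made for this example, not part of any particular framework.

```python
from datetime import date
from typing import List


def order_table_for(created_on: date, base: str = "orders") -> str:
    """Return the monthly table name for an order created on the given date."""
    # Hypothetical naming scheme: orders_YYYYMM
    return f"{base}_{created_on:%Y%m}"


def tables_for_range(start: date, end: date, base: str = "orders") -> List[str]:
    """List every monthly table that a query over [start, end] must read."""
    tables = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        tables.append(f"{base}_{year:04d}{month:02d}")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return tables
```

A query that filters on a narrow date range reads only a few tables, which is why range-based splitting suits time-ordered data such as orders.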

Shard Key Selection and Sharding Algorithms
- Principles for Shard Key Selection:
  - Prefer a column that appears in the most frequent query conditions (e.g., user ID, order time), so that most queries can be served by a single shard instead of crossing shards.
  - Data uniformity (avoid hot shards).
- Common Sharding Algorithms:
  - Hash Modulo: Shard Number = hash(shard key) % total shards.
    - Advantages: Even data distribution.
    - Disadvantages: Changing the total number of shards remaps most keys, so expansion requires large-scale data redistribution (consistent hashing reduces the amount of data that has to move).
  - Range Sharding: Partition by shard key intervals (e.g., by time range).
    - Advantages: Easy expansion (new ranges are added without moving existing data) and efficient range queries.
    - Disadvantages: May cause data skew (e.g., high access density on the latest shard).
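  - Gene Method: Embed the shard number in the ID itself (e.g., reserve the last few digits or bits of the user ID for the shard number), so a record can be routed from its ID alone. Related IDs, such as order IDs, can carry the same trailing bits so that they land on the same shard as the user.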
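
The algorithms above can be compared with a small experiment. The sketch below is illustrative only: it implements hash modulo, a basic consistent-hash ring with virtual nodes, and a bit-based variant of the gene method. The names `stable_hash`, `ConsistentHashRing`, `moved_fraction`, `GENE_BITS`, and the node names `db0`..`db4` are all assumptions introduced for this example. MD5 is used only as a deterministic hash, since Python's built-in `hash()` for strings varies between processes.

```python
import bisect
import hashlib

GENE_BITS = 4  # assumed: 2**4 = 16 shards encoded in the low bits of an ID


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash derived from MD5."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def modulo_shard(key: str, total_shards: int) -> int:
    """Hash modulo: shard number = hash(shard key) % total shards."""
    return stable_hash(key) % total_shards


class ConsistentHashRing:
    """Minimal consistent-hash ring; each node owns `vnodes` points on the ring."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(assign_before, assign_after, keys) -> float:
    """Fraction of keys whose shard changes between two assignment functions."""
    moved = sum(1 for k in keys if assign_before(k) != assign_after(k))
    return moved / len(keys)


def gene_of(user_id: int) -> int:
    """The 'gene': the low GENE_BITS bits of the user ID."""
    return user_id & ((1 << GENE_BITS) - 1)


def make_order_id(sequence: int, user_id: int) -> int:
    """Build an order ID whose low bits carry the owning user's gene."""
    return (sequence << GENE_BITS) | gene_of(user_id)


def shard_of(any_id: int) -> int:
    """Route a user ID or an order ID by its low bits alone."""
    return any_id & ((1 << GENE_BITS) - 1)


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10_000)]
    ring4 = ConsistentHashRing(["db0", "db1", "db2", "db3"])
    ring5 = ConsistentHashRing(["db0", "db1", "db2", "db3", "db4"])
    print("modulo 4 -> 5 moved:",
          moved_fraction(lambda k: modulo_shard(k, 4), lambda k: modulo_shard(k, 5), keys))
    print("ring   4 -> 5 moved:", moved_fraction(ring4.node_for, ring5.node_for, keys))
```

With a uniform hash, going from 4 to 5 shards under plain modulo is expected to move about four out of five keys, because a key stays put only when its hash gives the same remainder for both divisors. On the ring, a key moves only if one of the new node's points now precedes its previous owner, so the moved share is roughly the new node's share of the ring. The gene functions show why an order can be located from either its own ID or the user's ID: both end in the same bits.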

Sharding Implementation Steps
- Step 1: Requirement Assessment
  - Analyze data growth trends and query patterns (e.g., read/write ratio, hot data).
  - Define sharding goals (e.g., support billions of records, tens of thousands of queries per second).
- Step 2: Design Sharding Solution
  - Select shard key and algorithm (e.g., order table split into 64 tables via order_id % 64).
  - Establish routing rules: Implement routing via middleware (e.g., ShardingSphere) or client SDK.
- Step 3: Data Migration and Synchronization
  - Dual-write scheme: Write simultaneously to the old database and new sharded databases, gradually migrating historical data.
  - Downtime migration: Suitable for off-peak periods, using ETL tools for full synchronization.
- Step 4: Application Refactoring
  - Modify the DAO layer to avoid direct cross-shard queries (e.g., split WHERE user_id IN (...) into one query per shard and merge the results in the application).
  - Introduce connection pool management for multiple data sources.
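
A minimal sketch of Steps 2 to 4, reusing the order_id % 64 example: a routing function, a helper that splits an IN list into one query per table, and an in-memory stand-in for dual-write with backfill. Everything here is hypothetical scaffolding: the table names (`orders_00`..`orders_63`), the `DualWriter` class, the dict-based stores, and the `%s` placeholder style (used by some Python MySQL and PostgreSQL drivers) are assumptions, and a real migration would also need retries, consistency checks, and concurrency control.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

TABLE_COUNT = 64  # matches the order_id % 64 example


def order_table(order_id: int) -> str:
    """Routing rule: map an order ID to its physical table name."""
    return f"orders_{order_id % TABLE_COUNT:02d}"


def split_in_query(order_ids: List[int]) -> List[Tuple[str, List[int]]]:
    """Turn one cross-shard IN (...) lookup into one parameterized query per table."""
    by_table: Dict[str, List[int]] = defaultdict(list)
    for oid in order_ids:
        by_table[order_table(oid)].append(oid)
    queries = []
    for table, ids in sorted(by_table.items()):
        placeholders = ", ".join(["%s"] * len(ids))
        sql = f"SELECT * FROM {table} WHERE order_id IN ({placeholders})"
        queries.append((sql, ids))
    return queries


class DualWriter:
    """In-memory stand-in for dual-write: dicts play the role of databases."""

    def __init__(self) -> None:
        self.legacy: Dict[int, dict] = {}  # old, unsharded store
        self.sharded: Dict[str, Dict[int, dict]] = defaultdict(dict)  # new tables

    def write(self, order_id: int, row: dict) -> None:
        # New writes go to both the old store and the new sharded table.
        self.legacy[order_id] = row
        self.sharded[order_table(order_id)][order_id] = row

    def backfill(self) -> int:
        """Copy historical rows that predate dual-write; return how many were copied."""
        copied = 0
        for order_id, row in self.legacy.items():
            table = self.sharded[order_table(order_id)]
            if order_id not in table:
                table[order_id] = row
                copied += 1
        return copied
```

For example, `split_in_query([1, 65, 2])` yields two queries, one for `orders_01` with IDs 1 and 65 and one for `orders_02` with ID 2. The caller runs each against the matching data source and concatenates the rows.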

Challenges and Solutions
- Cross-Shard Queries:
  - Solution 1: Aggregation layer merging (e.g., aggregating query results in memory).
  - Solution 2: Build global index tables (e.g., mapping user IDs to shards).
- Distributed Transactions:
  - Use flexible (BASE-style) transactions such as the Saga pattern or TCC (Try-Confirm-Cancel), or rely on message queues to reach eventual consistency.
- Expansion and Rebalancing:
  - Pre-sharding (e.g., define 1024 logical shards up front and map them onto a small number of physical databases; add physical databases gradually by moving whole logical shards).
  - Dynamic migration tools (e.g., Vitess's VReplication).
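
Two of the solutions above lend themselves to a short sketch: merging per-shard results in an aggregation layer, and pre-sharding with a logical-to-physical mapping. The function names, the `db0`..`db2` database names, and the round-robin initial layout are assumptions made for illustration. The merge assumes each shard already returns its rows sorted by the same key (for example, because the per-shard query used the same ORDER BY).

```python
import heapq
from itertools import islice
from typing import Callable, Dict, Iterable, List

LOGICAL_SHARDS = 1024  # fixed up front; never changes after launch


def merge_sorted_pages(per_shard_rows: List[Iterable], limit: int, key: Callable) -> list:
    """Aggregation layer: merge already-sorted per-shard results, keep the first `limit`."""
    return list(islice(heapq.merge(*per_shard_rows, key=key), limit))


def logical_shard(user_id: int) -> int:
    """Key -> logical shard. This function stays the same during expansion."""
    return user_id % LOGICAL_SHARDS


def initial_mapping(physical_dbs: List[str]) -> Dict[int, str]:
    """Spread the logical shards over the initial physical databases round-robin."""
    return {s: physical_dbs[s % len(physical_dbs)] for s in range(LOGICAL_SHARDS)}


def reassign(mapping: Dict[int, str], shards: Iterable[int], target_db: str) -> Dict[int, str]:
    """Expansion: move whole logical shards to a new database by editing the mapping."""
    new_mapping = dict(mapping)
    for s in shards:
        new_mapping[s] = target_db
    return new_mapping


def database_for(user_id: int, mapping: Dict[int, str]) -> str:
    """Two-step routing: key -> logical shard -> physical database."""
    return mapping[logical_shard(user_id)]
```

The design point is that expansion changes only the mapping table: keys never change logical shard, so data moves in whole-shard units and no rehashing of individual rows is needed.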

Conclusion
Sharding is a system-wide engineering effort: the sharding strategy must be chosen to fit the business characteristics and backed by monitoring and operational tooling to keep the system stable. In practice, it is often combined with techniques such as read/write separation and caching to form a complete architecture.