Database Sharding and Partitioning Strategies and Practices

Database Sharding and Partitioning Strategies and Practices

Problem Description
Database sharding and partitioning are key techniques for addressing database performance bottlenecks. When a single table's data volume becomes excessively large or concurrent access pressure is too high, data splitting is used to enhance system performance. Please explain in detail the common strategies for database sharding/partitioning, applicable scenarios, and critical issues encountered in practice.

Detailed Knowledge Points

1. Why Do We Need Database Sharding/Partitioning?

  • Single Database Bottleneck: When data volume reaches tens of millions or more, single-table indexes bloat, causing query performance to degrade sharply.
  • Single Machine Resource Limits: Resources such as disk I/O, CPU, and memory become bottlenecks, affecting concurrent processing capabilities.
  • High Concurrency Pressure: A large number of requests concentrated on a single database instance can make the number of connections a bottleneck.

2. Core Strategies for Sharding/Partitioning
Sharding/partitioning is divided into two directions: vertical splitting and horizontal splitting:

2.1 Vertical Splitting (by Business Dimension)

  • Vertical Database Sharding: Split tables of different business modules into different database instances (e.g., user database, order database).
  • Vertical Table Partitioning: Split columns of a single table based on access frequency (e.g., place frequently used fields in the main table, and large or infrequently accessed fields in an extension table).
  • Example: Splitting the user table into user_base (basic info) and user_profile (detailed info).

2.2 Horizontal Splitting (by Data Dimension)

  • Horizontal Database Sharding: Distribute data of the same table to different databases according to rules (e.g., based on user ID modulo to DB1, DB2).
  • Horizontal Table Partitioning: Split single-table data into multiple child tables according to rules (e.g., by time month: order_202401, order_202402).
  • Common Shard Keys: Fields with business relevance such as User ID, Order ID, geographic location.

3. Detailed Shard Routing Strategies
3.1 Range-Based Sharding

  • Partition based on continuous ranges of the shard key (e.g., IDs 1-1000 in shard 1, 1001-2000 in shard 2).
  • Advantages: Suitable for range queries, easy to scale out.
  • Disadvantages: May create data hotspots (new data concentrates in the last shard).

3.2 Hash-Based Sharding

  • Take the hash value of the shard key and then modulo the number of shards (e.g., Shard Number = hash(user_id) % 4).
  • Advantages: Even data distribution, avoids hotspots.
  • Disadvantages: Scaling requires re-hashing, range queries need cross-shard scanning.

3.3 Consistent Hashing

  • Uses a virtual node ring to reduce the amount of data migration during scaling (only affects adjacent shards).
  • Example: When scaling from 4 to 5 shards, only about 20% of data needs migration.

4. Critical Issues in Practice
4.1 Global ID Generation

  • Problem with Database Auto-increment IDs: May duplicate after sharding.
  • Solutions:
    • Snowflake Algorithm: Generates 64-bit IDs (timestamp + machine ID + sequence number).
    • Segment Mode: Batch-fetch ID segments from the database (e.g., allocate 1000 IDs at once).

4.2 Cross-Shard Queries

  • Query Condition Lacks Shard Key: Requires sending queries to all shards and aggregating results (poor performance).
  • Solutions: Create heterogeneous index tables (e.g., mapping tables based on non-shard keys), or use search engines (e.g., Elasticsearch).

4.3 Distributed Transactions

  • Scenario: Update operations across shards (e.g., fund transfer involving two users on different shards).
  • Solutions: Use XA protocol, TCC pattern, or eventual consistency based on message queues.

5. Database Sharding Middleware

  • Proxy Mode: Add a proxy layer between the application and database (e.g., MyCat, ShardingSphere-Proxy), transparent to the application.
  • Client Mode: Integrate an SDK at the application layer (e.g., Sharding-JDBC), higher performance but requires code changes.

Summary Recommendations

  • Prioritize SQL and index optimization; consider sharding only when single-table data exceeds tens of millions.
  • Choose shard keys considering both data distribution and query patterns.
  • Design scaling-out plans in advance to avoid downtime during migration.