Design and Practice of Horizontal Sharding and Vertical Sharding for Databases
Problem Description
Horizontal sharding and vertical sharding are the two core strategies for database partitioning (splitting databases and tables). They address the limits a single-node database hits as data volume, concurrency, or storage grows. Interviews often ask candidates to compare their design philosophies, applicable scenarios, and practical details, and to analyze how to choose a sharding key and handle challenges such as cross-shard queries.
1. Vertical Sharding
Core Idea
Split a table by columns into multiple tables with different structures, grouped by business module or column relationship, and place the resulting tables in the same database or in separate databases.
Example: A user table contains basic information (user_id, name) and extended information (address, hobbies). The extended information is split into a separate table.
Detailed Steps
Analyze Data Access Patterns
- Identify frequently queried columns (e.g., basic user info) vs. infrequently accessed columns (e.g., user details).
- Examine data hot/cold characteristics (e.g., recent orders vs. historical orders).
Design Splitting Rules
- Group by Columns: Keep high-frequency access columns in the main table; move low-frequency or large-field columns (e.g., BLOBs) to secondary tables.
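- Foreign Key Association: The split tables share the same primary key (e.g., user_id), so rows can be matched back together. If the tables live in different databases, the database cannot enforce this link, and the application must keep the two sides consistent.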
Pros and Cons Analysis
- Pros: Reduces single-table width, improves performance for high-frequency queries; separates hot/cold data to reduce I/O pressure.
- Cons: When the split tables live in different databases, the application layer must perform the joins itself, and transactions that touch more than one table become more complex.
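The column split and the application-level join described above can be sketched as follows. This is a minimal illustration, not a production design: two in-memory SQLite databases stand in for separate database instances, and the table names `user_base` and `user_ext` and the helper `get_user_profile` are hypothetical names chosen for this example.

```python
import sqlite3

# Two independent in-memory databases stand in for two database instances.
base_db = sqlite3.connect(":memory:")
ext_db = sqlite3.connect(":memory:")

# Hot, narrow table: columns read on almost every request.
base_db.execute("CREATE TABLE user_base (user_id INTEGER PRIMARY KEY, name TEXT)")
# Cold, wide table: shares the same primary key as user_base.
ext_db.execute(
    "CREATE TABLE user_ext (user_id INTEGER PRIMARY KEY, address TEXT, hobbies TEXT)"
)

base_db.execute("INSERT INTO user_base VALUES (1, 'Alice')")
ext_db.execute("INSERT INTO user_ext VALUES (1, '1 Main St', 'chess')")


def get_user_profile(user_id, include_details=False):
    """Fetch basic info; join extended info in the application only when asked."""
    row = base_db.execute(
        "SELECT user_id, name FROM user_base WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None
    profile = {"user_id": row[0], "name": row[1]}
    if include_details:
        # The "join" happens here, in application code, via the shared key.
        ext = ext_db.execute(
            "SELECT address, hobbies FROM user_ext WHERE user_id = ?", (user_id,)
        ).fetchone()
        if ext is not None:
            profile.update({"address": ext[0], "hobbies": ext[1]})
    return profile
```

High-frequency reads call `get_user_profile(1)` and never touch the extended table; only the less common detail view pays for the second query.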
2. Horizontal Sharding
Core Idea
Split the data of the same table by rows into multiple databases or tables with identical structures.
Example: An orders table is distributed across 4 databases based on the hash value of the order ID.
Detailed Steps
Choose a Sharding Key
- Principles: A good sharding key distributes data evenly (avoiding hotspots), appears in most queries (e.g., order ID), and keeps cross-shard queries to a minimum.
- Common Strategies:
  - Hash-based Sharding: Shard Number = hash(Sharding Key) % Total Shards. Ensures even distribution but makes range queries difficult.
  - Range-based Sharding: Divides data by intervals like time or ID (e.g., one shard per month). Easy to expand but may cause data skew.
  - Geo-based Sharding: Assigns data based on user region to optimize local access.
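The hash-based and range-based strategies above can be sketched as two small routing functions. This is an illustrative sketch under stated assumptions: the shard count of 4, the monthly boundaries, and the function names `hash_shard` and `range_shard` are all made up for the example, and CRC32 is just one possible stable hash.

```python
import bisect
import zlib
from datetime import date

TOTAL_SHARDS = 4  # assumed shard count for illustration


def hash_shard(sharding_key: str, total_shards: int = TOTAL_SHARDS) -> int:
    """Shard Number = hash(Sharding Key) % Total Shards.

    zlib.crc32 is used instead of Python's built-in hash(), because hash() of
    a str is randomized per process and could route the same key to a
    different shard after a restart.
    """
    return zlib.crc32(sharding_key.encode("utf-8")) % total_shards


# Range-based: each shard starts at its lower bound and runs up to the next
# bound; the last shard is open-ended. Boundaries are assumed values.
RANGE_LOWER_BOUNDS = [date(2024, 1, 1), date(2024, 2, 1), date(2024, 3, 1)]


def range_shard(created_on: date) -> int:
    """Return the index of the monthly range that contains created_on."""
    index = bisect.bisect_right(RANGE_LOWER_BOUNDS, created_on) - 1
    if index < 0:
        raise ValueError(f"{created_on} is earlier than the first range")
    return index
```

The trade-off in the list is visible in the code: `range_shard` keeps neighboring dates on the same shard, so a date-range query touches few shards, while `hash_shard` scatters neighboring keys and a range query generally has to ask every shard.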
Design Sharding Routing
- Client-side Routing: The application layer calculates the target shard (e.g., using Sharding-JDBC).
- Middleware Routing: A proxy layer (e.g., MyCat) parses SQL and routes queries.
- Global Index Table: Maintains a mapping between sharding keys and shard locations, suitable for dynamic scaling.
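The global index table idea can be sketched as a lookup-based router. This is a simplified, hypothetical design: `DirectoryRouter`, its methods, and the shard names are invented for illustration, an in-memory dict stands in for the index table, and the fallback to a modulo rule for unmapped keys is an assumption of this sketch rather than part of any specific middleware.

```python
class DirectoryRouter:
    """Hypothetical lookup-table router: sharding key -> shard name."""

    def __init__(self, default_shards):
        self._default_shards = list(default_shards)
        self._directory = {}  # stands in for the global index table

    def locate(self, key: int) -> str:
        # An explicit directory entry wins; otherwise fall back to a modulo rule.
        if key in self._directory:
            return self._directory[key]
        return self._default_shards[key % len(self._default_shards)]

    def move(self, key: int, new_shard: str) -> None:
        """Record that a key now lives on new_shard (after its data was copied)."""
        self._directory[key] = new_shard


router = DirectoryRouter(["db0", "db1"])
before = router.locate(7)   # resolved by the modulo fallback
router.move(7, "db2")       # key 7 is relocated to a newly added shard
after = router.locate(7)    # resolved by the directory entry
```

Because the location of a key is data rather than a formula, individual keys can be moved to a new shard without changing the routing rule for every other key, which is what makes this approach suitable for dynamic scaling. The cost is an extra lookup on every request and a table that must itself stay available.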
Handle Cross-Shard Issues
- Query Merging: For cross-shard queries (e.g., SUM), the middleware aggregates results.
- Distributed Transactions: Use Two-Phase Commit (2PC) for atomic commits across shards, or compensation-based solutions such as TCC (Try-Confirm-Cancel) that settle for eventual consistency.
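Query merging can be illustrated with a small scatter-gather sketch. It assumes in-memory SQLite databases as shards and a simplified `orders(order_id, amount)` table; `make_shard` and `total_and_average` are hypothetical helpers, not the API of any middleware.

```python
import sqlite3


def make_shard(rows):
    """Create an in-memory SQLite database holding one shard of the orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


def total_and_average(shards):
    """Send the same aggregate to every shard, then merge the partial results."""
    total, count = 0.0, 0
    for conn in shards:
        shard_sum, shard_count = conn.execute(
            "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM orders"
        ).fetchone()
        total += shard_sum
        count += shard_count
    # The global average is derived from the merged SUM and COUNT.
    average = total / count if count else None
    return total, average


shards = [make_shard([(1, 10.0), (2, 20.0)]), make_shard([(3, 60.0)]), make_shard([])]
result = total_and_average(shards)
```

SUM and COUNT merge by simple addition, but AVG does not: averaging the per-shard averages (15 and 60 here) would give 37.5, while the true average over all three orders is 30. The merge layer therefore has to rewrite AVG into SUM and COUNT before fanning the query out.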
3. Comparison and Selection Principles
| Aspect | Vertical Sharding | Horizontal Sharding |
|---|---|---|
| Splitting Unit | Columns (table structure changes) | Rows (table structure unchanged) |
| Applicable Scenarios | Tables with many columns, distinct hot/cold data | Large data volume, high concurrency |
| Scalability | Limited (the row count of each table is not reduced) | Strong (capacity grows by adding shards) |
| Complexity | Lower (no row routing, but split tables may need application-level joins) | High (requires handling routing, cross-shard queries, and cross-shard transactions) |
Selection Advice:
- First apply vertical sharding to optimize single-table structure; if data volume remains large, then apply horizontal sharding.
- For horizontal sharding, plan the sharding key in advance to avoid difficulties in future data migration.
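Why the sharding scheme needs to be planned in advance can be shown with a short calculation. The sketch below assumes integer keys routed by plain `key % shard_count`; `moved_fraction` is a hypothetical helper that counts how many keys land on a different shard after the shard count changes.

```python
def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the modulo divisor changes."""
    keys = list(keys)
    moved = sum(1 for k in keys if k % old_shards != k % new_shards)
    return moved / len(keys)


# Growing from 4 to 5 shards relocates most keys.
grow_by_one = moved_fraction(range(20000), 4, 5)
# Doubling from 4 to 8 shards relocates half of them.
double = moved_fraction(range(8000), 4, 8)
```

Under these assumptions, adding a single shard moves 80% of the keys, while doubling the shard count moves 50%. This is one reason teams often expand in multiples of the current shard count, or adopt schemes such as consistent hashing or a lookup table, rather than changing the modulo divisor arbitrarily.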
4. Practical Case: E-commerce Orders Table Sharding
- Vertical Sharding:
- Split into an orders base table (order_id, user_id, amount) and an orders detail table (order_id, product_list, logistics).
- Horizontal Sharding:
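- Use hash-based sharding on order_id, and add new shard databases (e.g., annually) to absorb data growth. Under modulo hashing, changing the shard count remaps most keys, so each expansion needs a planned data migration.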
- Query Optimization:
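- Store a redundant copy of order data sharded by user_id, so that a "list this user's orders" query hits a single shard instead of fanning out across every order_id shard or requiring cross-database joins.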
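The redundant-copy idea in this case can be sketched as a dual write. This is a toy model under stated assumptions: plain dicts stand in for shard databases, both copies use 4 shards with modulo routing, and `save_order` and `orders_for_user` are hypothetical helper names.

```python
from collections import defaultdict

ORDER_SHARDS = 4  # assumed number of order_id shards
USER_SHARDS = 4   # assumed number of user_id shards

# Each dict stands in for one shard database.
orders_by_order_id = [dict() for _ in range(ORDER_SHARDS)]
orders_by_user_id = [defaultdict(list) for _ in range(USER_SHARDS)]


def save_order(order_id: int, user_id: int, amount: float) -> None:
    """Write the primary copy by order_id and a redundant copy by user_id."""
    order = {"order_id": order_id, "user_id": user_id, "amount": amount}
    orders_by_order_id[order_id % ORDER_SHARDS][order_id] = order
    orders_by_user_id[user_id % USER_SHARDS][user_id].append(order)


def orders_for_user(user_id: int):
    """Single-shard lookup: no fan-out across the order_id shards."""
    return orders_by_user_id[user_id % USER_SHARDS].get(user_id, [])


save_order(101, 7, 25.0)
save_order(102, 7, 5.0)
save_order(103, 8, 9.0)
```

In a real system the two writes go to different databases and are not atomic, so the redundant copy is usually kept in sync through a distributed transaction or an asynchronous, eventually consistent pipeline. The sketch ignores that concern.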
Summary
Horizontal and vertical sharding are cornerstones of distributed database design. They must be applied flexibly based on business characteristics. Vertical sharding focuses on decoupling data relationships, while horizontal sharding addresses scale-out challenges. In practice, they are often combined to achieve a balance between performance and maintainability.