Global Unique ID Generation Solutions for Database Horizontal Sharding

Global Unique ID Generation Solutions for Database Horizontal Sharding

Problem Description
In a horizontally sharded database architecture, traditional auto-increment primary keys (such as MySQL's AUTO_INCREMENT) will generate duplicate IDs across different physical nodes, breaking global uniqueness. Please design an ID generation solution that ensures global uniqueness while also considering performance, monotonic ordering, and scalability. Explain its principles and implementation details.

Solution Process

1. Problem Analysis

  • Limitations of Auto-increment Keys: Auto-increment keys in a single database rely on a centralized counter. After sharding, each node counts independently, leading to ID duplication.
  • Core Requirements:
    • Global Uniqueness: No duplicate IDs across the entire distributed system.
    • High Performance: Low latency, high throughput, avoiding becoming a system bottleneck.
    • Trend Ordering: IDs should be roughly incremental to optimize database indexing (e.g., sequential inserts in B+ trees).
    • Scalability: Support horizontal scaling.

2. Common Solutions Comparison

Solution Principle Advantages Disadvantages
UUID Randomly generates a 128-bit string Simple, no centralized dependency Non-sequential leading to index fragmentation, large storage space
Database Segment Mode Batch requests for auto-increment ID ranges Reduces database pressure, trend ordering Depends on database, potential blocking when segment is exhausted
Snowflake Algorithm Timestamp + Machine ID + Sequence Number High performance, ordered, decentralized Relies on system clock (clock drift/rollback issue)

3. Detailed Explanation of Snowflake Algorithm

  • ID Structure (64-bit binary):

    0 | 0000000000 0000000000 0000000000 0000000000 0 | 0000000000 | 000000000000  
    └─────────────┬─────────────┘ └─────┬──────┘ └──────┬─────┘  
        41-bit Timestamp (ms)       10-bit Machine ID     12-bit Sequence Number  
    
    • Timestamp: Current time minus a custom epoch (e.g., 2020-01-01), usable for about 69 years.
    • Machine ID: Assigned via configuration or service discovery, supports up to 1024 nodes.
    • Sequence Number: Auto-increment counter within the same millisecond, supports 4096 IDs per node per millisecond.
  • Ordering Principle:
    The timestamp occupies the high-order bits, ensuring IDs increase over time. Sequence numbers guarantee order within the same millisecond.

  • Handling Clock Rollback:

    • Minor rollback (e.g., ≤100ms): Wait for the clock to catch up.
    • Severe rollback: Raise an alarm and refuse to generate IDs, or switch to a backup node.

4. Optimized Variants

  • Leaf-Segment (Optimized Database Segment Mode):
    • Fetch a segment of IDs (e.g., 1~1000) from the database each time and allocate them from memory.
    • Pre-fetch the next segment asynchronously using a double Buffer to avoid allocation blocking.
  • Leaf-Snowflake:
    • Introduce ZooKeeper to coordinate machine ID allocation, handling machine ID switching during clock rollback.

5. Implementation Example (Java Pseudocode for Snowflake)

public class SnowflakeIdGenerator {  
    private final long startTime = 1609459200000L; // 2020-01-01  
    private final int machineIdBits = 10;  
    private final int sequenceBits = 12;  
    private final long maxMachineId = (1L << machineIdBits) - 1;  
    private final long maxSequence = (1L << sequenceBits) - 1;  
    private final long machineId;  
    private long lastTimestamp = -1L;  
    private long sequence = 0L;  

    public synchronized long nextId() {  
        long currentTime = System.currentTimeMillis();  
        if (currentTime < lastTimestamp) {  
            throw new RuntimeException("Clock moved backwards!");  
        }  
        if (currentTime == lastTimestamp) {  
            sequence = (sequence + 1) & maxSequence;  
            if (sequence == 0) {  
                // Sequence exhausted for current millisecond, wait for next millisecond  
                currentTime = waitNextMillis(currentTime);  
            }  
        } else {  
            sequence = 0L;  
        }  
        lastTimestamp = currentTime;  
        return ((currentTime - startTime) << (machineIdBits + sequenceBits))  
                | (machineId << sequenceBits)  
                | sequence;  
    }  
}  

6. Summary

  • Small-scale Systems: Prioritize the Database Segment Mode for its simplicity and reliability.
  • Large-scale High-concurrency Systems: Snowflake or its variants are more suitable, but clock synchronization issues must be addressed.
  • Hybrid Solutions: Combine aspects like batch segment requests from the Segment Mode with the decentralized nature of Snowflake (e.g., Meituan's Leaf).