Global Unique ID Generation Solutions for Database Horizontal Sharding

Global Unique ID Generation Solutions for Database Horizontal Sharding

Problem Description
In a horizontally sharded database architecture, traditional auto-increment primary keys (such as MySQL's AUTO_INCREMENT) will generate duplicate IDs across different physical nodes, breaking global uniqueness. Please design an ID generation solution that ensures global uniqueness while also considering performance, monotonic ordering, and scalability. Explain its principles and implementation details.

Solution Process

1. Problem Analysis

Limitations of Auto-increment Keys: Auto-increment keys in a single database rely on a centralized counter. After sharding, each node counts independently, leading to ID duplication.
Core Requirements:
- Global Uniqueness: No duplicate IDs across the entire distributed system.
- High Performance: Low latency, high throughput, avoiding becoming a system bottleneck.
- Trend Ordering: IDs should be roughly incremental to optimize database indexing (e.g., sequential inserts in B+ trees).
- Scalability: Support horizontal scaling.

2. Common Solutions Comparison

Solution	Principle	Advantages	Disadvantages
UUID	Randomly generates a 128-bit string	Simple, no centralized dependency	Non-sequential leading to index fragmentation, large storage space
Database Segment Mode	Batch requests for auto-increment ID ranges	Reduces database pressure, trend ordering	Depends on database, potential blocking when segment is exhausted
Snowflake Algorithm	Timestamp + Machine ID + Sequence Number	High performance, ordered, decentralized	Relies on system clock (clock drift/rollback issue)

3. Detailed Explanation of Snowflake Algorithm

ID Structure (64-bit binary):

0 | 0000000000 0000000000 0000000000 0000000000 0 | 0000000000 | 000000000000  
└─────────────┬─────────────┘ └─────┬──────┘ └──────┬─────┘  
    41-bit Timestamp (ms)       10-bit Machine ID     12-bit Sequence Number

Timestamp: Current time minus a custom epoch (e.g., 2020-01-01), usable for about 69 years.
Machine ID: Assigned via configuration or service discovery, supports up to 1024 nodes.
Sequence Number: Auto-increment counter within the same millisecond, supports 4096 IDs per node per millisecond.

Ordering Principle:
The timestamp occupies the high-order bits, ensuring IDs increase over time. Sequence numbers guarantee order within the same millisecond.
Handling Clock Rollback:
- Minor rollback (e.g., ≤100ms): Wait for the clock to catch up.
- Severe rollback: Raise an alarm and refuse to generate IDs, or switch to a backup node.

4. Optimized Variants

Leaf-Segment (Optimized Database Segment Mode):
- Fetch a segment of IDs (e.g., 1~1000) from the database each time and allocate them from memory.
- Pre-fetch the next segment asynchronously using a double Buffer to avoid allocation blocking.
Leaf-Snowflake:
- Introduce ZooKeeper to coordinate machine ID allocation, handling machine ID switching during clock rollback.

5. Implementation Example (Java Pseudocode for Snowflake)

public class SnowflakeIdGenerator {  
    private final long startTime = 1609459200000L; // 2020-01-01  
    private final int machineIdBits = 10;  
    private final int sequenceBits = 12;  
    private final long maxMachineId = (1L << machineIdBits) - 1;  
    private final long maxSequence = (1L << sequenceBits) - 1;  
    private final long machineId;  
    private long lastTimestamp = -1L;  
    private long sequence = 0L;  

    public synchronized long nextId() {  
        long currentTime = System.currentTimeMillis();  
        if (currentTime < lastTimestamp) {  
            throw new RuntimeException("Clock moved backwards!");  
        }  
        if (currentTime == lastTimestamp) {  
            sequence = (sequence + 1) & maxSequence;  
            if (sequence == 0) {  
                // Sequence exhausted for current millisecond, wait for next millisecond  
                currentTime = waitNextMillis(currentTime);  
            }  
        } else {  
            sequence = 0L;  
        }  
        lastTimestamp = currentTime;  
        return ((currentTime - startTime) << (machineIdBits + sequenceBits))  
                | (machineId << sequenceBits)  
                | sequence;  
    }  
}

6. Summary

Small-scale Systems: Prioritize the Database Segment Mode for its simplicity and reliability.
Large-scale High-concurrency Systems: Snowflake or its variants are more suitable, but clock synchronization issues must be addressed.
Hybrid Solutions: Combine aspects like batch segment requests from the Segment Mode with the decentralized nature of Snowflake (e.g., Meituan's Leaf).