ID Generation Strategies Under Database Horizontal Sharding

Problem Description:
After a distributed database system adopts horizontal sharding (splitting data across multiple databases and tables), generating a globally unique ID for each record becomes a critical issue. The auto-increment ID mechanism of a traditional single-machine database produces duplicate IDs and becomes a performance bottleneck in a distributed environment. Describe the common solutions for distributed ID generation and how they work.

Detailed Knowledge Points:

1. Problem Background and Challenges
In single-machine databases, we typically use the database's auto-increment field (such as MySQL's AUTO_INCREMENT) to generate primary key IDs. This method is simple, reliable, and ensures the uniqueness and incremental nature of IDs.
However, in a horizontal sharding architecture, data is distributed across multiple database instances or tables. If each shard uses local auto-increment IDs, serious problems arise:

ID Conflicts: Different shards may generate the same ID (e.g., Shard A generates ID=100, and Shard B may also generate ID=100).
Global Disorder: It is impossible to guarantee that the generated IDs are monotonically increasing on a global scale, which is detrimental to range queries or sorting by time.

Therefore, we need a solution capable of generating globally unique IDs in a distributed environment.
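
The conflict described above can be reproduced with a minimal sketch. The Shard class below is a hypothetical in-memory stand-in for a database shard, not a real driver API; each instance keeps its own local auto-increment counter, as an independent database instance would.

```python
import itertools


class Shard:
    """Hypothetical stand-in for one shard with a local auto-increment counter."""

    def __init__(self, name):
        self.name = name
        self._counter = itertools.count(1)  # every shard starts counting at 1
        self.rows = {}

    def insert(self, payload):
        row_id = next(self._counter)
        self.rows[row_id] = payload
        return row_id


shard_a = Shard("A")
shard_b = Shard("B")

ids_a = [shard_a.insert(f"order-a-{i}") for i in range(3)]
ids_b = [shard_b.insert(f"order-b-{i}") for i in range(3)]

# The same primary keys exist on both shards, so an ID alone
# no longer identifies a single record.
collisions = set(ids_a) & set(ids_b)
print(sorted(collisions))
```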

2. Solution One: UUID
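Principle: A UUID (Universally Unique Identifier) is a 128-bit value generated locally by a standard algorithm. Depending on the version, it is derived from a timestamp and MAC address (version 1) or from random numbers (version 4). Uniqueness is probabilistic rather than absolute, but the collision probability is low enough to be ignored in practice.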
Advantages:

Simple to generate, requiring no central node or coordination.
Generated locally, no network overhead, high performance.
Disadvantages:
ID is too long (36 characters in its canonical text form, 16 bytes in binary), occupying significant storage space.
Lack of order leads to decreased database indexing efficiency (because B+Tree indexes require frequent page splits when inserting unordered data).
Poor readability.
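
As a small illustration of the trade-offs listed above, the following sketch uses Python's standard uuid module to generate random (version 4) UUIDs and inspect their size. The choice of version 4 is an assumption for the example; the text does not prescribe a particular version.

```python
import uuid

# Generate a few random (version 4) UUIDs locally, with no coordination.
ids = [uuid.uuid4() for _ in range(5)]

for u in ids:
    # Canonical text form is 36 characters; the raw value is 16 bytes.
    print(str(u), len(str(u)), len(u.bytes))

# Generation order usually differs from sorted order, which is why
# inserts land on arbitrary B+Tree pages instead of the rightmost one.
print("already in sorted order:", ids == sorted(ids))
```

Storing the 16-byte binary form instead of the 36-character string reduces the space cost, but it does not restore ordering.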

3. Solution Two: Database Auto-increment ID Table
Principle: Use a separate database instance (or table) dedicated to generating IDs. By maintaining a global table containing an auto-increment field, other applications or shards insert an empty record into this table when they need an ID and obtain the generated auto-increment ID.
Advantages:

Ensures global uniqueness and incremental nature of IDs.
Relatively simple to implement.
Disadvantages:
The ID database is a single point of failure: if it goes down, no part of the system can generate IDs.
Every ID costs one database write, so in high-concurrency scenarios this table becomes a performance bottleneck.
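
A minimal sketch of this approach follows. It uses an in-memory SQLite database as a stand-in for the dedicated ID database, and the table name id_generator and the helper next_id are hypothetical names chosen for the example. In MySQL the equivalent column would be declared with AUTO_INCREMENT.

```python
import sqlite3

# In-memory SQLite stands in for the dedicated ID database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE id_generator (id INTEGER PRIMARY KEY AUTOINCREMENT, stub TEXT)"
)


def next_id(conn):
    """Insert an empty record and return the auto-increment ID it received."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO id_generator (stub) VALUES ('')")
        return cur.lastrowid


generated = [next_id(conn) for _ in range(3)]
print(generated)
```

Each call is a full write transaction against one table, which is where the bottleneck noted above comes from.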

4. Solution Three: Segment Mode
Principle: An improved version of the database auto-increment ID table method. Instead of generating one ID at a time, it retrieves a segment of IDs in one go (e.g., 1~1000). The application caches this segment locally and requests a new segment after exhausting it.
Workflow:

Maintain a table in the database recording the business Tag, current maximum ID (Max_ID), and segment length (Step).
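When the application starts, or when its current segment is exhausted, it runs an update such as UPDATE id_table SET Max_ID = Max_ID + Step WHERE tag = 'order' and reads back the updated Max_ID in the same transaction, so that no two applications obtain the same segment.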
The application can then locally use IDs in the range from (Max_ID - Step + 1) to Max_ID.
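
The workflow above can be sketched as follows. This is an illustration under stated assumptions rather than a production implementation: an in-memory SQLite database stands in for the shared ID database, the class name SegmentAllocator is hypothetical, and the columns follow the Tag, Max_ID, and Step fields described above. Against a real shared database, the row lock taken by the UPDATE is what keeps concurrent applications from receiving the same segment.

```python
import sqlite3
import threading


class SegmentAllocator:
    """Hands out IDs from a locally cached segment, refilling from the database."""

    def __init__(self, conn, tag):
        self._conn = conn
        self._tag = tag
        self._lock = threading.Lock()
        self._next = 1
        self._max = 0  # empty segment, so the first call fetches one

    def _fetch_segment(self):
        with self._conn:  # one transaction: advance max_id, then read it back
            self._conn.execute(
                "UPDATE id_table SET max_id = max_id + step WHERE tag = ?",
                (self._tag,),
            )
            max_id, step = self._conn.execute(
                "SELECT max_id, step FROM id_table WHERE tag = ?",
                (self._tag,),
            ).fetchone()
        self._next = max_id - step + 1
        self._max = max_id

    def next_id(self):
        with self._lock:
            if self._next > self._max:
                self._fetch_segment()
            value = self._next
            self._next += 1
            return value


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE id_table (tag TEXT PRIMARY KEY, max_id INTEGER, step INTEGER)"
)
conn.execute("INSERT INTO id_table VALUES ('order', 0, 1000)")
conn.commit()

# Two allocators simulate two application instances sharing one ID table.
app_1 = SegmentAllocator(conn, "order")
app_2 = SegmentAllocator(conn, "order")
first = app_1.next_id()   # app_1 takes the first segment
second = app_2.next_id()  # app_2 takes the following segment
print(first, second)
```

Only one database round trip is needed per Step IDs; all other calls are served from memory.
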
Advantages:

Significantly reduces database pressure (from retrieving one ID per request to retrieving a batch per request).
If the database fails, the application's locally cached segment can still be used for a period, improving fault tolerance.
Disadvantages:
Requires the application to maintain the local segment, making implementation slightly more complex.
If the application restarts, the unused remainder of its cached segment is discarded, leaving gaps in the ID sequence.

5. Solution Four: Snowflake Algorithm
Principle: An open-source distributed ID generation algorithm from Twitter. It generates a 64-bit long integer with the following structure:

1-bit sign bit: Fixed at 0.
41-bit timestamp: Milliseconds elapsed between a chosen start epoch and the moment the ID was generated; 2^41 milliseconds lasts about 69 years.
10-bit worker machine ID: Typically 5-bit data center ID + 5-bit machine ID, supporting up to 32 data centers with 32 machines each (1024 nodes in total).
12-bit sequence number: Sequence number for different IDs generated within the same millisecond, supporting 4096 IDs per machine per millisecond.
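
The bit layout above can be checked with a short sketch. The function names pack and unpack are hypothetical, and the field widths are the ones listed above; the arithmetic at the end reproduces the 69-year and 4096-per-millisecond figures.

```python
TIMESTAMP_BITS = 41
DATACENTER_BITS = 5
MACHINE_BITS = 5
SEQUENCE_BITS = 12

# Each field is shifted left past all fields to its right.
MACHINE_SHIFT = SEQUENCE_BITS
DATACENTER_SHIFT = MACHINE_SHIFT + MACHINE_BITS
TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS


def pack(timestamp_ms, datacenter_id, machine_id, sequence):
    """Combine the four fields into one 64-bit integer (sign bit stays 0)."""
    return (
        (timestamp_ms << TIMESTAMP_SHIFT)
        | (datacenter_id << DATACENTER_SHIFT)
        | (machine_id << MACHINE_SHIFT)
        | sequence
    )


def unpack(snowflake_id):
    """Split an ID back into (timestamp_ms, datacenter_id, machine_id, sequence)."""
    return (
        snowflake_id >> TIMESTAMP_SHIFT,
        (snowflake_id >> DATACENTER_SHIFT) & ((1 << DATACENTER_BITS) - 1),
        (snowflake_id >> MACHINE_SHIFT) & ((1 << MACHINE_BITS) - 1),
        snowflake_id & ((1 << SEQUENCE_BITS) - 1),
    )


# Capacity implied by the field widths.
years = (1 << TIMESTAMP_BITS) / (1000 * 60 * 60 * 24 * 365)
ids_per_ms = 1 << SEQUENCE_BITS
print(f"{years:.1f} years, {ids_per_ms} IDs per millisecond per machine")
```
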
Advantages:
Generated locally, extremely high performance.
IDs increase roughly with time (trend-increasing, though not consecutive), which is beneficial for database indexing.
The bit length of each field can be adjusted to business needs.
Disadvantages:
Strong dependence on machine clock; if clock rollback occurs, it may lead to duplicate IDs (requires handling clock rollback issues within the algorithm).
Worker machine IDs need to be pre-configured to ensure global uniqueness.
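
A minimal generator sketch is shown below. It is not Twitter's original code: the class name SnowflakeGenerator, the EPOCH_MS value (2020-01-01 UTC), and the injectable clock parameter are assumptions made for the example. It handles clock rollback in the simplest way, by refusing to issue an ID; real deployments may instead wait for the clock to catch up or use other strategies.

```python
import threading
import time

EPOCH_MS = 1577836800000  # assumed custom epoch: 2020-01-01T00:00:00Z


class SnowflakeGenerator:
    def __init__(self, datacenter_id, machine_id,
                 clock=lambda: int(time.time() * 1000)):
        if not 0 <= datacenter_id < 32 or not 0 <= machine_id < 32:
            raise ValueError("datacenter_id and machine_id must each fit in 5 bits")
        self._worker_bits = (datacenter_id << 17) | (machine_id << 12)
        self._clock = clock  # injectable so rollback can be simulated
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock moved backwards: refuse rather than risk duplicates.
                raise RuntimeError(
                    f"clock moved backwards by {self._last_ms - now} ms"
                )
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4096 IDs already issued this millisecond: spin to the next.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | self._worker_bits | self._sequence


gen = SnowflakeGenerator(datacenter_id=1, machine_id=2)
ids = [gen.next_id() for _ in range(5)]
print(ids)
```

IDs from a single generator are strictly increasing; across machines they are only roughly ordered by time, which is the trend-increasing property mentioned above.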

6. Solution Comparison and Selection Recommendations

Small-scale, low-concurrency systems: Consider UUID or the database auto-increment ID table.
Medium to high-concurrency systems requiring ordered IDs: Segment mode or the Snowflake algorithm are better choices.
Scenarios where machine clocks are reliably synchronized, or where clock rollback can be detected and handled: The Snowflake algorithm is widely adopted for its high performance and time-ordered IDs.
Scenarios requiring higher fault tolerance and scalability: Segment mode can be combined with ZooKeeper/Etcd for dynamic segment management, enabling more flexible allocation.

By understanding the principles and applicable scenarios of these strategies, you can select the most appropriate distributed ID generation solution based on actual business requirements (such as concurrency volume, whether IDs need to be ordered, system fault tolerance requirements, etc.).