background
In most business scenarios, we typically need to assign a unique ID to each piece of data as an identity. Most relational databases provide auto-increment functionality to support this requirement. If the amount of data is large, the scenario of database and table division is required. In this case, the auto-increment function provided by each database instance cannot be used because it cannot be unique in all instances. Therefore, the requirement of distributed global unique ID arises
Distributed globally unique ids must meet the following requirements:
- Global uniqueness: Unique in a service scenario to avoid data conflicts
- High performance: Build fast without blocking business processes
- Increasing by trend: This ID is usually used as the primary key of the database. Because mysql innoDB uses clustered indexes, if the newly added primary key is out of order, it may cause leaf splitting and low space utilization, reducing write performance
- Strict monotonically increasing: Applies to scenarios that require strict monotonically increasing, such as sorting by ID
- Security: to prevent leakage of business information, for example, if the generated ID is strictly increasing, the order quantity can be calculated according to the ID difference over a period of time in the e-market scene
Here are some common ID generation strategies:
UUID
The Universally Unique Identifier (UUID) is the Universally Unique Identifier. Is a 128-bit number in the following format:
// The hexadecimal format is xxxxXXXX-XXXX-XXXX-XXXXXXXXXXXXXXXXCopy the code
Historically, there have been multiple versions of the algorithm, which can be viewed as guaranteeing uniqueness
Some versions are guaranteed to be unique based on namespaces, while some versions are guaranteed to be unique based on random number generation, but the probability of the same UUID occurring is very small. Take java.util.UUID as an example, 1 billion UUID are generated every second, and the probability of only one repetition is 50% after 100 years
Both are generated locally, so the efficiency is high. The downside to
- No trend increasing feature: Poor insert performance when acting as a database primary key
- Wide data: usually, the UUID is 128bit. The mysql bigint cannot be used to store data, but varchar must be used, which affects the performance
The database auto-increment key
In mysql, for example, we can create a special table and use its auto-increment key to generate unique ids
The table structure is as follows:
CREATE TABLE unique_id (
id bigint(20) unsigned NOT NULL auto_increment,
value char(20) NOT NULL default ' '.PRIMARY KEY (id),
UNIQUE KEY unique_v(value)
) ENGINE=MyISAM;
Copy the code
And get the ID using the following statement
begin;
replace into unique_id (value) VALUES ('placeword');
select last_insert_id();
commit;
Copy the code
Replace is used instead of INSERT to ensure that there is only one record for the entire table, because you don’t need extra records to generate a self-incrementing ID
This method makes use of the self-increment primary key of the database to ensure the uniqueness of the generated ID, and strictly monotonically increases. Disadvantages:
- Performance problem: Each id generation requires database remote I/OS, which may be a performance bottleneck in high concurrency scenarios
- Availability problem: Only one instance is used, there is a single point of problem, availability is not guaranteed
For problem 2, you can introduce multiple mysql instances, each with a different initial time-incrementing step for the table
Take two mysql instances as an example, perform the following configurations:
Mysql1:
set @@auto_increment_offset = 1; Starting values -
set @@auto_increment_increment = 2; - step
Copy the code
Mysql2:
set @@auto_increment_offset = 2; Starting values -
set @@auto_increment_increment = 2; - step
Copy the code
So the id sequence generated by mysql1 is:
1,3,5,7,9...Copy the code
Mysql2 to:
2,4,6,8...Copy the code
When requests come in, these instances are requested in a random or polling way, so that the id sequence is generally increasing in trend, which not only reduces the access pressure of single instance, but also improves the availability. The disadvantages of this strategy are as follows:
- Performance issues: There is still a remote database IO for every ID generated
- Scalability issues: When you need to scale more machines, you need to adjust the step sizes of all previous instances, and you need to ensure that the ids generated during the replay do not conflict, which is more difficult to implement
redis
To solve the performance problem of database auto-increment, redis incr command can be used to generate non-repeating increment ids. Compared with the database scheme, this strategy changes from remote disk I/O to remote memory I/O, and the performance is improved to some extent. However, it takes some effort to ensure uniqueness
To discuss the various persistence strategies in turn:
If redis persistence is not enabled, the generated ID will be lost after the REDis is down, which may cause ID duplication
If the RDB or AOF persistence mode is not AOF_FSYNC_ALWAYS, the ID of the latest period may be lost, and ID duplication may occur
If the AOF_FSYNC_ALWAYS mode of AOF is enabled, the problem of ID duplication will not occur even in the current situation, but the performance will be reduced, which has no great advantage over the database scheme
Them roughly mode
Database in order to solve the key and redis strategy, every time get ID need to be the problem of the remote request, here introduce them roughly mode, that is retrieved from the database each time ID range, as a them roughly loaded into memory, so that every time when don’t need to generate a unique ID is retrieved from the database, but from a local memory, greatly improve the performance
Take mysql as an example, the database table structure is as follows:
CREATE TABLE unique_id (
id bigint(20) NOT NULL,
max_id bigint(20) NOT NULL COMMENT 'Current maximum ID',
step int(10) NOT NULL COMMENT 'Range length of number segment'.PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT
Copy the code
Each time a number segment is fetched, the following SQL is executed:
update unique_id set max_id ={max_id} + step where max_id = {max_id}
Copy the code
If affectRows is equal to 1, only the current instance can obtain the number segment of [max_id,max_id + step), and then the memory can happily send the number segment. If affectRows is not equal to 1, another instance has obtained the number segment, so you need to try again
The advantages of segment mode are as follows:
- High performance: In most cases, numbers are directly sent in memory without remote requests. The larger the number segment is, the lower the proportion of remote requests is
- High availability: If the database is down, the number segment obtained before can be used to send numbers. The larger the number segment is, the longer the number segment lasts
- Increasing trend
Its disadvantages are as follows:
- Number segment waste: If an instance is restarted before the number segment is exhausted, the remaining ID of the number segment is wasted. The solution is a smaller segment length, but as described in the benefits, this reduces performance and availability, so you need to choose a moderate segment length
- Not smooth: When the number segment is used up, a request is made to the database. If the network jitter occurs, the response to the request is slow. To resolve this situation, you can asynchronously request the database to fetch the next segment as soon as it is about to run out, rather than waiting until it is. In this way, the request database obtains the next number segment and issues the current number segment in parallel, reducing the probability of slow requests
- Availability problem: The same single point problem as the database auto-key strategy. Multiple examples are used to solve azimuth. Here, 2 examples are used as an example:
Each time a certain range of number segments are generated, and after the generation, the next start ID is set to:
Start ID + (range * number of instances)Copy the code
For example, the start ID of mysql1 is 0 and the start ID of mysql2 is 1000. When the number segment is generated next time, change the start ID of mysql to 2000 and mysql2 to 3000, and then set the start ID of mysql2 to 4000 and 5000 respectively
This ensures the uniqueness of ID, reduces the access pressure of a single database instance, and improves the availability
Snowflakes algorithm
The algorithms described above all rely on auto-increment. However, in some business scenarios, it is necessary to ensure the irregularity of ID generation. In this case, the Snowflake algorithm is often used
The Snowflake algorithm uses a 64-bit number to represent a unique ID, and how each of those 64-bit bits are used is the essence of the algorithm:
- [0:0] 1 Symbol bit: The ID bit is usually an integer. Therefore, the value of this bit is 0
- [1:41] 41 bit Time: This is usually used to indicate the time difference (current time – service start time) rather than the timestamp relative to 1970, which can support longer time, if the 41-bit timestamp is in milliseconds, it can support approximately (1 << 41)/(1000 * 60 * 60 * 24 * 365) = 69 years
- [42:51] 10 bits for machines: Of course, the first bits for the machine room and the last bits for each machine room
- [52:63] 12-bit auto-increment sequence number: Indicates the ID sequence number generated on a machine in a millisecond (if the time unit is millisecond)
The bit allocation can be adjusted according to different services. For example, if the number of machines is not so large that 10 bits are not needed, the time bit can be increased to support a longer time range. Or if the number of concurrent services is not high, you can change the time unit to second and use the saved bits for other meanings
As long as the machine ID of each instance is different, the generated ID must be different from one machine to another because its [42:51] bits are different
Its advantages are:
- Independent of database, good performance and availability
Disadvantages:
- The generation of ids is strongly dependent on the server clock. If clock rollback occurs, the generated ids may conflict with those previously generated ids
Clock rollback: The hardware clock may be incorrect for various reasons. The NETWORK provides the NTP service to calibrate the time. During the calibration, the clock may jump or dial back
- The 10-digit machine number is difficult to specify, so it is best not to specify it manually, but to obtain it automatically by instance
Clock rollback can be discussed in two cases:
- During instance running clock callback: can record last time stamp in the memory, at this time if the timestamp is smaller than last time, that happened to the clock back, can wait for a period of time and then to ID generated, if back to dial the amplitude is larger, can choose to continue to wait for, or to the upper error, because in a short period of time could not generate the correct ID.
It can also be completely independent of system time, such as Baidu’s UID-Generator, which uses an atomic variable that is added each time to generate the next time bit
- Clock rollback occurred during instance restart: The last timestamp cannot be retrieved from memory, so it needs to be placed in external storage. The solution of Meituan Leaf is to report the current timestamp to ZooKeeper every 3s. In this way, when the instance restarts, the clock can be determined whether the clock is rolled back
For the machine number generation difficulty problem: there are the following solutions:
Using ZooKeeper: When each instance is started, a node is created under ZooKeeper and its node id is used as the machine ID. Zookeeper ensures that the node id is unique
Mysql: insert a record into the database table when the instance is started, and use the auto-increment primary key as the machine ID, which can also ensure the uniqueness of the machine ID
conclusion
This paper introduces several scenarios of distributed globally unique ID schemes, analyzes the advantages and disadvantages of each scheme, and explains its solutions to the disadvantages