background

In most business scenarios, we typically need to assign a unique ID to each piece of data as an identity. Most relational databases provide auto-increment functionality to support this requirement. If the amount of data is large, the scenario of database and table division is required. In this case, the auto-increment function provided by each database instance cannot be used because it cannot be unique in all instances. Therefore, the requirement of distributed global unique ID arises

Distributed globally unique ids must meet the following requirements:

  1. Global uniqueness: Unique in a service scenario to avoid data conflicts
  2. High performance: Build fast without blocking business processes
  3. Increasing by trend: This ID is usually used as the primary key of the database. Because mysql innoDB uses clustered indexes, if the newly added primary key is out of order, it may cause leaf splitting and low space utilization, reducing write performance
  4. Strict monotonically increasing: Applies to scenarios that require strict monotonically increasing, such as sorting by ID
  5. Security: to prevent leakage of business information, for example, if the generated ID is strictly increasing, the order quantity can be calculated according to the ID difference over a period of time in the e-market scene

Here are some common ID generation strategies:

UUID

The Universally Unique Identifier (UUID) is the Universally Unique Identifier. Is a 128-bit number in the following format:

// The hexadecimal format is xxxxXXXX-XXXX-XXXX-XXXXXXXXXXXXXXXXCopy the code

Historically, there have been multiple versions of the algorithm, which can be viewed as guaranteeing uniqueness

Some versions are guaranteed to be unique based on namespaces, while some versions are guaranteed to be unique based on random number generation, but the probability of the same UUID occurring is very small. Take java.util.UUID as an example, 1 billion UUID are generated every second, and the probability of only one repetition is 50% after 100 years

Both are generated locally, so the efficiency is high. The downside to

  1. No trend increasing feature: Poor insert performance when acting as a database primary key
  2. Wide data: usually, the UUID is 128bit. The mysql bigint cannot be used to store data, but varchar must be used, which affects the performance

The database auto-increment key

In mysql, for example, we can create a special table and use its auto-increment key to generate unique ids

The table structure is as follows:

CREATE TABLE unique_id (

    id bigint(20) unsigned NOT NULL auto_increment, 

    value char(20) NOT NULL default ' '.PRIMARY KEY (id),

    UNIQUE KEY unique_v(value) 

) ENGINE=MyISAM;
Copy the code

And get the ID using the following statement

begin; 

replace into unique_id (value) VALUES ('placeword'); 

select last_insert_id(); 

commit;
Copy the code

Replace is used instead of INSERT to ensure that there is only one record for the entire table, because you don’t need extra records to generate a self-incrementing ID

This method makes use of the self-increment primary key of the database to ensure the uniqueness of the generated ID, and strictly monotonically increases. Disadvantages:

  1. Performance problem: Each id generation requires database remote I/OS, which may be a performance bottleneck in high concurrency scenarios
  2. Availability problem: Only one instance is used, there is a single point of problem, availability is not guaranteed

For problem 2, you can introduce multiple mysql instances, each with a different initial time-incrementing step for the table

Take two mysql instances as an example, perform the following configurations:

Mysql1:

set @@auto_increment_offset = 1; Starting values -

set @@auto_increment_increment = 2; - step
Copy the code

Mysql2:

set @@auto_increment_offset = 2; Starting values -

set @@auto_increment_increment = 2; - step
Copy the code

So the id sequence generated by mysql1 is:

1,3,5,7,9...Copy the code

Mysql2 to:

2,4,6,8...Copy the code

When requests come in, these instances are requested in a random or polling way, so that the id sequence is generally increasing in trend, which not only reduces the access pressure of single instance, but also improves the availability. The disadvantages of this strategy are as follows:

  1. Performance issues: There is still a remote database IO for every ID generated
  2. Scalability issues: When you need to scale more machines, you need to adjust the step sizes of all previous instances, and you need to ensure that the ids generated during the replay do not conflict, which is more difficult to implement

redis

To solve the performance problem of database auto-increment, redis incr command can be used to generate non-repeating increment ids. Compared with the database scheme, this strategy changes from remote disk I/O to remote memory I/O, and the performance is improved to some extent. However, it takes some effort to ensure uniqueness

To discuss the various persistence strategies in turn:

If redis persistence is not enabled, the generated ID will be lost after the REDis is down, which may cause ID duplication

If the RDB or AOF persistence mode is not AOF_FSYNC_ALWAYS, the ID of the latest period may be lost, and ID duplication may occur

If the AOF_FSYNC_ALWAYS mode of AOF is enabled, the problem of ID duplication will not occur even in the current situation, but the performance will be reduced, which has no great advantage over the database scheme

Them roughly mode

Database in order to solve the key and redis strategy, every time get ID need to be the problem of the remote request, here introduce them roughly mode, that is retrieved from the database each time ID range, as a them roughly loaded into memory, so that every time when don’t need to generate a unique ID is retrieved from the database, but from a local memory, greatly improve the performance

Take mysql as an example, the database table structure is as follows:

CREATE TABLE unique_id ( 

    id bigint(20) NOT NULL, 

    max_id bigint(20) NOT NULL COMMENT 'Current maximum ID', 

    step int(10) NOT NULL COMMENT 'Range length of number segment'.PRIMARY KEY (`id`) 

) ENGINE=InnoDB DEFAULT
Copy the code

Each time a number segment is fetched, the following SQL is executed:

update unique_id set max_id ={max_id} + step  where max_id = {max_id} 
Copy the code

If affectRows is equal to 1, only the current instance can obtain the number segment of [max_id,max_id + step), and then the memory can happily send the number segment. If affectRows is not equal to 1, another instance has obtained the number segment, so you need to try again

The advantages of segment mode are as follows:

  1. High performance: In most cases, numbers are directly sent in memory without remote requests. The larger the number segment is, the lower the proportion of remote requests is
  2. High availability: If the database is down, the number segment obtained before can be used to send numbers. The larger the number segment is, the longer the number segment lasts
  3. Increasing trend

Its disadvantages are as follows:

  1. Number segment waste: If an instance is restarted before the number segment is exhausted, the remaining ID of the number segment is wasted. The solution is a smaller segment length, but as described in the benefits, this reduces performance and availability, so you need to choose a moderate segment length
  2. Not smooth: When the number segment is used up, a request is made to the database. If the network jitter occurs, the response to the request is slow. To resolve this situation, you can asynchronously request the database to fetch the next segment as soon as it is about to run out, rather than waiting until it is. In this way, the request database obtains the next number segment and issues the current number segment in parallel, reducing the probability of slow requests
  3. Availability problem: The same single point problem as the database auto-key strategy. Multiple examples are used to solve azimuth. Here, 2 examples are used as an example:

Each time a certain range of number segments are generated, and after the generation, the next start ID is set to:

Start ID + (range * number of instances)Copy the code

For example, the start ID of mysql1 is 0 and the start ID of mysql2 is 1000. When the number segment is generated next time, change the start ID of mysql to 2000 and mysql2 to 3000, and then set the start ID of mysql2 to 4000 and 5000 respectively

This ensures the uniqueness of ID, reduces the access pressure of a single database instance, and improves the availability

Snowflakes algorithm

The algorithms described above all rely on auto-increment. However, in some business scenarios, it is necessary to ensure the irregularity of ID generation. In this case, the Snowflake algorithm is often used

The Snowflake algorithm uses a 64-bit number to represent a unique ID, and how each of those 64-bit bits are used is the essence of the algorithm:

  • [0:0] 1 Symbol bit: The ID bit is usually an integer. Therefore, the value of this bit is 0
  • [1:41] 41 bit Time: This is usually used to indicate the time difference (current time – service start time) rather than the timestamp relative to 1970, which can support longer time, if the 41-bit timestamp is in milliseconds, it can support approximately (1 << 41)/(1000 * 60 * 60 * 24 * 365) = 69 years
  • [42:51] 10 bits for machines: Of course, the first bits for the machine room and the last bits for each machine room
  • [52:63] 12-bit auto-increment sequence number: Indicates the ID sequence number generated on a machine in a millisecond (if the time unit is millisecond)

The bit allocation can be adjusted according to different services. For example, if the number of machines is not so large that 10 bits are not needed, the time bit can be increased to support a longer time range. Or if the number of concurrent services is not high, you can change the time unit to second and use the saved bits for other meanings

As long as the machine ID of each instance is different, the generated ID must be different from one machine to another because its [42:51] bits are different

Its advantages are:

  1. Independent of database, good performance and availability

Disadvantages:

  1. The generation of ids is strongly dependent on the server clock. If clock rollback occurs, the generated ids may conflict with those previously generated ids

Clock rollback: The hardware clock may be incorrect for various reasons. The NETWORK provides the NTP service to calibrate the time. During the calibration, the clock may jump or dial back

  1. The 10-digit machine number is difficult to specify, so it is best not to specify it manually, but to obtain it automatically by instance

Clock rollback can be discussed in two cases:

  • During instance running clock callback: can record last time stamp in the memory, at this time if the timestamp is smaller than last time, that happened to the clock back, can wait for a period of time and then to ID generated, if back to dial the amplitude is larger, can choose to continue to wait for, or to the upper error, because in a short period of time could not generate the correct ID.

It can also be completely independent of system time, such as Baidu’s UID-Generator, which uses an atomic variable that is added each time to generate the next time bit

  • Clock rollback occurred during instance restart: The last timestamp cannot be retrieved from memory, so it needs to be placed in external storage. The solution of Meituan Leaf is to report the current timestamp to ZooKeeper every 3s. In this way, when the instance restarts, the clock can be determined whether the clock is rolled back

For the machine number generation difficulty problem: there are the following solutions:

Using ZooKeeper: When each instance is started, a node is created under ZooKeeper and its node id is used as the machine ID. Zookeeper ensures that the node id is unique

Mysql: insert a record into the database table when the instance is started, and use the auto-increment primary key as the machine ID, which can also ensure the uniqueness of the machine ID

conclusion

This paper introduces several scenarios of distributed globally unique ID schemes, analyzes the advantages and disadvantages of each scheme, and explains its solutions to the disadvantages