An ID is the unique identifier of a piece of data. The traditional approaches are UUIDs and database auto-increment IDs. Most Internet companies use MySQL, usually with the InnoDB storage engine because transaction support is required. As the business grows, however, the data volume increases and tables have to be sharded. After sharding, each table increments its IDs at its own pace, so ID conflicts become likely. A separate mechanism is then needed to generate unique IDs, often called distributed IDs or global IDs. Let's examine the various mechanisms for generating distributed IDs.

Database auto-increment ID

The first option is again based on database auto-increment IDs, but uses a dedicated database instance with a dedicated table:

The table structure is as follows:

CREATE DATABASE `SEQID`;

CREATE TABLE SEQID.SEQUENCE_ID (
	id bigint(20) unsigned NOT NULL auto_increment, 
	stub char(10) NOT NULL default '',
	PRIMARY KEY (id),
	UNIQUE KEY stub (stub)
) ENGINE=MyISAM;

You can use the following statements to generate and obtain an auto-increment ID:

begin;
replace into SEQUENCE_ID (stub) VALUES ('anyword');
select last_insert_id();
commit;

The stub field has no special meaning here; it exists only so that a row can be inserted, because an auto-increment ID is generated only when a row is inserted. For the insert we use replace: replace checks whether a row with the same stub value already exists, and if so deletes it before inserting, so the table stays at a single row while the auto-increment counter keeps advancing.

This mechanism requires a dedicated MySQL instance. Although it works, it falls short on performance and reliability: every time a business system needs an ID it must query the database, which is slow, and if the database instance goes offline, every business system is affected.

To solve the database reliability problem, we can use a second distributed ID generation scheme.

Database multi-master mode

If two databases form a master-slave cluster, ordinary database reliability problems can be solved. However, if the primary fails before its data has been replicated to the secondary, duplicate IDs can appear. We could instead use a dual-master cluster, in which both MySQL instances produce auto-increment IDs independently, which is also more efficient. Without further changes, though, the two instances would generate the same IDs, so each instance must be configured with a different starting value and an appropriate step size.

Configuration for mysql1:

set @@auto_increment_offset = 1;     -- start value
set @@auto_increment_increment = 2;  -- step

Configuration for mysql2:

set @@auto_increment_offset = 2;     -- start value
set @@auto_increment_increment = 2;  -- step

After the above configuration, the two MySQL instances generate the following ID sequences:

mysql1: start value 1, step 2, generated ID sequence: 1, 3, 5, 7, 9, …

mysql2: start value 2, step 2, generated ID sequence: 2, 4, 6, 8, 10, …
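The offset/step behavior can be sketched in a few lines of Java. This is a toy illustration of the scheme, not how MySQL implements auto-increment; the class name is made up for this sketch:

```java
// Each "instance" hands out IDs starting from its own offset and advancing
// by a shared step, so the two sequences never collide.
public class OffsetStepGenerator {
    private long next;        // next ID to hand out (auto_increment_offset)
    private final long step;  // shared step (auto_increment_increment)

    public OffsetStepGenerator(long offset, long step) {
        this.next = offset;
        this.step = step;
    }

    public synchronized long nextId() {
        long id = next;
        next += step;
        return id;
    }
}
```

With offsets 1 and 2 and step 2, the two generators reproduce the sequences shown above.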

In this scheme you add a dedicated distributed ID generation application, say DistributIdService, which exposes an interface for business applications to obtain IDs. When a business application needs an ID, it calls DistributIdService via RPC, and DistributIdService randomly fetches an ID from one of the two MySQL instances above.

If one of the Mysql instances goes offline, the DistributIdService will not be affected. DistributIdService can still use another Mysql to generate ids.

However, this scheme scales poorly: if two MySQL instances are not enough, adding instances to improve performance is troublesome.

Now what if I wanted to add an instance mysql3?

First, the step size of mysql1 and mysql2 must be changed to 3, and this can only be done manually, which takes time.

Second, since mysql1 and mysql2 keep incrementing the whole time, the starting value of mysql3 may need to be set much larger to leave enough time to change the step sizes of mysql1 and mysql2.

Third, it is likely that duplicate ids will occur when changing step sizes, which may require downtime to resolve.

To solve the above problems and further improve the performance of DistributIdService, we can use a third mechanism for generating distributed IDs.

Number segment mode

The idea is that DistributIdService obtains IDs from the database in batches and caches them locally, so business applications can obtain IDs far more efficiently.

For example, each time DistributIdService goes to the database it obtains a number segment, such as (1, 1000), which represents 1000 IDs. When a business application requests an ID, DistributIdService increments locally starting from 1 and returns the value, instead of querying the database on every request. Once the local counter reaches 1000, the current segment is used up and DistributIdService fetches the next segment from the database.

Therefore, we need to make changes to the database table as follows:

CREATE TABLE id_generator (
	id int(10) NOT NULL,
	current_max_id bigint(20) NOT NULL COMMENT 'current maximum id',
	increment_step int(10) NOT NULL COMMENT 'increment step',
	PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This table records the increment step size and the current maximum auto-increment ID (that is, the last value of the segment that has already been handed out). The auto-increment logic in the database is no longer needed, because it has moved into DistributIdService.
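A minimal in-memory sketch of the segment idea in Java. The class name is hypothetical, and the `fetchMaxId` callback merely simulates the database round trip that advances `current_max_id` by one step:

```java
import java.util.function.LongUnaryOperator;

// Sketch of number segment mode: fetch a whole range of IDs in one round trip,
// then increment locally until the segment is exhausted.
public class SegmentIdGenerator {
    private final long step;                    // segment size (increment_step)
    private final LongUnaryOperator fetchMaxId; // simulates "current_max_id += step" in the DB
    private long currentId = 0;                 // last ID handed out
    private long maxId = 0;                     // last ID of the current segment

    public SegmentIdGenerator(long step, LongUnaryOperator fetchMaxId) {
        this.step = step;
        this.fetchMaxId = fetchMaxId;
    }

    public synchronized long nextId() {
        if (currentId >= maxId) {               // segment exhausted: fetch the next one
            maxId = fetchMaxId.applyAsLong(step);
            currentId = maxId - step;
        }
        return ++currentId;
    }
}
```

With a step of 1000, the "database" is touched only once per 1000 IDs, which is where the efficiency gain comes from.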

This scheme no longer depends so heavily on the database: even if the database is unavailable, DistributIdService can keep serving IDs for a while. However, if DistributIdService restarts, the unused remainder of its current segment is lost, leaving gaps in the ID sequence.

To make DistributIdService highly available, it needs to be deployed as a cluster. When a business application requests an ID, it randomly picks one DistributIdService node. Since every node connects to the same database, multiple nodes may request segments at the same time, so optimistic locking is needed: add a version field to the table and fetch a segment with the following SQL:

update id_generator set current_max_id=#{newMaxId}, version=version+1 where version = #{version}

Because newMaxId is calculated in DistributIdService as oldMaxId + step, the segment is claimed if the update above succeeds.
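The retry loop around that versioned UPDATE can be sketched in Java, with an AtomicLong's compare-and-set standing in for the `WHERE version = #{version}` check. All names here are illustrative:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the optimistic-lock segment fetch: the update succeeds only if
// the row has not changed since it was read; on failure the node retries.
public class SegmentFetcher {
    private final AtomicLong currentMaxId; // stands in for the id_generator row

    public SegmentFetcher(AtomicLong currentMaxId) {
        this.currentMaxId = currentMaxId;
    }

    // Returns the claimed segment as {start, end}, inclusive.
    public long[] fetchSegment(long step) {
        while (true) {
            long oldMaxId = currentMaxId.get();  // SELECT current_max_id, version
            long newMaxId = oldMaxId + step;     // newMaxId = oldMaxId + step
            // UPDATE ... WHERE version = #{version}: succeeds only if unchanged
            if (currentMaxId.compareAndSet(oldMaxId, newMaxId)) {
                return new long[]{oldMaxId + 1, newMaxId};
            }
            // another node won the race; loop and retry with the fresh value
        }
    }
}
```

Even if two nodes race, each retry re-reads the latest maximum, so the claimed segments never overlap.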

To provide high availability at the database level, the database itself needs to be deployed in multi-master mode. To ensure that segments generated by the databases do not overlap, we reuse the earlier idea and add a starting value and step size to the table. For example, with two MySQL instances:

mysql1 generates the number segment (1, 1001] and increments through it with the sequence 1, 3, 5, 7, 9, …

mysql2 generates the number segment (2, 1002] and increments through it with the sequence 2, 4, 6, 8, 10, …

For more details, see Didi’s open source TinyId

TinyId goes one step further to improve efficiency. In the implementation above, the ID increment logic lives in DistributIdService, but it can actually be moved into the business application itself: the application then only needs to fetch number segments, and no longer calls DistributIdService on every increment.

Snowflake algorithm

The three approaches above are all based on the idea of auto-increment; the next one is the famous snowflake algorithm, Snowflake.

Snowflake approaches distributed IDs differently: as long as every machine responsible for generating IDs produces distinct IDs within each millisecond, and the machines are distinguishable from each other, the IDs are globally unique.

Snowflake is a distributed ID generation algorithm open-sourced by Twitter. Being an algorithm, it differs from the three mechanisms above in that it does not rely on a database.

The core idea is that a distributed ID is simply a long number; a long is 8 bytes, or 64 bits. The bit allocation in the original Snowflake algorithm is as follows:

The first bit is the sign part. In Java the highest bit of a long is the sign bit: 0 for positive, 1 for negative. Generated IDs should be positive, so this bit is fixed to 0.

The timestamp part is 41 bits, at millisecond granularity. Implementations generally store not the current timestamp itself but the difference (current time − a fixed start time), so that generated IDs start from a smaller value. A 41-bit millisecond timestamp lasts about 69 years: (1L << 41) / (1000L * 60 * 60 * 24 * 365) = 69.

The worker machine ID is 10 bits. This part is flexible: for example, the first five bits can identify the data center and the last five the machine within it, allowing up to 1024 nodes to be deployed.

The sequence number is 12 bits, so the same node can generate 4096 IDs within the same millisecond.

According to this logic, the algorithm only needs to be implemented once, for example in Java, and wrapped as a utility method. Business applications can then call the utility directly to get distributed IDs; each application only has to ensure it has its own worker machine ID, with no need to stand up a separate ID-generation service.
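A minimal Java sketch of the 1 + 41 + 10 + 12 layout described above. The epoch value and class name are arbitrary choices for this sketch, not Twitter's original code:

```java
// Minimal Snowflake sketch: 41-bit timestamp delta | 10-bit worker id | 12-bit sequence.
public class Snowflake {
    private static final long EPOCH = 1577836800000L; // 2020-01-01 UTC, arbitrary fixed start
    private static final long WORKER_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_WORKER = (1L << WORKER_BITS) - 1;       // 1023
    private static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1;  // 4095

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public Snowflake(long workerId) {
        if (workerId < 0 || workerId > MAX_WORKER)
            throw new IllegalArgumentException("workerId must be 0.." + MAX_WORKER);
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long ts = System.currentTimeMillis();
        if (ts < lastTimestamp)
            throw new IllegalStateException("clock moved backwards");
        if (ts == lastTimestamp) {
            sequence = (sequence + 1) & SEQUENCE_MASK;
            if (sequence == 0) {   // 4096 IDs used up this millisecond: spin to the next one
                while ((ts = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = ts;
        return ((ts - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }
}
```

IDs from a single node are strictly increasing, and nodes with different worker IDs can never collide because the worker ID is baked into every ID.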

The Snowflake algorithm is not difficult to implement, and there are Java implementations on GitHub:

Github.com/beyondfengy…

In practice, Snowflake is rarely used as-is; companies modify it, because the hardest part of the algorithm to operate is the worker machine ID. The original Snowflake requires manually assigning a machine ID to each machine and configuring where Snowflake reads that ID from.

In large companies, however, there are too many machines for manual assignment to be practical or error-free, so they adapted Snowflake.

Baidu (UID-Generator)

Github address: github.com/baidu/uid-g…

UID-Generator is based on Snowflake; the main difference is how the machine ID, called the workerId, is produced.

The workerId in UID-Generator is generated automatically by UID-Generator itself. Considering that applications are often deployed in Docker, UID-Generator also lets users define their own workerId generation strategy. The default strategy is assignment by the database at startup: when the application starts, it inserts a row (consisting of host and port) into a WORKER_NODE table that UID-Generator requires, and the auto-increment ID returned for that row becomes the machine's workerId.

In UID-Generator's default layout, the workerId occupies 22 bits, the timestamp 28 bits, and the sequence 13 bits. Note two differences from the original Snowflake: time is measured in seconds rather than milliseconds, and workerIds are not reused, so the same application consumes a new workerId every time it restarts.
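A quick sanity check of those bit budgets, assuming the default 28/22/13 split described above (the class and method names are made up for this calculation):

```java
// Back-of-the-envelope capacities of the default uid-generator layout:
// 1 sign bit + 28-bit seconds + 22-bit workerId + 13-bit sequence = 64 bits.
public class UidGeneratorBits {
    public static long secondsCapacityYears() {
        return (1L << 28) / (60L * 60 * 24 * 365);  // whole years of second-granularity timestamps
    }
    public static long workerIdCapacity() {
        return 1L << 22;                             // distinct workerIds before exhaustion
    }
    public static long idsPerSecondPerNode() {
        return 1L << 13;                             // sequence values per second per node
    }
}
```

So the default layout lasts roughly 8 years, supports about 4.19 million workerId allocations (restarts included, since workerIds are not reused), and yields 8192 IDs per second per node.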

Please refer to github.com/baidu/uid-g…

Meituan (Leaf)

Github address: Leaf

Meituan’s Leaf is also a distributed ID generation framework. It is quite comprehensive, supporting both number segment mode and Snowflake mode. The segment mode is not covered here, as it is similar to the analysis above.

The difference between Leaf's Snowflake mode and the original Snowflake algorithm also lies mainly in workerId generation. In Leaf, the workerId is based on a ZooKeeper sequential ID: when an application using Leaf-Snowflake starts up, it creates a sequential node in ZooKeeper, and that node's sequence number serves as the machine's workerId.

Summary

In general, the two frameworks above generate workerIds automatically, which makes the system more stable and reduces manual intervention.

Redis

Redis's incr command atomically increments a key and returns the result, which can be used to generate IDs. For example:

127.0.0.1:6379> set seq_id 1     // initialize the ID to 1
OK
127.0.0.1:6379> incr seq_id      // increment by 1 and return
(integer) 2
127.0.0.1:6379> incr seq_id      // increment by 1 and return
(integer) 3

Redis is very efficient, but persistence must be considered. Redis supports two forms of persistence: RDB and AOF.

RDB persistence takes periodic snapshots. If Redis keeps incrementing the counter after a snapshot and then crashes before the next one, restarting from that snapshot rolls the counter back and produces duplicate IDs.

AOF persistence logs every write command, so duplicate IDs will not occur after a failure; however, replaying a long history of incr commands makes restart and data recovery slow.