Background

In everyday business development we often need to assign unique identifiers to data: giving every crawled article a unique ID when it enters the library, generating order numbers for users' orders, and so on. When concurrency is low, the unique ID is usually the database's auto-increment primary key; when concurrency is high, distributed ID generation schemes come into play. Because of some special business requirements we also generate distributed IDs in our own services, and we surveyed the various schemes available. Here I would like to share our research and the optimizations we made along the way. For length, the material is split into three parts:

[Distributed Series 01] Analysis of common distributed ID generation schemes and research on large companies' schemes

Principle analysis of the open-source distributed ID generation frameworks Leaf and UID-Generator

Problems and optimizations of Leaf and UID-Generator

Abstract

This article is the first part, Distributed Series 01, and covers two topics: an introduction to distributed ID generation schemes, and research into the ID generation schemes currently used by large companies.

I. Introduction to distributed ID generation Schemes

Currently, the commonly used distributed ID generation schemes are mainly as follows:

1. Use the UUID algorithm to generate unique IDs.

2. Use the auto-increment primary key of a single database to generate unique IDs.

3. Use the auto-increment primary keys of multiple databases to generate unique IDs. (Different starting offsets with a shared step size distinguish the databases.)

4. Use database segment allocation to generate unique IDs. (For example, the segment mode in Meituan's Leaf framework.)

5. Generate unique IDs based on the Snowflake algorithm. (For example, the snowflake mode in Meituan's Leaf framework, or Baidu's UID-Generator.)

Here’s my summary:

| Scheme | Advantages | Disadvantages | Applicable scenario |
| --- | --- | --- | --- |
| UUID algorithm | No external dependencies | ID is too long and not numeric | Generating session IDs |
| Single-database auto-increment primary key | Easy to adopt; monotonically increasing | Low throughput; strong database dependence; IDs are consecutive | Services with low concurrency |
| Multi-database auto-increment primary keys | Easy to adopt; monotonically increasing; higher throughput than a single database | Hard to scale out; strong database dependence; IDs are consecutive | Generating IDs for sharded databases and tables |
| Database segment allocation | High throughput | Strong database dependence; IDs are consecutive | High-concurrency services where consecutive IDs do not threaten information security |
| Snowflake-based generation | High throughput; runs without other components | Uneven ID distribution may cause data skew in some workloads | Services with highly concurrent ID generation |

1. UUID

In simple terms, a UUID is a unique ID generated locally from the current time, a counter, a hardware identifier, and so on, with no external dependency (whereas Snowflake-style solutions, for example, need a registry).

Advantages

No external dependencies

Every other scheme depends on something: single-database auto-increment depends heavily on the database, Snowflake-style solutions usually need at least a registry, and the Leaf framework's snowflake mode uploads timestamps to a registry. UUID generation needs no external dependency at all.

Disadvantages

The ID is too long and not numeric

The price of dependency-free uniqueness is that the result is typically a 32-character string. Because it is long and non-numeric, it is a poor choice for a database primary key.

(With a string primary key, rows are inserted at random positions in the clustered index, which easily causes page splits. String comparison is also more expensive than numeric comparison, so queries on a string primary key are slower than on a numeric one.)

Applicable scenario

UUIDs are typically used as temporary unique identifiers. For example, after a user logs in, a UUID can be generated as the login session ID and stored in Redis as the key, with the user's information as the value.
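As a quick sketch of this pattern (the Redis write is stubbed out with a plain dict; the names are illustrative, not from any real system):

```python
import uuid

def new_session_id() -> str:
    # uuid4 is generated locally from random bits: no registry,
    # no database, no coordination with other machines.
    return uuid.uuid4().hex  # 32-character hex string

# Hypothetical session store standing in for Redis:
# key = session id, value = user-related information.
sessions = {}
sid = new_session_id()
sessions[sid] = {"user_id": 42, "name": "alice"}
```

The length and randomness come for free; the trade-off, as noted above, is that `sid` is a 32-character string, not a compact integer.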

2. Single-database auto-increment primary key

This scheme is commonly used to generate IDs when traffic is low.

Advantages

Easy to adopt

Components such as ZooKeeper may not exist in a typical project, but a database almost always does, so adoption is simple and there is no extra maintenance cost.

Monotonically increasing

IDs are strictly monotonically increasing: on the timeline, an ID generated later is always larger than one generated earlier.

Disadvantages

Limited ID generation throughput

Because ID generation depends on a single database's auto-increment primary key, it cannot serve businesses that need large numbers of IDs concurrently.

Strong database dependence

If the database goes down, IDs cannot be generated and the whole system becomes unavailable. With a master-slave setup, if the master fails and traffic switches to a slave that has not yet replicated the master's latest inserts, the slave's current auto-increment value may be stale and duplicate IDs can be generated.

IDs are consecutive

Consecutive IDs can themselves be a weakness. A competitor can place an order at 12:00 one day and again at 12:00 the next day, and estimate your daily order volume from the difference between the two order IDs. Maoyan Movie (Cat's Eye) generates its movie IDs this way. To make IDs harder to guess, they generally should not start from 1, but Maoyan's start at 1 and are consecutive, so a quick binary search identified 8875 as the largest valid ID, i.e. 8875 movies in total; and because the IDs are consecutive, the whole catalogue is easy to crawl. Maoyan movie IDs are generated by a single database with consecutive auto-increment.

https://maoyan.com/films/1 through https://maoyan.com/films/8875 return data; https://maoyan.com/films/8876 and larger IDs return nothing, so there are 8875 movies in total.

A mitigation is to periodically run ALTER TABLE users AUTO_INCREMENT = A + B; (where A is the current value and B a random offset) to move the auto-increment starting value forward, i.e. to skip some IDs.

Applicable scenario

This method is applicable to services with low concurrency.

3. Auto-increment primary keys of multiple databases generate unique IDs. (Different starting offsets with a shared step size distinguish the databases.)

MySQL lets you set a table's auto-increment starting value, and the increment step via the auto_increment_increment system variable. Give every database the same step size but a different starting offset, and the IDs generated by the different databases stay distinct.

CREATE TABLE <table name> (...) AUTO_INCREMENT = n;  -- starting value at creation
ALTER TABLE <table name> AUTO_INCREMENT = 2;         -- change the starting value
SET @@auto_increment_increment = N;                  -- step size (N = number of databases)

For example, set the step size equal to the number of databases N. If the first database starts at 0, it generates 0, N, 2N, 3N, and so on.

The second database starts at 1, so it generates 1, N+1, 2N+1, 3N+1, and so on. The IDs generated by the databases never collide, and each database can issue IDs independently.
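The offset-plus-step rule can be sketched directly (illustrative only, not tied to any MySQL configuration):

```python
def db_ids(offset: int, n_databases: int, count: int) -> list[int]:
    """IDs issued by the database that starts at `offset`,
    when every database uses step size N = n_databases."""
    return [offset + i * n_databases for i in range(count)]

# Three databases, starting at 0, 1 and 2 respectively.
N = 3
streams = [db_ids(offset, N, 5) for offset in range(N)]
# streams[0] == [0, 3, 6, 9, 12]
# streams[1] == [1, 4, 7, 10, 13]
# streams[2] == [2, 5, 8, 11, 14]
```

Because each database only ever emits its own residue class modulo N, the three streams can never collide; the cost is that N is frozen once chosen.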

Advantages

Generating IDs is faster than with a single database, because several databases can issue numbers at the same time.

Disadvantages

This fixes the limitation of incrementing by 1 on one database, but once the step size is chosen it is hard to add databases, because the number of participating databases is baked into the step.

Usage scenarios

Before database/table sharding frameworks existed, sharded tables typically obtained their IDs with this scheme.

4. Database segment allocation generates unique IDs. (For example, the segment mode in Meituan's Leaf framework.)

In simple terms, as in the figure below, a database table acts as the number dispenser. Its fields are:

biz_tag — distinguishes the business each batch of IDs is requested for
max_id — the largest ID handed out so far
step — the number of IDs claimed in each batch

Each time, the ID generation service runs the following statement to claim step IDs from the database and advance max_id. The claimed IDs are held in memory and handed to business callers over HTTP, RPC, a client SDK, and so on.

UPDATE leaf_alloc SET max_id = max_id + step WHERE biz_tag = #{tag}
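The mechanics can be sketched in a few lines, with an in-memory dict standing in for the leaf_alloc table (in real Leaf this is a MySQL row; all other names here are illustrative):

```python
# Stand-in for the leaf_alloc table; Leaf updates the real row with:
# UPDATE leaf_alloc SET max_id = max_id + step WHERE biz_tag = #{tag}
leaf_alloc = {"order": {"max_id": 0, "step": 1000}}

def fetch_segment(biz_tag: str) -> tuple[int, int]:
    """Claim the next `step` IDs; return the inclusive usable range."""
    row = leaf_alloc[biz_tag]
    row["max_id"] += row["step"]          # the UPDATE statement above
    return row["max_id"] - row["step"] + 1, row["max_id"]

class IdAllocator:
    """Hands out IDs from the current in-memory segment, fetching a
    new segment from the 'database' only when the old one runs out."""
    def __init__(self, biz_tag: str):
        self.biz_tag = biz_tag
        self.next_id, self.segment_end = 0, -1  # start with an empty segment

    def next(self) -> int:
        if self.next_id > self.segment_end:
            self.next_id, self.segment_end = fetch_segment(self.biz_tag)
        nid = self.next_id
        self.next_id += 1
        return nid

alloc = IdAllocator("order")
# The first 1000 calls are served from memory after a single database
# round trip; call 1001 triggers the second segment fetch.
```

This is why the scheme survives a short database outage: only one UPDATE per step IDs touches the database.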

Advantages

High throughput

ID generation throughput depends on the step size, and is not capped by the database the way plain auto-increment is.

Disadvantages

Strong database dependence

After a database outage the ID service can keep running for a while on the IDs still cached in memory, but once those run out the whole system becomes unavailable.

IDs are consecutive

Easy to crawl, and competitors can infer a day's order volume.

5. Generate a unique ID based on the Snowflake algorithm

Snowflake is Twitter's open-source distributed ID generation algorithm. An ID is 64 bits:

The first bit is the sign bit and is always 0, because in binary a leading 0 marks a positive number.

The next 41 bits are a millisecond timestamp; 41 bits of milliseconds last until about September 2039 (counting from 1970).

The next 10 bits are the worker ID, identifying the generating server.

The last 12 bits are the sequence number.

So each machine can generate at most 2^12 = 4096 IDs per millisecond, i.e. more than 4 million IDs per second.
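A minimal sketch of the bit layout described above (the epoch constant is Twitter's original; a production implementation also handles clock rollback and per-millisecond sequence exhaustion, which are omitted here):

```python
TWEPOCH = 1288834974657  # Twitter's custom epoch (2010-11-04), in milliseconds

def snowflake_id(ts_ms: int, worker_id: int, sequence: int) -> int:
    """Pack 41 timestamp bits, 10 worker bits and 12 sequence bits
    below a zero sign bit."""
    assert 0 <= worker_id < 1024 and 0 <= sequence < 4096
    return ((ts_ms - TWEPOCH) << 22) | (worker_id << 12) | sequence

def unpack(sid: int) -> tuple[int, int, int]:
    ts_ms = (sid >> 22) + TWEPOCH
    worker_id = (sid >> 12) & 0x3FF  # 10 bits
    sequence = sid & 0xFFF           # 12 bits
    return ts_ms, worker_id, sequence

sid = snowflake_id(TWEPOCH + 123, worker_id=5, sequence=7)
# unpack(sid) == (TWEPOCH + 123, 5, 7)
```

Because the timestamp occupies the high bits, IDs are roughly time-ordered across machines, which is what later makes their distribution uneven when generation is bursty.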

Advantages

High throughput

ID generation is fast: a single machine can produce up to 2^12 = 4096 IDs in 1 ms.

No dependency on other components

An ID is assembled from the timestamp, worker ID, and sequence number, so IDs can be generated independently from the local system clock, with no additional components.

Disadvantages

Prone to data skew

For example, suppose a table contains these 10 IDs:

0, 25, 26, 27, 28, 29, 30, 31, 32, 100

The minimum ID is 0 and the maximum is 100. Splitting the range (max − min) into four equal segments gives:

[0, 25): 1 ID, value 0
[25, 50): 8 IDs, values 25, 26, 27, 28, 29, 30, 31, 32
[50, 75): 0 IDs
[75, 100]: 1 ID, value 100

This creates obvious data skew: the interval [25, 50) holds most of the IDs while the other intervals hold almost none. When Sqoop imports MySQL data into Hive, it splits the range between the minimum and maximum ID into shards and imports each shard on its own thread. If the data are uneven, threads with little data finish quickly while threads with a lot run slowly, and the total import time is determined by the slowest thread.
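The split used in the example above can be reproduced directly (this mirrors how a min/max range is cut into equal shards; it is an illustration, not Sqoop's actual code):

```python
ids = [0, 25, 26, 27, 28, 29, 30, 31, 32, 100]

def bucket_counts(values: list[int], n_buckets: int) -> list[int]:
    """Count how many values fall into each of n equal-width
    slices of [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # the last bucket is closed on the right, like [75, 100]
        i = min(int((v - lo) / width), n_buckets - 1)
        counts[i] += 1
    return counts

# [0,25) -> 1 id, [25,50) -> 8 ids, [50,75) -> 0 ids, [75,100] -> 1 id
print(bucket_counts(ids, 4))  # [1, 8, 0, 1]
```

The shard holding 8 of the 10 rows is the "slowest thread" that dominates the import time.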

With Snowflake-generated IDs, an ID's magnitude depends on the timestamp at generation time. If a large batch of articles is crawled within a short window, a huge number of IDs is produced almost at once while other periods produce very few, and a Sqoop import will skew. In that case the data must be re-sharded manually into even slices before importing.

II. Research on the ID generation schemes of large companies

Toutiao

Toutiao's content is mainly articles, image galleries, and videos, which come either from in-house editors or from crawling other sites.

Research methods

Here is a Toutiao article: www.toutiao.com/a6841306705…

The ID has 19 decimal digits, and it is unlikely that the database holds that many articles, so it is probably not a plain auto-increment primary key; it is very likely Snowflake-generated.

Converting 6841306705796530700 to binary gives

10111 10111 10001 00110 10000 11001 11011 11101 00000 00000 00010 00001 100

It is 63 bits in total. The original Snowflake layout is 1 sign bit + 41-bit millisecond timestamp + 10 machine bits + 12 sequence bits, so we take the first 41 bits:

10111 10111 10001 00110 10000 11001 11011 11101 0

Converting to decimal gives

1631094623994

Interpreted as a millisecond timestamp, that is

2021-09-08 17:50:23

That date is in the future, so this layout is unlikely. In practice there is no need for 2^12 = 4096 IDs within 1 ms; even a few hundred QPS over a whole second would already be very high.

Rule of thumb: 80% of daily traffic arrives in 20% of the time, the peak window.

Formula: (total daily PV × 80%) / (seconds per day × 20%) = peak QPS; peak QPS / per-machine QPS = number of machines required.

Q: With 3,000,000 PV per day on a single machine, how many QPS must that machine handle?
A: (3,000,000 × 0.8) / (86,400 × 0.2) ≈ 139 QPS
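The peak formula above can be checked in two lines:

```python
daily_pv = 3_000_000        # total page views per day
seconds_per_day = 86_400

# 80% of the traffic arrives in 20% of the time (the peak window)
peak_qps = (daily_pv * 0.8) / (seconds_per_day * 0.2)
print(round(peak_qps))  # 139
```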

So with 3 million requests a day concentrated into the peak window, peak load is about 139 QPS. Even ingesting 3 million articles a day (and hence needing 3 million IDs) is a hard target to reach, so the required QPS is actually small. Toutiao therefore probably uses 1 sign bit + a 31-bit second-level timestamp + 32 bits of machine and sequence, so we take the first 31 bits:

10111 10111 10001 00110 10000 11001 1

Converting to decimal gives

1592865843

Interpreted as a second-level timestamp, that is

2020-06-23 06:44:03

This is very close to the displayed publication time of 2020-06-23 06:44:52. The small gap may be because the timestamp in the ID is the record's creation time (when the editor clicked the New Article button and entered the edit page), while the page shows the publication time.
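The whole extraction can be checked in a few lines (the 63-bit length and the 31-bit second timestamp are this article's guesses, not a published format):

```python
from datetime import datetime, timezone, timedelta

article_id = 6841306705796530700
assert article_id.bit_length() == 63  # the 64th (sign) bit is 0

# Guess: 1 sign bit + 31-bit second timestamp + 32 machine/sequence bits,
# so the timestamp is everything above the low 32 bits.
seconds = article_id >> 32
print(seconds)  # 1592865843

beijing = timezone(timedelta(hours=8))
print(datetime.fromtimestamp(seconds, beijing))  # 2020-06-23 06:44:03+08:00
```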

Detailed research results

After extracting a number of IDs and converting them to binary, we guess the ID layout is: 31-bit second-level timestamp + 10-bit sequence bits + 18-bit reserved bits + 4-bit machine bits. Looking at the samples, the 18 reserved bits are always 000000000000100000. This supports up to 2^4 = 16 ID-generating machines running at once, up to 1024 IDs per second per machine, and up to 16,384 IDs per second in total. (What the 18 reserved bits are for is currently unknown.)

Meituan

Maoyan Movie – continuous increment

Movie IDs run from 1 to 8875, i.e. 8875 movies in total.

  • maoyan.com/films/1
  • maoyan.com/films/2
  • maoyan.com/films/428
  • maoyan.com/films/429

Hazelnut B&B – non-consecutive increment

IDs below 2,500,000 have no data, and many IDs are non-consecutive. Our guess is that the Leaf segment mode is used: a segment of IDs is fetched from the database each time, the available IDs are held in memory, and the business systems draw from them; on top of that an ID-discarding strategy makes the published IDs non-consecutive, protecting information and raising the cost of crawling.

  • phoenix.meituan.com/housing/268… // no data
  • phoenix.meituan.com/housing/268… // data
  • phoenix.meituan.com/housing/268… // data
  • phoenix.meituan.com/housing/268… // no data
  • phoenix.meituan.com/housing/268… // data

Meituan takeout order numbers

These are almost certainly Snowflake-generated, but because the time bits hold a time difference (a timestamp minus some chosen epoch) and the bit allocation may involve extra processing, the exact layout is hard to infer.

| Order number | Order time |
| --- | --- |
| 5026 0271 7642 2612 7 | 2018-09-02 17:46:03 |
| 6233 7340 1594 0462 | 2018-07-17 18:34:11 |
| 3962 9431 3190 0632 1 | 2018-07-15 11:52:55 |
| 2273 5322 6921 7033 5 | 2017-06-30 11:39:01 |

Autohome (Chejiahao)

Existing IDs are 4 to 7 digits long, currently topping out above 3.2 million. The implementation is probably database primary-key auto-increment; given their scale, they likely use multiple databases with a configured step size so that several databases generate IDs together.

Below is the statistical table

| Work ID | URL | Type |
| --- | --- | --- |
| 9999 | chejiahao.autohome.com.cn/info/9999 | The shift |
| 3200292 | chejiahao.autohome.com.cn/info/320029… | The shift |
| 3200293 | chejiahao.autohome.com.cn/info/320029… | Video |
| 3200294 | chejiahao.autohome.com.cn/info/320029… | Article |
| 3292759 | chejiahao.autohome.com.cn/info/329275… | Single car |
| 2037191 | chejiahao.autohome.com.cn/info/203719… | Audio |

Conclusion

It appears that large companies today mostly use Snowflake-style schemes for distributed ID generation: they satisfy highly concurrent ID demand, and the IDs have little continuity, which protects information and raises the cost of crawling. For services that start with low concurrency, single-database auto-increment is enough. Scheme 4, database segment allocation, can also handle high concurrency, but its IDs are largely consecutive; even with extra ID-discarding logic it is relatively easy for a competitor to probe IDs one by one and infer order-volume information.
