MongoID discussion on distributed unique ID generation scheme

The author: Liu Daiming (commonly used net name: Line like wind, commonly used game name: senior’s quilt), a literary youth (Nani? Youth? Yes, you read that right: The Definition of youth by the World Health Organization (2020) is 18-65 years old. Now qihoo 360 search technology department as a Web server technical expert position. This article is the author’s original work. Enjoy reading it.

background

MongoDB, which you have all used, will automatically generate a “_id” when querying the data after it falls to the disk, such as:

Db.test. insert({"name":" Tom "}) Query result: {" _id" : ObjectId(" 5FD049327fbb28868f4660a5 "), "name":" Tom "}Copy the code

MongoID is used as the primary key index, which is globally unique in the entire database, even in clustered cases.

So what does this ID do? Or in what situations will it be used? Of course, the following two conditions need to be met:

1. The amount of data is now or in the future relatively large and in an increasing state.

2. A unique ID is required.

Such as orders, coupons, messages, data to be retrieved or processed (map POI data, spider crawler data), etc.

What features are required for these scene ids:

1. Globally unique. Yes, two different orders generate the same ID, so how can we tell the difference?

2. The trend is increasing. The generated ID may be used as a database index, and some database indexes are B+ tree indexes, which, if trending upward, make better use of the disk’s block-read storage feature to improve write performance.

3. High performance and high availability. It cannot become a business bottleneck.

4. Discontinuity. If this ID is used externally, it is easy to know the amount of data in a day. If it is a detailed ID, it is also easy to batch capture.

5. The byte usage is small. If it’s all numbers, it’s more efficient to store and sort. For example, 1111111 has more advantages in storage and sorting than AAAAA-111111

MongoID constitute

MongoID is represented with 12 bytes, each two hexadecimal. The diagram below:

However, in ordinary business, Mr. ID is needed for the business side to use, so what are the plans?

Solution a:

Use the UUID generator (many programming languages come with UUID generation). The generated format is xxxxXXXX-XXXX-XXXX-XXXX-XXXXXXXX. The generated format is a long string, for example, 123E4567-E89B-12D3-A456-426655440000. If the key is used as the primary key, the storage and index performance is low.

Scheme 2:

Use incremented ids (programmatically generated or with mysql, Redis). But there are the following problems:

1. It does not satisfy the discontinuity of property 4, but it does not matter if it is only used internally.

2. What we are talking about is distribution, and if such a scheme is to provide services in a distributed way, the operating cost will be very high. This mainly occurs when a new machine is added or the machine is restarted abnormally.

For example, if we now have two machines A and B, machine A can start with the value 1 and step 2 (its generated ID is: 1, 3, 5, 7…). ;

Machine B starts at 2 and has a step of 2 (its generated ids are: 2, 4, 6…) To satisfy global uniqueness.

What if you need another machine as your business grows? Then we need to change the step size to 3. The starting value is set to a larger value than the current maximum generated, and the other two machines will have to adjust.

In addition, record the maximum value generated by the current machine and restore it after the machine restarts abnormally.

Solution 3:

Use number segments.

To do this, store a maximum value field in the database, and then start counting, updating the maximum value only when the value reaches MaxValue (a pre-allocated idea that only the maximum value is recorded). For example, if the maximum value of the database is 10000 and the step length is +10000 each time, and the current value of the transmitter is 0, each service obtains the number from the transmitter and sends the number +1 each time. Then, when the number reaches 10000, the database is updated. The maximum value in the database is changed to 20000, and the transmitter gradually increases from 10000 to 20000. If businesses want to be isolated, the business PARTY ID can be stored in the database.

Plan 4:

Based on Snowflake algorithm

This algorithm is the ID generation algorithm adopted by Twitter for internal distributed projects.

MongoID uses 8 bytes (64-bit), 4 bytes less than the MongoID bit, as follows:

The result is INT64. The first digit is reserved (positive number), and the rest bits are shown in the figure above.

The maximum machine ID is 1024,

The maximum self-increment serial number is 4096, that is, a maximum of 4096 numbers can be generated within 1 millisecond. If the number of numbers exceeds the requirement, you need to wait until the next millisecond (THE QPS is up to 4096000, which is basically unavailable for general services). 2^41/(1000360024*365)=69.7 years can be used without using the database. Performance is high, and because of its millisecond timestamp design, if it is not generated continuously in 1 millisecond, it will not be continuous.

So what’s the downside of the Snowflake algorithm?

That’s time callback. The Snowflake algorithm depends on system time, and if a time callback occurs, the generated ID cannot be guaranteed to be globally unique.

Is there a better way to do it?

I hope to make such a distributed ID generation based on the Snowflake algorithm:

1. The ID is generated independent of the system time, so that there is no time for callback.

2. Higher performance: After all, there is a lot of time spent on each call.

3. Minimum dependence, no operation required. In this way, no manual operation is required when the machine is added or restarted.

So I designed such a distributed ID generator – meteor algorithm.

Let’s compare it to the benchmark of the Snowflake algorithm.

Running machine: Lenovo Small New Pro13 Ryzen 5 3550H

Go version: 1.15

goos: windows goarch: Amd64 BenchmarkSnowflake-8 4917643 244 NS /op 0 B/ OP 0 ALlocs/OP BenchmarkMeteor-8 52173231 22.6 ns/op 0 B/op 0 ALlocs /opCopy the code

It can be seen that the meteor algorithm is 10 times faster than the snowflake algorithm. Let’s introduce the meteor algorithm.

Solution 5:

Meteor algorithm

Also 64-bit,

Algorithm composition:

Design intention:

The first digit is also kept (positive)

The Data bit is the second difference when the current NodeID is created. In this way, the system time is required only when the NodeID is created, and the system time is not required when the ID is generated later, preventing time rollback.

The NodeID bit is the NodeID. To ensure that the generated ID is unique, the NodeID must be added every time a new machine or service is restarted. In this way, even if a time callback occurs, the resulting ID is guaranteed to be unique because the NodeID is unique.

Auto-increment serial number, 11 bits, maximum 2048. Each time it is generated, the sequence number is incremented by +1. When it is full, the Data bit is +1.

Random digit, 3 digits. Why not combine the auto-increment sequence number and random digit into auto-increment sequence number bits? Mainly for property 4: discontinuity.

The generation of snowflake algorithm depends on milliseconds, and the time bit is very thin. Only the ID that is continuously generated within this millisecond will be continuous, which is very harsh (the generation will be continuous only when QPS reaches 4096000).

So we add this random digit to make sure that the generated ID is not continuous. So how do we generate random numbers? Most random numbers are seeded by system time, but this is not as high performance as system time. I want an efficient random number generation algorithm independent of system time. Finally, Xorshift algorithm is selected.

Generate 10 ids as follows:

5016762319896585
5016762319896596
5016762319896600
5016762319896614
5016762319896616
5016762319896626
5016762319896635
5016762319896640
5016762319896651
5016762319896656
Copy the code

And it boils down to,

1. The NodeID creation depends on the current system time, but the generation does not need the system time to achieve the de-system time, thus solving the time callback.

2. The NodeID must be the same each time. This ensures global uniqueness.

3. Random digits ensure that the generated ID is not continuous.

Nodeids can be registered with the Mysql increment ID.

MongoID discussion on distributed unique ID generation scheme

background

MongoID constitute

Related Posts

Classification model evaluation method and Python implementation

Golang implements simple crawler framework (5) — project reconstruction and data storage

Ali Cloud He Chuan: open and compatible cloud, computing nest to help partners cloud upgrade