Interviewer: How to ensure distributed unique global ID generation under high concurrency?

Concern public number: IT elder brother, read a dry goods article every day, a year later you will find a different self.

preface

System unique ID is a problem we often encounter when designing a system, and we often struggle with this problem.

This article is to provide you with a generation of distributed unique global ID generation scheme ideas, I hope to help you.

Deficiencies, please give more advice!!

The problem

Why do you need distributed globally unique ids and the business requirements for distributed ids

In a complex distributed system, it is often necessary to uniquely identify a large number of data and messages, such as finance, payment, catering and hotel on Meituan Dianping

The data in the system of products such as Cat’s Eye gradually increases, and a unique ID is needed to identify a piece of data or information after the database is divided into different databases and tables.

Special Ian orders, riders, coupons need to be marked with a unique ID

A system capable of generating globally unique ids is necessary at this point

Mandatory requirements for ID generation rules

Globally unique
Increasing trend
In MySQL’s InnoDB engine, clustered indexes are used. Since most RDBMS use Btree data structure to store indexes, we should try to use ordered primary keys to ensure write performance
Monotone increasing
Ensure that the next ID is greater than the previous ID. For example, the transaction version number, IM incremental messages, and sorting are required
Information security
If the ID is continuous, malicious users crawling work is very easy to do, directly in accordance with the order to download the specified URL, if the order number is dangerous, competitors can directly know our single amount of a day, so in some application scenarios, the need for ID is irregular, so that competitors can not guess
Contain the timestamp
Is it also possible to quickly understand in development when the distributed ID was generated

The ID number generates availability requirements for the system

High availability
To issue a request for a distributed ID, the server is guaranteed to create a unique distributed ID for me 99.999% of the time
Low latency
Send a request for a distributed ID, and the server needs to be fast, fast
High QPS
For example, the server will have to successfully create 100,000 distributed ids in one go

Generic generic solution

UUID

RandomUUID (), UUID standard type contains 32 hexadecimal digits, hyphen into five segments, form 8-4-4-12 36 characters, very high performance, local generation, no network consumption.

There is a problem

Poor inbound database performance because the UUID is unordered

Unordered, cannot predict its generation order, cannot generate the increasing order of the number

First of all, distributed ids are generally used gradually, but according to the official recommendation of mysql, the primary key should be as short as possible. UUID each is very long, so it is not recommended.

Primary key, when ID is the primary key, there are some problems under certain circumstances

For example, in the case of DB primary key, UUID is very unsuitable for MySQL

Index, B+ tree index split

Since the distributed ID is the primary key, and the primary key contains the index, and the mysql index is implemented through the B+ tree, every new UUID data is inserted, to optimize the query, the underlying INDEX B+ tree will be modified, because the UUID data is unordered. Therefore, every insert of UUID data will greatly modify the B+ tree of the primary key, which is not good. Completely disorderly insert will not only cause some intermediate nodes to split, but also create a lot of unsaturated nodes, which greatly reduces the performance of database insert.

UUID can only ensure global uniqueness and does not satisfy trend increasing or monotone increasing

The primary key is automatically added to the database

stand-alone

In the distributed environment, the main principle of the database increment ID mechanism is: the database increment ID mechanism and mysql database replace into implementation, replace into is similar to insert function, the difference is: Replace into first attempts to insert into the table. If there is already a row in the table (based on the primary key or unique index), delete the row first. Otherwise, insert new data directly.

REPLACE INTO means to insert a record and REPLACE old data if the value of the unique index in the table encounters a conflict

REPLACE into t_test(stub) values('b');
select LAST_INSERT_ID();

Copy the code

Each time we insert, we find that the original data is replaced and the ID is increased

That’s enough

increase
monotonicity
uniqueness

In distributed cases, and not many concurrent cases, you can use this solution to obtain a global unique ID

Cluster Distributed cluster

Is the database increment ID mechanism suitable for distributed ids? The answer is not very

It is difficult to expand the system horizontally. For example, after defining the step size and the number of machines, what should we do if we want to add machines? Suppose there is a machine whose number is 1,2,3,4,5 (step size is 1), and one machine needs to be expanded. It seems ok to set the initial value of the second batch of machines much higher than that of the first batch, but assuming there are 100 machines on the line, how to expand the capacity at this time is a nightmare, so the system horizontal expansion scheme is complicated and difficult to achieve.

Database pressure is still very large, each time to obtain the ID has to read and write the database, very bad performance, does not comply with the distributed ID in the low latency and high QPS rules (under high concurrency, if all to the database to obtain the ID, it is very bad performance)

Generate global ID policy based on Redis

Stand-alone version

Because Redis is single-threaded, atomicity is guaranteed by nature and can be achieved using atomic operations INCR and INCRBY

INCRBY: Set the growth step

Cluster distribution

Note: in the case of a Redis cluster, the same as MySQL, you need to set a different growth step, and the key must be set to expire, you can use a Redis cluster to achieve higher throughput.

Assuming there are five Redis in a cluster, you can initialize each Redis to a value of 1,2,3,4,5, and set the step size to 5

The ID generated by each Redis is:

A: 16 11 16 21 B: 2 7 12 17 22 C: 3 8 13 18 23 D: 4 9 14 19 24 E: 5 10 15 20 25Copy the code

But the problem is that Redis cluster maintenance and maintenance is more troublesome, troublesome configuration. Because of the single point of failure, sentry watch

But the main problem is that you need to import the entire Redis cluster for one ID

Snowflakes algorithm

What is the

Twitter’s distributed increment ID algorithm, Snowflake

Initially Twitter migrated its storage system from MySQL to Cassandra (an open source distributed NoSQL database system developed by Facebook) because Cassandra did not have a sequential ID generation mechanism, so a globally unique ID generation service was developed.

Twitter’s distributed SnowFlake algorithm, SnowFlake, was able to generate 260,000 self-incrementally sortable ids per second in tests

Twitter’s SnowFlake generation ids can be generated in chronological order
The result of SnowFlake’s ID is a 64-bit integer of type Long (up to 19 in length when converted to a string).
No ID collisions (distinguished by Datacenter and workerID) occur in distributed systems and are efficient

In a distributed system, there are some scenarios that require globally unique ids, and basic requirements for ID generation

In a distributed environment, it must be globally unique
In general, monotonically incrementing is required because unique ids are usually stored in the database, and InnoDB’s feature is to store content in the primary key index of the leaf node, and incrementing from left to right, so for database performance, it is best to generate ID monotonically incrementing. The 36-bit UUID can be used to prevent ID collisions, but UUID has some disadvantages. First, it is relatively long, and UUID is generally unordered
There might also be a need for no rules, because if you use something like a unique ID as an order number, you need this rule in order not to let anyone know how many orders were placed in a day

structure

Several core components of the Snowflake algorithm

In Java 64bit certificates are of type long, so the ID generated in SnowFlake is stored in the long class

The first part

The highest bit in binary is the sign bit, where 1 represents a negative number and 0 represents a positive number. The generated ID is usually an integer, so the highest bit is fixed at 0.

The second part

The second part is the 41bit timestamp bit, used to record the timestamp, millisecond level

41 bits can represent 2^ 41-1 numbers

If used only to represent positive integers, the range is 0-2^ 41-1, minus 1 because the range of values that can be represented starts at 0, not 1.

This means that 41 bits represent a value of 2^ 41-1 milliseconds, which translates into 69.73 years

The third part

The third part is the working machine ID. 10Bit is used to record the working machine ID

Can be deployed on 2^10 = 1024 nodes, including 5-bit datacenterId and 5-bit workerID

The largest positive integer that can be represented with 5 bits is 2 ^ 5 = 31 digits to represent different data centers and machine codes

The fourth part

The 12-bit bit can be used to express positive integers as 2^12 = 4095, i.e. 0, 1, 2… 4094 represents 4095 ID serial numbers generated in the same timestamp of the same machine.

SnowFlake is guaranteed

All generated ids increase over time

No duplicate ids are generated across the distributed system because the datacenterId and workerId distinguish between them

implementation

Snowflake algorithm is written by scala algorithm, some people use Java implementation, Github address

Github.com/beyondfengy…

/** * @author beyond */ public class snowflake {/** * private final static */ long START_STMP = 1480166465631L; Private final static long SEQUENCE_BIT = 1; private final static long SEQUENCE_BIT = 1; Private final static long MACHINE_BIT = 5; Private final static long DATACENTER_BIT = 5; Private final static long MAX_DATACENTER_NUM = -1l ^ (-1l << DATACENTER_BIT); // Datacenter_num = -1l ^ (-1l << DATACENTER_BIT); private final static long MAX_MACHINE_NUM = -1L ^ (-1L << MACHINE_BIT); private final static long MAX_SEQUENCE = -1L ^ (-1L << SEQUENCE_BIT); Private final static long MACHINE_LEFT = SEQUENCE_BIT; private final static long MACHINE_LEFT = SEQUENCE_BIT; private final static long DATACENTER_LEFT = SEQUENCE_BIT + MACHINE_BIT; private final static long TIMESTMP_LEFT = DATACENTER_LEFT + DATACENTER_BIT; private long datacenterId; // Private long machineId; Private long sequence = 0L; Private long lastStmp = -1l; Public SnowFlake(long datacenterId, long machineId) { if (datacenterId > MAX_DATACENTER_NUM || datacenterId < 0) { throw new IllegalArgumentException("datacenterId can't be greater than MAX_DATACENTER_NUM or less than 0"); } if (machineId > MAX_MACHINE_NUM || machineId < 0) { throw new IllegalArgumentException("machineId can't be greater than MAX_MACHINE_NUM or less than 0"); } this.datacenterId = datacenterId; this.machineId = machineId; } public synchronized long nextId() {long currStmp = getNewstmp(); if (currStmp < lastStmp) { throw new RuntimeException("Clock moved backwards. Refusing to generate id"); } if (currStmp == lastStmp) {if (currStmp == lastStmp) {if (currStmp == lastStmp) { If (sequence == 0L) {currStmp = getNextMill(); }} else {// In different milliseconds, set the sequence number to 0; } lastStmp = currStmp; Return (currStmp - START_STMP) < < TIMESTMP_LEFT / / part timestamp | datacenterId < < DATACENTER_LEFT/part/data center | machineId < < MACHINE_LEFT / / machine identifier portion | sequence; } private long getNextMill() {long mill = getNewstmp(); while (mill <= lastStmp) { mill = getNewstmp(); } return mill; } private long getNewstmp() { return System.currentTimeMillis(); } public static void main(String[] args) { SnowFlake snowFlake = new SnowFlake(2, 3); for (int i = 0; i < (1 << 12); i++) { System.out.println(snowFlake.nextId()); }}}Copy the code

Project landing experience

Hutools toolkit

Address: github.com/looly/hutoo…

SpringBoot integrates snowflake algorithms

Introduce the Hutool utility class

<dependency> <groupId>cn. Hutool </groupId> <artifactId>hutool-all</artifactId> <version>5.3.1</version> </dependency>Copy the code

integration

/** * public class SnowFlakeDemo {private long workerId = 0; private long datacenterId = 1; private Snowflake snowFlake = IdUtil.createSnowflake(workerId, datacenterId); @postconstruct public void init() {try {// Convert network IP to long workerId = netutil.ipv4tolong (netutil.getLocalHostStr ()); } catch (Exception e) { e.printStackTrace(); }} public synchronized long snowflakeId() {return this.snowflake. NextId (); } public synchronized long snowflakeId(long workerId, long datacenterId) { Snowflake snowflake = IdUtil.createSnowflake(workerId, datacenterId); return snowflake.nextId(); } public static void main(String[] args) { SnowFlakeDemo snowFlakeDemo = new SnowFlakeDemo(); for (int i = 0; i < 20; i++) { new Thread(() -> { System.out.println(snowFlakeDemo.snowflakeId()); }, String.valueOf(i)).start(); }}}Copy the code

results

1251350711346790400 1251350711346790402 1251350711346790401 1251350711346790403 1251350711346790405 1251350711346790404 1251350711346790406 1251350711346790407 1251350711350984704 1251350711350984706 1251350711350984705 1251350711350984707 1251350711350984708 1251350711350984709 1251350711350984710 1251350711350984711 1251350711350984712 1251350711355179008 1251350711355179009, 1251350711355179010,Copy the code

The advantages and disadvantages

advantages

The number of milliseconds in the high dimension, the increment sequence in the low, the whole ID is trend increasing
It does not rely on third-party systems such as databases, and is deployed as a service, providing higher stability and high performance in ID generation
You can allocate bits according to your service characteristics, which is very flexible

disadvantages

It depends on the machine clock. If the machine clock is rolled back, duplicate ids will be generated
It is incremental on a single machine, but due to the distributed environment, the clock on each machine can not be completely synchronized, sometimes not global incremental situation, this shortcoming can be regarded as indifferent, generally distributed ID only requires trend increasing, will not strictly increase, 90% of the demand only requires trend increasing.

Other supplementary

To solve the clock rollback problem that causes ID duplication, someone later proposes a solution
Baidu open-source distributed unique ID generator UidGenerator
Leaf-meituan-dianping distributed ID generation system