Today’s Recommendation:
GitHub Star 100K! What is the latest Java learning roadmap in 2021?
Good afternoon, I’m Guide!
Today, share a friend to Jingdong interview real encounter interview question: “why to distributed ID? What do you do on your project?” .
In this article, I will present my own views, and introduce the content related to distributed ID in detail, including the basic requirements of distributed ID and common solutions of distributed ID.
The whole article is in the form of vernacular, I hope it can help you!
Original creation is not easy, if it is helpful, thumb up/share is the biggest encouragement to me!
Personal ability is limited. If there is anything that needs to be added/improved/modified, please feel free to point it out in the comments section and make progress together!
A distributed ids
What is ID?
In daily development, we need to use ID unique representation for all kinds of data in the system, for example, user ID corresponds to and only corresponds to one person, commodity ID corresponds to and only corresponds to one item, order ID corresponds to and only corresponds to one order.
We also have various IDs in real life, such as the ID of the ID card corresponding to and only corresponding to one person, the address ID corresponding to and only corresponding to
Simply put, the ID is the unique identity of the data.
What is a distributed ID?
Distributed ID is the ID of a distributed system. Distributed ID does not exist in real life and belongs to the concept of computer systems.
Let me give you a simple example of library and table.
One project of our company uses standalone MySQL. However, I did not expect that one month after the project went online, with the increasing number of users, the data volume of the whole system would become larger and larger.
Standalone MySQL has no way to support, need to be divided into library tables (recommended Sharding-JDBC).
After the split, the data is spread across the database on different servers, and the self-incrementing primary key of the database is no longer sufficient to generate a unique primary key. How do we generate globally unique primary keys for different data nodes?
This is where the distributed ID is generated.
What requirements do distributed IDs need to meet?
As an essential part of distributed system, distributed ID is used in many places.
A basic distributed ID needs to meet the following requirements:
- Global uniqueness: the global uniqueness of ID must be satisfied first!
- High performance: Distributed ID generation is faster and consumes less local resources.
- High availability: Services that generate distributed IDs are guaranteed to be infinitely close to 100% available.
- Convenient and easy to use: ready to use, easy to use, fast access!
In addition to these, a good distributed ID also guarantees:
- Security: No sensitive information is contained in the ID.
- Orderly increment: If you want to store the ID in the database, the order of the ID can improve the database write speed. And, a lot of times, we’ll probably be sorting directly by ID.
- Have specific business implications: Generated IDs with specific business implications make it more transparent to identify the problem and develop it (the ID allows you to identify which business it is).
- Independent deployment: This is where the distributed system has a separate identifier service that generates the distributed ID. This allows the service that generates the ID to be decoupled from the business-related service. However, this also leads to the problem of increased consumption of network calls. In general, if there are many scenarios where distributed ID is needed, it is necessary to deploy the initiator service independently.
Common distributed ID solutions
The database
Database primary key increment
This approach is fairly straightforward, and it uses the self-incrementing primary key of a relational database to generate a unique ID.
Using MySQL as an example, we can do it in the following way.
1. Create a database table.
CREATE TABLE `sequence_id` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`stub` char(10) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `stub` (`stub`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The stub field is meaningless and is just a placeholder so we can insert or modify data. Also, a unique index is created for the stub field to ensure its uniqueness.
2. Byreplace into
To insert data.
BEGIN;
REPLACE INTO sequence_id (stub) VALUES ('stub');
SELECT LAST_INSERT_ID();
COMMIT;
Instead of using INSERT INTO, we use REPLACE INTO to insert data. The procedure is as follows:
1) Step 1: Try to insert the data into the table.
2) Step 2: If insert fails due to duplicate data error in primary key or unique index field, delete the conflicting row with duplicate key value from the table first, and then try to insert data into the table again.
The pros and cons of this approach are also obvious:
- Advantages: relatively simple implementation, ID orderly increment, small storage consumption space
- Disadvantages: small amount of concurrency supported, database single point problem (can be solved by database cluster, but increased complexity), ID has no specific business meaning, security issues (such as the order ID can be calculated according to the increasing law of the order number, business secret!) Every time you get ID, you need to access the database (increased pressure on the database, and the speed is slow)
Database segment schema
Data base primary key increment mode, each access to the ID database, ID demand is relatively large, certainly is not good.
If we can batch access, and then stored in the memory, the need to use the time, directly from the memory to get comfortable! This is what we call the database-based segment pattern for generating distributed IDs.
The number pattern of database is also one of the mainstream distributed ID generation methods at present. Didi’s TinyID is based on this approach. However, TinyID is further optimized by using dual segment caching, adding multiple DB support, etc.
Using MySQL as an example, we can do it in the following way.
1. Create a database table.
CREATE TABLE 'sequence_id_generator' (' id' int(10) NOT NULL, 'current_max_id' bigint(20) NOT NULL, 'step' int(10) NOT NULL COMMENT ', 'version' int(20) NOT NULL COMMENT ', 'biz_type' int(20) NOT NULL, PRIMARY KEY (' id ')) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The CURRENT_MAX_ID and STEP fields are mainly used to obtain the batch ID, which is: CURRENT_MAX_ID ~ CURRENT_MAX_ID + STEP.
The version field is mainly used to solve concurrency problems (optimistic locking), and the biz_type is mainly used to represent amateur types.
2. Insert a row of data.
INSERT INTO `sequence_id_generator` (`id`, `current_max_id`, `step`, `version`, `biz_type`)
VALUES
(1, 0, 100, 0, 101);
3. Get the batch unique ID of the specified service through SELECT
SELECT `current_max_id`, `step`,`version` FROM `sequence_id_generator` where `biz_type` = 101
Results:
id current_max_id step version biz_type
1 0 100 1 101
4. If not enough, update and re-select.
UPDATE sequence_id_generator SET current_max_id = 0+100, version=version+1 WHERE version = 0 AND `biz_type` = 101
SELECT `current_max_id`, `step`,`version` FROM `sequence_id_generator` where `biz_type` = 101
Results:
id current_max_id step version biz_type
1 100 100 1 101
Compared with the way of increasing the primary key of the database, the number mode of the database has less access times and less pressure on the database.
In addition, to avoid a single point of problem, you can improve usability from the master-slave mode.
The pros and cons of the database segment pattern:
- Advantages: ID orderly increment, small storage consumption space
- Disadvantages: database single point problem (can be solved by database cluster, but increased complexity), ID has no specific business meaning, security issues (such as the order ID can be calculated according to the increasing law of order, trade secret!)
NoSQL
In general, NoSQL schemes use Redis more. We can increment id atomic order by Redis incr command.
127.0.0.1:6379> set sequence_id_biz_type 1
OK
127.0.0.1:6379> incr sequence_id_biz_type
(integer) 2
127.0.0.1:6379> get sequence_id_biz_type
"2"
To improve availability and concurrency, we can use Redis CLUSER. Redis Cluster is the official Redis Cluster solution (version 3.0+).
In addition to Redis CLUSER, you can also use the open source Redis clustering solution CODIS (preferably for large clusters, such as hundreds of nodes).
In addition to high availability and concurrency, we know that Redis is memory based and we need to persist data to avoid data loss after machine restarts or machine failures. Redis supports two different types of persistence: snapshotting (RDB) and appending files (AOF). Also, Redis 4.0 is starting to support mixed persistence of RDB and AOF (off by default, but enabled via AOF-use-RDB-preamble).
I won’t talk much about Redis persistence here. If you don’t know this part of the content, you can take a look at the summary of the Redis knowledge points in JavaGuide.
The pros and cons of Redis:
- Pros: good performance and the generated IDs are orderly increasing
- Disadvantages: Similar to the disadvantages of the database primary key increment scheme
In addition to Redis, MongoDB ObjectID is often used as a distributed ID solution.
The MongoDB objectID requires a total of 12 bytes:
- 0~3: timestamp
- 3~6: Represents the machine ID
- 7~8: machine process ID
- 9~11: self increment
Advantages and disadvantages of the MongoDB solution:
- Pros: good performance and the generated IDs are orderly increasing
- Disadvantages: Need to solve duplicate ID problem (may cause duplicate ID when machine time is not correct), security problem (ID generation is regular)
algorithm
UUID
UUID is short for Universally Unique Identifier. The UUID contains 32 hexadecimal numbers (8-4-4-4-12).
The JDK provides an off-the-shelf method for generating UUIDs in a single line of code.
// Example of output: cb4a9ede-fa5e-4585-b9bb-d60bce986eaa uuid. randomUUID()
An example of a UUID in RFC 4122 looks like this:
We will focus on this Version here, and the rules for generating UUIDs are different for different versions.
The meanings of the five different Version values (see Wikipedia’s description of UUID) :
- Version 1: UUIDs are generated based on time and node IDs (usually MAC addresses);
- Version 2: UUIDs are generated based on an identifier (usually a group or user ID), time, and node ID;
- Versions 3, 5: Version 5 – Deterministic UUIDs are generated with hashing namespace identifiers and names;
- Version 4: UUIDs are generated using randomness or pseudo-randomness.
Here is an example of a UUID generated under Version 1:
The JDK default version of the UUID generated by the UUID randomUUID() method is 4.
UUID uuid = UUID.randomUUID(); int version = uuid.version(); / / 4
In addition, Variant also has four different values that correspond to different meanings. I won’t introduce it here. It seems that I don’t need to pay attention to it at ordinary times.
If you need it, check out Wikipedia’s description of Variant for UUID.
As can be seen from the above introduction, UUID can guarantee uniqueness, because its generation rules include MAC address, timestamp, Namespace, random or pseudo-random number, timing and other elements, and the computer generated UUID based on these rules is certainly not repeated.
Although a UUID can be globally unique, it is rarely used.
For example, using a UUID as the primary key of the MySQL database is not appropriate:
- Database primary keys should be as short as possible, and UUIDs consume a lot of storage space (32 strings, 128 bits).
- UUIDs are unordered. Under the InnoDB engine, the disorder of the database primary key will seriously affect the database performance.
Finally, let’s briefly analyze the advantages and disadvantages of UUID (you may be asked about it in the interview!). :
- Advantages: the generation speed is relatively fast, simple and easy to use
- Disadvantages: Large storage consumption (32 strings, 128 bits), insecure (algorithms that generate UUIDs based on MAC addresses leak MAC addresses), out of order (not autoincrement), no specific business meaning, need to solve the problem of duplicate IDs (which may result in duplicate IDs if the machine time is not correct)
Snowflake algorithm
Snowflake is Twitter’s open-source distributed ID generation algorithm. Snowflake consists of 64 bits of binary number divided into sections, each of which stores data with a specific meaning:
- Bit 0: The symbol bit (indicating positive or negative) is always 0. It is useless. Never mind.
- Bits 1 to 41: A total of 41 bits, used to represent the timestamp, in milliseconds, can support 2 ^41 milliseconds (about 69 years)
- Bits 42-52: a total of 10 bits. Generally speaking, the first 5 bits represent the ID of the machine room and the last 5 bits represent the ID of the machine (it can be adjusted according to the actual situation in the actual project). This allows you to distinguish between nodes in different clusters/rooms.
- Bits 53 to 64:12 bits in total, used to indicate the serial number. The serial number is self-incremative and represents the maximum number of IDs that a single machine can generate per millisecond (2^12 = 4096), which means that a single machine can generate up to 4096 unique IDs per millisecond.
If you want to use the Snowflake algorithm, you generally don’t need to reinvent the wheel yourself. There are many open source implementations based on Snowflake algorithm, such as Leaf of Meituan and UidGenerator of Baidu, and these open source implementations optimize the original Snowflake algorithm.
In addition, in real projects, we usually modify the Snowflake algorithm, the most common one is to add business type information to the ID generated by the Snowflake algorithm.
Let’s take a look at the pros and cons of the Snowflake algorithm:
- Advantages: Fast generation, sequential increase of generated IDs, flexible (simple modifications to Snowflake algorithm can be made, such as adding business IDs)
- Disadvantages: Duplicated ID issues need to be resolved (time dependent, which can result in duplicate IDs if the machine time is not correct).
Open source framework
UidGenerator (baidu)
UidGenerator is an open source Baidu unique ID generator based on Snowflake(Snowflake algorithm).
However, UIdGenerator improves on the Snowflake algorithm to generate the following unique ID composition.
As you can see, it’s not quite the same composition as the unique ID generated by the original Snowflake. And, the above parameters can be customized.
The description in the official documentation of UIDGenerator is as follows:
Since 18 years, the UidGenerator has been largely unmaintained, and I will not cover it here. For more information, check out the official description of the UidGenerator.
The Leaf (Meituan)
Leaf is a distributed ID solution from Meituan open source. The project’s name LEAF comes from a quote by German philosopher and mathematician Leibniz: “There are no two identical leaves in the world.” This name is really quite good, a little literary youth that taste!
Leaf provides two modes, segment mode and Snowflake(Snowflake algorithm), to generate distributed IDs. Also, it supports dual segments, and also resolves the Snowflake ID system clock reversal issue. However, the solution to the clock problem relies slightly on ZooKeeper.
Leaf was born mainly to solve the problem of multiple and unreliable methods of generating distributed IDs for Meituan lines of business.
Leaf improves on the original segment mode, such as adding a double segment to prevent the DB from blocking the thread requesting the ID while the DB is fetching the segment. To put it simply, I take the initiative to get the next segment in advance before I finish one segment (the picture is from the official article of Meituan: LEAF — distributed ID generation system for Meituan comment).
According to the project README introduction, on the basis of 4C8G VM, the QPS pressure test result was nearly 5W /s, TP999 1ms, through the company’s RPC call.
Tinyid (drops)
TinyID is a unique ID generator based on the database segment pattern of Didi Open Source.
The principle of the database segment pattern has been described above. What’s good about TinyID?
To understand this, let’s first look at a simple architectural scenario based on the database segment pattern. (Image taken from TinyID’s official wiki: An Introduction to TinyID’s Principles)
In this architectural pattern, we request a unique ID from the initiator service via an HTTP request. The load-balancing router will send our request to one of the tinyid-servers.
What’s wrong with this scheme? In my opinion (and in the official TinyID wiki, too), there are two main problems:
- In the case of getting a new segment, the program is slow to get a unique ID.
- The DB needs to be highly available, which can be cumbersome and resource consuming.
In addition, HTTP calls also have network overhead.
TinyID’s principle is simple, and its architecture is shown below:
Compared to the simple schema based on the database segment pattern, the TinyID scheme has the following optimizations:
- Double-numbered segment cache: To avoid getting a new segment, the program is slow to get a unique ID. When the number in TinyID is used to a certain extent, the next number is loaded asynchronously to ensure that the number is always available in memory.
- Added multiple DB support: Support for multiple DBs, and each DB can generate a unique ID, improving usability.
- Added tinyid-client: pure local operation, no HTTP request consumption, performance and availability greatly improved.
The advantages and disadvantages of Tinyid are not analyzed here, combined with the advantages and disadvantages of the database segment pattern and the principle of Tinyid can know.
Summary of distributed ID generation scheme
In this article, I’ve basically gone through the most common distributed ID generation schemes.
Afterword.
Finally, I recommend a very good open source project for Java tutorial classes: JavaGuide. I created javaGuide during my junior year of college, when I started preparing for interviews in the fall. At present, the project has 100K + STAR, related reading: 1049 days, 100K! Simple review! .
It is very helpful for you to learn Java and prepare for the interview in Java direction! As the author says, this is a guide to Java learning + interviewing that covers the core knowledge that most Java programmers need to know!
Related Recommendations:
- Graphic computer basics!
- Ali ACM big man open source study notes! TQL!
- Computer quality books search + recommended learning routes!
I am the Guide brother, embrace open source, like cooking. Author of open source project javaGuide, Github: snailclimb-overview. In the next few years, I hope to continue to improve JavaGuide and strive to help more friends to learn Java! ‘! 凎! Click to see my 2020 work report!
In addition to the methods described above, middleware such as ZooKeeper can also help us generate unique IDs. There is no silver bullet, be sure to combine the actual project to choose the most suitable for their own program.