Github standard star 100K! What is the latest Java Learning Roadmap for 2021?

Good afternoon, I’m Guide!

Today to share a friend to jingdong interview really met the interview question: “Why distributed ID? How do you do it in your project?” .

In this article, I will talk about my views and introduce distributed ids in detail including the basic requirements and common solutions for distributed ids.

This article is all in the form of plain English, hope to bring you help!

Original is not easy, if it is helpful, praise/share is the biggest encouragement to me!

Personal ability is limited. If there is anything that needs to be added/improved/modified, please feel free to point it out in the comments section and make progress together!

A distributed ids

What is ID?

In daily development, we need to use ID to uniquely represent various data in the system, such as user ID corresponding to only one person, commodity ID corresponding to only one commodity, order ID corresponding to only one order.

We also have various IDS in real life, such as ID card ID corresponding to and only corresponding to one person, address ID corresponding to and only corresponding to

Simply put, an ID is a unique identifier for data.

What is a distributed ID?

The distributed ID is the ID of a distributed system. Distributed ID does not exist in real life and belongs to a concept in computer systems.

Let me give you a simple example of a separate database and separate tables.

A project of our company uses standalone MySQL. However, unexpectedly, a month after the project went online, with more and more users, the data volume of the whole system would be bigger and bigger.

Standalone MySQL can no longer be supported, so it needs to be divided into database and table (sharing-JDBC is recommended).

After the split, data spread across databases on different servers, and the auto-increment primary key of the database can no longer meet the unique primary key generated. How do we generate globally unique primary keys for different data nodes?

This is where you need to generate distributed ids.

What are the requirements for a distributed ID?

As an essential part of distributed system, distributed ID is used in many places.

A basic distributed ID meets the following requirements:

  • Globally unique: The global uniqueness of an ID must be satisfied first!
  • High performance: The generation of distributed ids is fast and the consumption of local resources is small.
  • High availability: Services that generate distributed ids are guaranteed to be infinitely close to 100% available.
  • Easy to use: use, easy to use, fast access!

In addition to these, a good distributed ID ensures that:

  • Security: The ID does not contain sensitive information.
  • Ordered increments: The orderliness of ids can speed up database writes if you want to store ids in a database. And, a lot of times, we’ll probably sort by ID.
  • Have specific business meaning: Generated ids that have specific business meaning can make problem location and development more transparent (which business can be determined by the ID).
  • Standalone deployment: That is, a distributed system has a separate transmitter service dedicated to generating distributed ids. This allows the services that generate ids to be decoupled from business-related services. However, this also introduces the problem of increased network call consumption. In general, if there are many scenarios where distributed ids are needed, a independently deployed transmitter service is necessary.

Distributed ID common solution

The database

Database primary key increment

This approach is relatively straightforward, is the relational database through the increment of the primary key to generate a unique ID.

With MySQL as an example, we can do it in the following way.

1. Create a database table.

CREATE TABLE `sequence_id` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `stub` char(10) NOT NULL DEFAULT ' '.PRIMARY KEY (`id`),
  UNIQUE KEY `stub` (`stub`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Copy the code

Stub fields are meaningless and just placeholders for inserting or modifying data. In addition, a unique index is created for the stub field to ensure its uniqueness.

2. Byreplace intoTo insert data.

BEGIN;
REPLACE INTO sequence_id (stub) VALUES ('stub');
SELECT LAST_INSERT_ID(a);
COMMIT;
Copy the code

Instead of using insert into, we use replace into to insert data as follows:

1) Step 1: Try to insert data into the table.

2) Step 2: If the insert fails due to a duplicate error on a primary key or unique index field, delete the conflicting rows with duplicate key values from the table, and then try again to insert the data into the table.

The advantages and disadvantages of this approach are also obvious:

  • Advantages: relatively simple to implement, ID sequential increment, small storage consumption space
  • Disadvantages: small amount of concurrency support, database single point problem (can use database cluster to solve, but increased complexity), ID does not have specific business meaning, security issues (such as the order volume can be calculated according to the order ID increment rule, trade secret ah!) Access the database every time to obtain the ID (increase the pressure on the database, the acquisition speed is slow)

Database number segment mode

Database primary key auto-increment mode, each time to obtain the ID of the database to access the database, ID demand is large, certainly not.

If we can batch obtain, and then stored in memory, when we need to use, directly from the memory to take comfortable! This is what we call a database based number segment schema to generate distributed ids.

The number segment mode of database is also a relatively mainstream distributed ID generation method. Open source apps like Didi’s Tinyid are based on this approach. However, TinyId is further optimized by using two-number segment caching and adding multiple DB support.

With MySQL as an example, we can do it in the following way.

1. Create a database table.

CREATE TABLE `sequence_id_generator` (
  `id` int(10) NOT NULL,
  `current_max_id` bigint(20) NOT NULL COMMENT 'Current maximum ID',
  `step` int(10) NOT NULL COMMENT 'Length of segment',
  `version` int(20) NOT NULL COMMENT 'Version number',
  `biz_type`    int(20) NOT NULL COMMENT 'Business type'.PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Copy the code

The current_MAX_id and step fields are used to obtain the batch ID. The batch ID ranges from current_MAX_id to current_max_id+step.

The Version field is mainly used to solve concurrency problems (optimistic locks), and biz_type is mainly used to represent amateur types.

2. Insert a row of data.

INSERT INTO `sequence_id_generator` (`id`, `current_max_id`, `step`, `version`, `biz_type`)
VALUES
	(1.0.100.0.101);
Copy the code

3. Run the SELECT command to obtain the unique BATCH ID of the specified service

SELECT `current_max_id`, `step`,`version` FROM `sequence_id_generator` where `biz_type` = 101
Copy the code

Results:

id	current_max_id	step	version	biz_type
1	0	100	1	101
Copy the code

4. If it is not enough, update it and SELECT it again.

UPDATE sequence_id_generator SET current_max_id = 0+100, version=version+1 WHERE version = 0  AND `biz_type` = 101
SELECT `current_max_id`, `step`,`version` FROM `sequence_id_generator` where `biz_type` = 101
Copy the code

Results:

id	current_max_id	step	version	biz_type
1	100	100	1	101
Copy the code

Compared with the database primary key increment mode, the number segment mode of the database has less access times to the database and less pressure on the database.

In addition, to avoid single point problems, you can improve usability by using master-slave mode.

Advantages and disadvantages of database number segment schema:

  • Advantages: Sequential increment of ids and small storage space consumption
  • Disadvantages: database single point problem (can be solved by using database cluster, but increased complexity), ID does not have specific business meaning, security issues (such as the order volume can be calculated according to the order ID increment rule, trade secret ah!)

NoSQL

In general, NoSQL scenarios use Redis more. We can increment the id atomic order by Redis incr command.

127.0.0.1:6379 >setSequence_id_biz_type 1 OK 127.0.0.1:6379> incr sequence_id_biz_type (integer) 2
127.0.0.1:6379> get sequence_id_biz_type
"2"
Copy the code

To improve availability and concurrency, we can use Redis Cluser. Redis Cluser is an official Redis cluster solution (version 3.0+).

In addition to Redis Cluser, you can also use the open source Redis clustering solution Codis (recommended for large clusters such as hundreds of nodes).

In addition to high availability and concurrency, we know that Redis is based on memory and we need to persist data to avoid data loss after a machine is restarted or fails. Redis supports two different persistence modes: snapshotting (RDB) and append-only file (AOF). In addition, Redis 4.0 starts to support mixed persistence of RDB and AOF (disabled by default and enabled through the aof-use-rdb-preamble configuration).

I won’t say much about Redis persistence here. If you don’t know this part, you can see JavaGuide’s summary of Redis knowledge points.

Advantages and disadvantages of Redis solution:

  • Advantages: Good performance and the generated ids are sequential increments
  • Disadvantages: Similar to the disadvantages of the database primary key increment scheme

In addition to Redis, MongoDB ObjectId is often used as a solution for distributed ids.

The MongoDB ObjectId requires 12 bytes of storage:

  • 0 to 3: indicates the timestamp
  • 3 to 6: indicates the machine ID
  • 7 to 8: indicates the ID of a machine process
  • 9 to 11: self-value-added

Advantages and disadvantages of MongoDB solution:

  • Advantages: Good performance and the generated ids are sequential increments
  • Disadvantages: Need to solve the problem of duplicate IDS (when the machine time is not correct, it may cause duplicate ids), security problems (ID generation is regular)

algorithm

UUID

UUID is short for Universally Unique Identifier. The UUID contains 32 hexadecimal numbers (8-4-4-4-12).

The JDK provides a one-line method for generating UUID.

// Example: cb4a9EDE-FA5E-4585-b9bb-d60bce986eaa
UUID.randomUUID()
Copy the code

An example of UUID in RFC 4122 looks like this:

We will focus on the Version here. The UUID generation rules are different for different versions.

The five different Version values have different meanings (see Wikipedia’s description of UUID) :

  • Version 1: UUID is generated based on time and node ID (usually MAC address);
  • Version 2: UUID is generated from identifier (usually group or user ID), time, and node ID;
  • Version 3, version 5: Version 5 – Deterministic UUID is generated by hashing namespace identifiers and names;
  • Version 4: UUID generation using randomness or pseudo-randomness.

Here is an example of a UUID generated under Version 1:

The default version of the UUID generated by the randomUUID() method in the JDK is 4.

UUID uuid = UUID.randomUUID();
int version = uuid.version();/ / 4
Copy the code

In addition, there are four different values for Variant, which have different meanings. I will not introduce it here. It seems that it does not need attention at ordinary times.

Refer to wikipedia for a Variant of UUID.

As you can see from the introduction above, uuids are guaranteed to be unique because they are generated by rules that include ELEMENTS such as MAC addresses, timestamps, namespaces, random or pseudo-random numbers, and timings. The uuids generated by a computer based on these rules are guaranteed to be unique.

Although UUID can be globally unique, it is rarely used.

For example, using the UUID as the primary key of the MySQL database is not appropriate:

  • The database primary key should be as short as possible, and UUID can consume a lot of storage space (32 strings, 128 bits).
  • UUID is not sequential. In InnoDB engine, database primary key disorder can seriously affect database performance.

Finally, let’s take a look at the pros and cons of UUID. :

  • Advantages: Fast generation speed, simple and easy to use
  • Disadvantages: Large storage space consumption (32 character strings, 128 bits), insecure (THE ALGORITHM to generate UUID based on MAC address may cause MAC address leakage), disorderly (non-autoincrement), no specific service meaning, need to solve the problem of duplicate ids (when the machine time is incorrect, duplicate ids may be generated)

Snowflake algorithm

Snowflake is Twitter’s open-source distributed ID generation algorithm. Snowflake consists of 64-bit binary numbers divided into several parts, each of which stores data with a specific meaning:

  • Bit 0: sign bit (indicating positive and negative), always 0, useless, ignore.
  • Bits 1 to 41: a total of 41 bits, used to represent the timestamp, in milliseconds, can hold 2 ^41 milliseconds (about 69 years)
  • 42nd to 52nd digits: a total of 10 digits. Generally, the first five digits indicate the equipment room ID, and the last five digits indicate the machine ID. (You can adjust the number based on the actual situation.) This allows you to distinguish between nodes in different clusters/machine rooms.
  • Bits 53 to 64: a total of 12 bits, used to represent the serial number. The serial number is auto-increment and represents the maximum number of ids that a single machine can generate per millisecond (2^12 = 4096). That is, a single machine can generate a maximum of 4096 unique ids per millisecond.

If you want to use The Snowflake algorithm, you usually don’t need to recreate the wheel yourself. There are many open source implementations based on the Snowflake algorithm, such as Meituan’s Leaf and Baidu’s UidGenerator, and these open source implementations optimize the original Snowflake algorithm.

In addition, we often modify the Snowflake algorithm in real projects, most commonly by adding business type information to the ids generated by the Snowflake algorithm.

Let’s look at the pros and cons of the Snowflake algorithm:

  • Advantages: Fast generation speed, orderly increment of generated ids, flexible (simple modification of Snowflake algorithm can be made, such as adding business ID)
  • Disadvantages: Duplicate ids need to be addressed (time-dependent, which can result in duplicate ids if the machine time is incorrect).

Open source framework

UidGenerator (baidu)

UidGenerator is a unique ID generator based on Snowflake.

However, UidGenerator improves Snowflake by generating the following unique ID composition.

As you can see, the composition of the unique ID generated by the original Snowflake algorithm is not quite the same. In addition, the above parameters can be customized.

The official documentation of UidGenerator is as follows:

After 18 years, the UidGenerator has barely been maintained, and I won’t cover it here. If you want to know more about UidGenerator, check out the official introduction.

The Leaf (Meituan)

Leaf is a distributed ID solution of Meituan open source. The project’s name, Leaf, comes from the German philosopher and mathematician Leibniz who said, “There are no two identical leaves in the world.” This name is really quite good, a little literary youth that flavor!

Leaf provides two modes, number segment mode and Snowflake(Snowflake algorithm), to generate distributed ids. In addition, it supports double number segment, and also solves the snowflake ID system clock callback problem. However, solving the clock problem requires a weak reliance on Zookeeper.

The birth of Leaf is mainly to solve the problem of various and unreliable methods for meituan business lines to generate distributed IDS.

Leaf improves the original number segment mode. For example, it adds a double number segment to avoid blocking the thread that requests to obtain the ID when the DB obtains the number segment. To put it simply, I take the initiative to get the next section in advance before I run out of one section (picture from Official article of Meituan: Leaf — Distributed ID Generation System).

According to the README of the project, on the basis of 4C8G VM, the QPS pressure test result was nearly 5W /s and TP999 1ms through the company RPC method.

Tinyid (drops)

Tinyid is an open source unique ID generator based on database number segment mode.

The principle of the database number segment schema has been described above. What are the highlights of Tinyid?

To understand this, let’s first look at a simple architectural solution based on the database number segment schema. (Image from Tinyid wiki: Introduction to Tinyid)

In this architectural pattern, we request a unique ID from the transmitter service through an HTTP request. The load-balancing router sends our request to one of the TinyID-Servers.

What’s wrong with this scheme? In my opinion (and in the official Tinyid wiki), there are two main questions:

  • In the case of obtaining a new number segment, the program is slow to obtain a unique ID.
  • You need to make DB highly available, which is cumbersome and expensive.

In addition, there is network overhead associated with HTTP calls.

The principle of Tinyid is relatively simple, and its architecture is shown as follows:

Compared with the simple architecture scheme based on database number segment schema, Tinyid scheme mainly makes the following optimizations:

  • Double-number segment cache: To avoid the slow process of getting a unique ID in the case of getting a new number segment. When the number segment in Tinyid is used to a certain extent, the next number segment is loaded asynchronously to ensure that there is always available number segment in memory.
  • Support for multiple dB: Support for multiple dB, and each DB can generate a unique ID, improve availability.
  • Added TinyID-client: pure local operation, no HTTP request consumption, performance and availability are greatly improved.

The advantages and disadvantages of Tinyid are not analyzed here, combining the advantages and disadvantages of database number segment mode and the principle of Tinyid can be known.

Summary of distributed ID generation schemes

In this article, I’ve basically rounded up the most common distributed ID generation schemes.

Afterword.

Finally, I recommend a great open source project for Java tutorial classes: JavaGuide. I started JavaGuide in my junior year of college when I started preparing for my fall job interview. At present, this project has 100K + STAR, related reading: 1049 days, 100K! Simple replay! .

It’s great for learning Java and preparing for your Java interview! As the author says, this is a Java learning + interview guide that covers the core knowledge most Java programmers need to master!

Related recommendations:

  • Graphic Computer fundamentals!
  • Ali ACM masters open source study notes! TQL!
  • Computer quality books search + learning route recommended!

I’m Guide, embrace open source and love to cook. Author of JavaGuide, Github: Snailclimb-Overview. In the next few years, I hope to continue to improve JavaGuide, and strive to help more people learn Java! ‘! 凎! Click here for my 2020 work Report!

In addition to the methods described above, middleware like ZooKeeper can also help us generate unique ids. There is no silver bullet, must be combined with the actual project to choose the most suitable for their program.