This article refers to network, assault and deletion

This series of articles will be organized into my GitHub repository for the Java Interview Guide. Check out my repository for more highlights

https://github.com/h2pl/Java-Tutorial

If you like, please click Star

The article was first posted on my personal blog:

www.how2playlife.com

This series of posts will tell you what is a distributed system, it is very important for the back-end engineer a business, we will gradually learn common distributed technology, and some of the more common distributed system concept, but also need to further understand the zookeeper, distributed transaction, distributed lock and load balancing technology, In order to give you a more complete understanding of the actual practice of distributed technology, ready for the real application of distributed technology.

If you have any suggestions or questions about this series of articles, you can also contact the author on the public account “Java Technology Jianghu”. You are welcome to participate in the creation and revision of this series of blog posts.

Distributed ids generator | architects

Transferred from: 58 Shen Jian Road of Architect 2017-06-25

First, the origin of demand

Almost all business systems have a requirement to generate a unique record identifier, such as:

  • Message ID: message-id
  • Order ID: order-id
  • Post id: Tiezi-id

This record id is often the primary key in the database, and the database will build a cluster index, that is, sort the physical storage by this field.

Queries on this record identifier often have business requirements for paging or sorting, for example:

  • Pull up the latest page of news
    select message-id/ order by time/ limit 100Copy the code
  • Pull the latest page order
    select order-id/ order by time/ limit 100Copy the code
  • Pull the latest page of posts
    select tiezi-id/ order by time/ limit 100Copy the code

Therefore, there is usually a time field and a non-cluster index is built on the time field.

A common index stores a pointer to the actual record, and its access efficiency is slower than that of a clustered index. If the record identifiers can be generated in order by time, the index query of the time field can be omitted:

select message-id/ (order by message-id)/limit 100

Note that this can be done only if message-ID generation is essentially trend-time increasing.

This leads to two core requirements for record id generation (that is, the three XXX-ids mentioned above) :

  • Globally unique
  • Trends and orderly

This is also the core problem discussed in this article: how to efficiently generate globally unique ids with ordered trends.

2. Common methods, deficiencies and optimization

Method 1: use auto_INCREMENT to generate globally unique increment IDS

Advantages:

  • Simple, using the existing database functions
  • It guarantees uniqueness
  • Increments are guaranteed
  • Fixed step length

Disadvantages:

  • Availability is difficult to guarantee: the common database architecture is a master multiple slave + read and write separation, generating self-increasing ID is a write request, the master library is dead
  • Poor scalability and performance upper bound: Because writes are single points, the write performance of the database master determines the upper bound of ID generation performance and is difficult to scale

Improvement methods:

  • Redundant primary libraries to avoid single points of write
  • Data is horizontally segmented to ensure that the ids generated by each master database are not repeated

As mentioned in above, written by a library into three libraries written, each write library set different auto_increment initial value, and the growth of the same step length to ensure that each database generated ID is different (above 0 generation 0,3,6,9 in library… , library 1 generates 1,4,7,10, and library 2 generates 2,5,8,11…)

The improved architecture ensures availability, but the disadvantages are:

  • Loss of “absolute incrementation” in ID generation: accessing library 0 to generate 0,3, and then accessing library 1 to generate 1 May result in ID generation not being absolutely incremented for a very short period of time (this is not a big problem; the goal is trend increments, not absolute increments)
  • There is still a lot of write pressure on the database, which needs to be accessed every time an ID is generated

To solve these two problems, a second common solution is introduced.

Method 2: Single point batch ID generation service

One of the important reasons why distributed systems are difficult is that “without a global clock, it is difficult to ensure absolute timing”. In order to ensure absolute timing, we can only use a single point of service and use a local clock to ensure “absolute timing”.

The database write pressure is high because the database is accessed each time the ID is generated. You can use batch method to reduce the database write pressure.

As shown in the figure above, the database uses a double master to ensure availability, and only the maximum number of current ids, such as 0, is stored in the database.

The ID generation service assumes that six ids are pulled in batches each time. The service accesses the database and changes the maximum number of ids to 5. In this way, the application access ID generation service requests ids.

When all the ids are sent out, change the maximum number of ids to 11, and you can send out ids 6,7,8,9,10,11 again, reducing the database pressure to 1/6 of its original value.

Advantages:

  • The absolute increment order of ID generation is guaranteed
  • Greatly reduce the pressure of the database, ID generation can be done to generate tens of thousands of thousands of seconds

Disadvantages:

  • The service is still a single point
  • If the service hangs, and the service is restarted, it may continue to generate ids that are discontinuous and empty in the middle. (the service memory holds 0,1,2,3,4,5, and the database max-id is 5. At 3, the service restarts.
  • Although tens of thousands of ids can be generated per second, there is still a performance ceiling that cannot be scaled horizontally

Improvement methods:

A common high availability optimization for a single point of service is the “standby service”, also known as the “shadow service”, so we can optimize the above disadvantages in the following ways (1) :

As shown in the figure above, the externally provided service is the master service, and a shadow service is always in the standby state. When the master service is suspended, the shadow service takes over.

This switching process is transparent to the caller and can be done automatically. The common technique is VIP + Keepalived, which is not explained here.

In addition, the id-Gen-service can also be extended horizontally to solve the shortcomings mentioned above (3), but it may cause consistency problems. For details, see “”.

Method 3: UUID/GUID

Whether the ID is generated from a database or a service, the business Application needs to make a remote call, which is time-consuming.

Is there a way to generate ids locally with high performance and low latency?

Uuid is a common scenario:

string ID =GenUUID();

Advantages:

  • The ID is generated locally without remote invocation and with low latency
  • Good scalability, basically you can think of no performance ceiling

Disadvantages:

  • There is no guarantee of increasing trend
  • The UUID is too long and is often expressed as a string. As a primary key, the query efficiency is low. The common optimization solution is converted to two uint64 integer storage or half-storage (uniqueness cannot be guaranteed after half-storage).

Method 4: Set the current number of milliseconds

Uuid is a local algorithm with high generation performance, but it does not guarantee trend incrementation, and it is inefficient to retrieve as a string ID. Is there a local algorithm that can guarantee incrementation?

Taking the current number of milliseconds is a common scenario:

uint64 ID = GenTimeMS();

Advantages:

  • The ID is generated locally without remote invocation and with low latency
  • The generated ids are in an increasing trend
  • The generated ID is an integer. After the index is created, the query efficiency is high

Disadvantages:

  • If the concurrency exceeds 1000, duplicate ids are generated

So this is a fatal flaw, because there’s no guarantee that ID is unique. Of course, using microseconds can reduce the probability of collisions, but you can only generate 1,000,000 ids per second at most, and any more will definitely collide, so using microseconds doesn’t solve the problem at all.

Method 5: Like snowflake algorithm

Snowflake is an open-source distributed ID generation algorithm for Twitter. Its core idea is a long ID:

  • 41bit Indicates the number of milliseconds
  • 10bit indicates the machine number
  • 12bit as the millisecond sequence number

The algorithm can theoretically generate up to 1000*(2^12) ID per second, that is, 400W ID, which can fully meet the needs of services.

It is possible to implement its own distributed ID generation algorithm by borrowing the ideas of Snowflake and combining the business logic and concurrency of each company.

For example, suppose a company ID generator service has the following requirements:

  • The peak single-machine concurrency is less than 1W, and the peak single-machine concurrency is expected to be less than 10W in the next five years
  • There are two equipment rooms, and the number is expected to be less than four in the next five years
  • The number of machines in each equipment room is less than 100
  • Currently, 5 business lines have ID generation requirements, and the number of business lines is expected to be less than 10 in the future

The analysis process is as follows:

  • The high takes the number of milliseconds from January 1, 2017 to the present (assuming the system ID generator service goes live after that time), assuming the system has been running for at least 10 years, which takes at least 10 years365 days24 hours3600 seconds1000 ms =320*10^9, almost 39bit reserved for milliseconds
  • If the single-machine peak concurrency per second is less than 10W, that is, the average single-machine peak concurrency per millisecond is less than 100, almost 7 bits are reserved for serial numbers per millisecond
  • If the number of equipment rooms is less than four within five years, 2 bits are reserved for the equipment room identifier
  • Each equipment room contains less than 100 machines. Reserve 7 bits for identifying servers in each equipment room
  • If the number of service lines is less than 10, 4 bits are reserved for identification of service lines

64-bit logos are designed to ensure that:

  • The ID generated by each line of business, each machine room, and each machine is different
  • The same machine generates a different ID every millisecond
  • For the same machine, within the same millisecond, the sequence number area distinguishes to ensure that the generated ID is different
  • Place the number of milliseconds in the highest digit to ensure that the generated ID is trending upward

Disadvantages:

  • Since “there is no global clock”, the ids assigned to each server are absolutely increasing, but the generated ids are only increasing globally (some servers are early, others are late).