Kafka time wheel algorithm

The background,

Because the former company had a project (one for operation students, which could be configured to circle the crowd and send red envelopes, coupons and SMS, Push, stand inside letter, etc Ability to push marketing push system, most of the tasks are delay time way to send, so the implementation of extensive use of the company based middleware team development delay message middleware, and can implement ability) of the specified time delivery task, at the time of reading this middleware source, I found it borrowed the idea of the time wheel algorithm, so I looked into it.

First of all, the time wheel algorithm is really a little round, but its idea is related to the model in real life, that is, the clock. After studying the relevant implementation algorithm, I am very impressed by the clock, which seems to be common and accustomed to life. The person who invented these things is really strong. Of course, those who have realized the time wheel algorithm are equally powerful. The algorithm written by others is completely understood after two days of living research, which is recorded here

Second, algorithm and source code implementation

In this paper, I also after reading some netizens analysis, because they feel they are algorithm implementation of some of the “join” no clear instructions, at that time, after watching and thinking for a long time, want to see, mainly in the article on the basis of perfecting the introduction to the content of the “points” of weak phase. So before reading this article, it is recommended to read the following basics about time wheels. The relevant articles I read are

My.oschina.net/anur/blog/2… There are also some quotes in this article that you can read as well. Here are some of my understandings:

Time wheel: I think it could also be called a “time line” because time goes straight ahead. If you use a time stamp, it’s an increasing number of longs. But why is it called a “wheel”? Before we explain this, let’s look at the hierarchical time wheel model

Figure 1. The clockCopy the code

Figure 2. Second dialCopy the code

Figure 3. Time minute second dialCopy the code

As 3 above picture, figure 1 is a clock in real life, its respectively are second hand, minute hand and hour hand, and a dial scale was divided into 60 (also can be seen as 60 grid, but here USES the scale that two said meaning is the same but because the second hand every time after the scale will fall on the next line, so good understanding some). Since there are 60 scales, it can also be regarded as a base 60 counting system, where the second hand goes around once, the minute hand goes around one scale, and the hour hand stays the same (because the hour hand goes around one scale, which is not yet an hour). So this clock uses three tree counters to represent the number of seconds in 12 hours which is 12 times 60 times 60 is 43,200 seconds. And of course for times greater than this tree it also extends to weeks, months, years, and so on, which is essentially a base idea. For a basic dial, the time accuracy of this timer is up to seconds, and you can't measure time more accurately than seconds. Although the scale on the dial is constantly circulating, time in reality is always passing, and the timestamp representing time is constantly increasing, so in fact, I think the time wheel can be represented by the following figure:

Figure 4. Real time wheelCopy the code

To explain the above image: First of all, the image above shows three dials with the same number of dials 10. The first counter dial with each scale representing 20ms forward. The second is a counter dial with each scale representing 200ms forward (that is, the first to complete a week), and the third is a counter dial with each scale representing 2000ms forward (that is, the second to complete a week). We can say the time range is fixed, of course the fixed scope is limited here, had only three wheels can imagine a respectively only second hand, minute hand and hour hand of the dial, for the same face, they all have 10 scale (clock is 60 scale), each scale represent time moving forward 20 units (units can be milliseconds, seconds, Minutes, hours and so on. Each scale of the clock is 1 unit, that is 1s). But one of the dials on the latter dial is the sum of the time from the beginning to the end of the previous dial, which is important because in order to represent the time beyond the first or second dial, you have to keep increasing the time range represented by the dial. And finally, there’s a finite time range for all three dials, but because there’s currentTime that’s constantly moving forward and getting bigger and bigger, that currentTime pointer with a scale is constantly going to be a future time stamp. (Increasing and decreasing the dial increases the future time range represented)

For time-dependent tasks, we hang the task below the corresponding point in time and wait for the time to execute the task. For a delayed task, the point at which it is mounted is at some point in the future. So if you analyze it, all you do is hang the task at some point in time, and then drive the currentTime pointer forward, and at some point in time there’s a task that’s mounted below it, execute it. Again abstract, let’s take a few examples:

* * * * * * * * * * * * * * * * * * * * * * * *

For example, 0:

I want to play a song (a task) at 20:00:00.000 now (the timestamp is 0) at 20:00:00.005 after 5ms, so obviously I just need to hang the task at the currentTime pointer in Figure 4. Why is that? Because the minimum particle size of the dial a scale on behalf of 20ms, 5ms included can not distinguish, directly executed.

Example 1:

I want to play a song (a task) at 20:00:00.000 now (the timestamp is 0) at 20:00:00.023 20ms later, so obviously I just need to hang the task on the scale between 20 and 40 in Figure 4. Then currrentTime continues at a scale of 20ms, finds a task at the boundary between 20 and 40, and fetches the task. Now the actual execution time of the task is 20:00:00.023, and currentTime is 20:00:00.020, they are only 3ms apart. The time is less than one scale, so it is executed immediately

Example 2:

Other conditions are the same as in example 1, I just want to add 230 milliseconds to the current time, which is 20:00:00: At 230, an alarm clock is executed, at which time the task should be hung on the line intersecting 200 and 400, because at this time the second wheel, each scale represents 200ms, which is equivalent to the precision of the dial is 200ms, there is no way to distinguish a more accurate range, so the tasks between 200ms and 400 will be at 20: 00:00:20 0 was out, but at the moment, we need to take out the task of doing the check, whether this task should be performed, because actually we is the smallest dial minimum scale in 20 ms, namely accuracy, also said that we can accurately to the current time + 20 ms within the time, but at the moment we come out with the task of its execution time is 20: 00:00:230 (current time is 20:00:00: 200), it needs to be executed after 30ms, that is to say, the 30ms is greater than the calibration time of our minimum dial, we cannot execute it yet, we need to wait for the real time consumption of 20ms (the minimum dial completes one calibration, that is, 20ms), at this time the current time is 20:00:00: 220, the task execution time is only 10ms more than the current time, less than the smallest disk precision 20ms can be executed. ———— solution: because the task was created at 20:00:00.200, currentTime is the currentTime. The task execution time is only 30ms longer than this point in time, so the task should be mounted again at this point and placed on the 20 and 40 boundary.

In addition: the minimum scale is also called precision can also be understood as atomic operations can not be divided, are scale by scale execution, not to the middle of the execution

Well, the task execution mechanism associated with the three examples above has been clarified. Let’s analyze the source code:

The first is the introduction of TimeWheel:

A timewheel is defined as follows: TickMs is the time range represented by a scale wheelSize is the number of scales interval represents the time range represented by a dial interval=tickMs*wheelSize buckets It is used to hold tasks mounted on a time scale on the time wheel. The bottom table of the array is actually a scale mark. By dividing the difference between the execution time of the task and the current time by the range represented by a scale, namely tickMs, and supplemiting wheelSize, we can get the table below the scale, namely the table below buckets array, and then mount the task. CurrentTimestamp this is the current time of each time round itself, currentTimestamp is an integer multiple of precision, The delayQueue is a blocking queue that stores these buckets and a block queue that drives the use of time, as explained belowCopy the code

Different tasks may have the same execution time, so all tasks with the same execution time are mounted under the same bucket.

Figure 6. Specific deferred tasksCopy the code

OK, we already have the two basic elements needed for the system. So what's the algorithm that makes them work

1: a Timer manager

As we all know, according to the design mode, asynchronous task execution must have a time task manager — Timer. This manager provides an interface for the client to add delayed tasks at any time. The second is to be able to perform the task at the time specified by the task. Let’s do it one by one. Let’s start with a normal business scenario and add some delayed tasks

As follows:

Figure 7. Timer# addTask (TimedTask)Copy the code

As shown above, the logic is clear: the specific adding work is completed by time round, and the Timer is just for management, so the specific work still needs to be dispatched. A very important logic is that if the 68 lines cannot be added and not cancelled, they are handed over to the worker thread pool for execution. This is an important point, but we are clearly adding logic, why to execute, this is because if a task is going to be executed for less than the minimum precision of 20ms, then what is added, just execute. Just like in example 0. Look at the logic of timewheel.addTask (timedTask) :

Figure 8. TImer#addTask(TimedTask TimedTask) the main logic is as shown above: 1: task execution time - current time < current time round minimum precision return false, that is, direct execution does not need to mount 2: Task execution time - if the current time is greater than or equal to the current time wheel and the minimum precision is less than the whole time range of the current time wheel, it indicates that it can still be mounted under a certain scale. Then, according to the amount of time offset % scale, the position of bucket in the following table can be obtained. Then place the timedTask under the corresponding bucket, noting that the bucket is mounted under a certain scale. Time is an integer multiple of the scale. If this scale is the first task under this bucket, then the bucket is placed in a delay queue (why delay queue? Bucket can add that bucket has implemented interface Delayed. Why? 3: If the time range represented by this time wheel is insufficient, then go to the previous time wheel to perform the same logic for mountCopy the code

Take a look at how the last time wheel was constructed:

Figure 9 Timer# getOverflowWheel ()Copy the code

As shown above: for the parent time wheel, the time range represented by its scale is the time range of this time wheel. The number of scales is constant. The delayQueue is always the same, and the three time wheels are all inserted into a delayQueue. That's the logic for adding a task to summarize:

1: According to the comparison between the task time and the current time, if it is smaller than the scale time range of the current time wheel, execute 2 directly. If it is greater than or equal to the scale time that can be represented by the entire time wheel, the bucket is added to the corresponding bucket. The bucket is added to the delay queue 3: If it is greater than the time represented by the current time wheel, then the parent time wheel is found (created if null) and the same logic continues from 1Copy the code

CurrentTime is still the same timestamp as when it was created. If currentTime is not moved forward, the distance between it and the task mounted under the timeline is not narrowed. How can the task trigger execution?

So the immediate question is how do we move currentTime forward? The answer is DelayQueue

Remember that the bucket is placed in the DelayQueue, and the bucket is placed in order of the size of the bucket from the smallest to the largest, of course, after the current time. As shown in Figure 4, buckets are mounted one by one in the direction of the timeline. If the time is not up, no queue will be sent out. In the bucket, all tasks that need to be executed at this point in time will be placed. Delayqueue.poll (timeout, timeununit.milliseconds) :

Figure 10 Timer# advanceClock (long timeout)

So above:

1: Timeout is generally the scale time of a smallest disk. In this paper, 20ms is used to simulate the minimum disk time. Within the time of advancing one scale, the bucket mounted at this scale can be retrieved. Note that this bucket is out of time in the real world, but be careful

The time of the bucket is the scale time, which is an integer multiple of the scale time range. It is also the start time of the scale. The specific time of the tasks in the bucket can be different

2: If you go through a scale that has a task bucket mounted on it, you update the currentTime of the currentTime wheel first, which is important because the delay queue or the real time is actually moving forward, and if you find a task, the currentTime of the time wheel should be updated, And this update is a recursive attempt to update the parent time wheel. The source code is as follows:

Figure 11. TimeWheel# advanceClock (long timestamp)Copy the code

1: Timestamp represents the time when the bucket was pulled out. For the smallest time round, timestamp is at least equal to currentTimestamp + because the time has moved forward by one scale TickMs, timestamp is greater than currentTimestamp + tickMs if delayqueue.poll skips a few ticks where no data is mounted. CurrentTimestamp will be consistent with the current time

2: Try to update the parent time wheel, this operation will be trigger as long as the overflowWheel is not null, but the parent currentTimestamp is not necessarily changed, because the child time wheel moves forward 20m at a time, it takes 10 times to reach the parent time wheel tickMs

Final step:

This is essentially processing the tasks in the pulled bucket, because the bucket could be pulled from the second or third time dial, Therefore, the execution time of timedTasks in these buckets may be greater than the time represented by one scale of the smallest disk, because the time range represented by one scale of the second or three time wheels is too large, For example, the first scale of the second time round indicates the current time +200ms. All tasks within this time range are placed in the same bucket. Therefore, we need to re-execute addTask(TimedTask) for tasks in the bucket because the current time has changed. So they need to recalculate the number of their time round and the bucket to which the task belongs. The code is as follows:

Figure 12. Bucket. Flush (Consumer < TimedTask > flush)Copy the code

Consumer Flush is addTask(TimedTask TimedTask) (this method has been analyzed above), so the logic is to iterate through the tasks in the bucket and then add them again

2. At the same time, bucket is an element of a delay queue. 3. After remove the bucket, updated every time round the current time, to reduce the gap and task execution time more in the future 5: traversal take out the task in the bucket, if the task execution time and the time difference is less than the minimum time round a time range scale, can perform the above is the whole execution process by answering a few questions: If we do not use delay queue to simulate time advance, we can also use the while loop, but this will be in rotation, the CPU can not carry, if the while sleep, it can not guarantee the minimum precision of execution 2: Firstly, bucket implements the elements that belong to the blocking queue, can the tasks be directly delayed to the specific time without bucket, then there will be too many queued elements and the time complexity of queue entry will be O(logn). However, if bucket is used, The number of times an element is queued is reduced, and the task location bucket is an O(1) complexity reference:  https://my.oschina.net/anur/blog/2252539 https://juejin.cn/post/6844903648397525006 https://www.jianshu.com/p/87240220097bCopy the code

The background,

Second, algorithm and source code implementation

For example, 0:

Example 1:

Example 2:

1: a Timer manager

Related Posts

Recommend a code visualization site for Python beginners!

Share a Flink checkpoint failure problem and solution

MybatisPlus source code analysis, from MybatisPlusAutoConfiguration to start