Preface

This article starts from Kafka's index and, scenario by scenario, describes how algorithms are flexibly applied in engineering. It is simply what I felt like writing down while reading the source code.

The importance of indexes

Indexes are no strangers to us: the table of contents of a book is an index in everyday life. A few extra pages let us quickly find what we need; a little redundancy in space buys a big reduction in lookup time. This exchange of space and time contains a philosophy of its own.

In engineering, database indexes are indispensable; without them it is hard to imagine how such huge volumes of data could be retrieved.

Now that the importance of indexes is clear, let’s take a look at how indexes are implemented in Kafka.

Indexing practices in Kafka

First, Kafka's indexes are sparse, which keeps index files from taking up too much memory and lets more index entries be held in memory. How sparse is controlled by the broker parameter log.index.interval.bytes, whose default value is 4KB: a new index entry is created for every 4KB of messages written.
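To make the sparsity concrete, here is a toy sketch in Scala, loosely modeled on how Kafka's LogSegment decides when to append an index entry (an illustration under that assumption, not the actual source):

```scala
// Toy sketch of sparse index building (illustrative, not Kafka's code).
// An index entry is appended only after at least indexIntervalBytes of
// messages have accumulated since the last entry.
class SparseIndexBuilder(indexIntervalBytes: Int = 4096) {
  private var bytesSinceLastEntry = 0
  private val entries = scala.collection.mutable.ArrayBuffer.empty[(Long, Long)]

  // Called for every message appended to the log segment.
  def onAppend(offset: Long, filePosition: Long, messageSize: Int): Unit = {
    if (bytesSinceLastEntry > indexIntervalBytes) {
      entries += ((offset, filePosition)) // index roughly one message per 4KB
      bytesSinceLastEntry = 0
    }
    bytesSinceLastEntry += messageSize
  }

  def entryCount: Int = entries.size
}
```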

There are three broad categories of indexes in Kafka: offset indexes, timestamp indexes, and aborted transaction indexes, corresponding to .index, .timeindex, and .txnindex files respectively.

The associated source code is as follows:

AbstractIndex.scala: abstract class that encapsulates operations common to all indexes

OffsetIndex.scala: the offset index, which stores the mapping between offset values and the corresponding physical positions on disk

TimeIndex.scala: the timestamp index, which stores the mapping between timestamps and the corresponding offset values

TransactionIndex.scala: the aborted transaction index, which only appears when Kafka transactions are enabled

AbstractIndex's definition is thoroughly commented in the code. Among its members there is an entrySize; this is the size of each index entry, and every index entry has a fixed size.

entrySize

In OffsetIndex, override def entrySize = 8, i.e. 8 bytes; in TimeIndex, override def entrySize = 12, i.e. 12 bytes.

Why 8 and 12?

In OffsetIndex, each index entry contains an offset value and a physical disk position, each a 4-byte integer, hence 4 + 4 = 8. The physical position fits an integer easily, but a Kafka offset is an 8-byte long, so how can it fit in 4 bytes?

The answer: the stored value is actually a relative offset, i.e. the real offset minus the segment's baseOffset.

Can the relative offset always be stored as an integer? Yes. The log segment size parameter log.segment.bytes is itself an Int, so within the log segment that an index file corresponds to, the difference between any offset and the baseOffset must fall within the integer range.

Why go to the trouble of storing a difference?

1. To save space: each index entry saves 4 bytes. Think of companies that handle trillions of messages a day.

2. Because memory is precious: the shorter an index entry, the more entries fit in memory and the higher the probability that a lookup hits memory directly. This is the same reason MySQL InnoDB recommends that primary keys not be too long: every secondary index stores the primary key's value, so the longer the primary key, the more memory each index entry occupies, the fewer entries a cache page can bring in from disk at a time, and the more disk accesses a query may need. And disk access, as we all know, is slow.

Converting between the two in the source code is a simple operation:
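(The code shown in the original boils down to something like the following sketch, modeled on AbstractIndex's relative-offset handling; it is a paraphrase, not the verbatim source.)

```scala
// Paraphrased sketch of the absolute <-> relative offset conversion
// (modeled on Kafka's AbstractIndex; not the verbatim source).
class OffsetConversion(val baseOffset: Long) {

  // Absolute offset -> the 4-byte relative offset stored in an index entry.
  def toRelative(offset: Long): Option[Int] = {
    val relative = offset - baseOffset
    if (relative < 0 || relative > Int.MaxValue) None // must fit in an Int
    else Some(relative.toInt)
  }

  // The 4-byte relative offset read from an entry -> absolute offset.
  def toAbsolute(relativeOffset: Int): Long = baseOffset + relativeOffset
}
```

For a segment with baseOffset = 1000, offset 1500 is stored as 500, and reading it back gives 1000 + 500 = 1500.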

The above explains why the offset value takes 4 bytes. In TimeIndex, the 8-byte timestamp plus the 4-byte offset value gives 12 bytes.
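Side by side, the two entry layouts look roughly like this (an illustrative sketch using java.nio, not Kafka's actual writer code):

```scala
import java.nio.ByteBuffer

// Illustrative layout of the two fixed-size entry formats (not Kafka's code).
def writeOffsetEntry(buf: ByteBuffer, relativeOffset: Int, position: Int): Unit = {
  buf.putInt(relativeOffset) // 4 bytes: offset relative to the segment's baseOffset
  buf.putInt(position)       // 4 bytes: physical position in the log file
}                            // total: 8 bytes per entry

def writeTimeEntry(buf: ByteBuffer, timestamp: Long, relativeOffset: Int): Unit = {
  buf.putLong(timestamp)     // 8 bytes: message timestamp
  buf.putInt(relativeOffset) // 4 bytes: offset relative to baseOffset
}                            // total: 12 bytes per entry
```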

_warmEntries

What’s this for?

Let's start with a question: an index entry lets us quickly locate a message within a log segment, but how do we quickly find the desired index entry itself? An index file defaults to 10MB and an offset index entry is 8 bytes, so a single file may contain over a million index entries.

Both messages and index entries are appended with monotonically increasing offsets, so the data is ordered. For fast lookups over ordered data, the first thing that comes to mind is binary search!

So binary search it is!

But then what does this have to do with _warmEntries? In other words, what's wrong with plain binary search?

In Kafka's case, index entries are appended at the end of the file, and newly written data is usually read soon afterwards, so the hot spot is at the tail. Operating systems cache and manage memory on a per-page basis, and since memory is limited, pages are evicted via LRU-like mechanisms.

LRU seems like a good fit for the Kafka scenario, but with standard binary search there is still the possibility of page faults, because binary search jumps around the file.

Kafka's source comments put it clearly:

when looking up index, the standard binary search algorithm is not cache friendly, and can cause unnecessary page faults (the thread is blocked to wait for reading some index entries from hard disk, as those entries are not cached in the page cache)

The comments also walk through a friendly example.

To put it simply, suppose the index occupies 13 pages in the page cache and data has been written up to page 12. Lookups mostly target the data on page 12, so a binary search touches pages in the order 0, 6, 9, 11, 12. These pages are sure to be in the page cache because they are accessed frequently.

When page 12 fills up, a new page 13 is allocated to hold further index entries. By the nature of binary search, lookups now touch pages in the order 0, 7, 10, 12. But pages 7 and 10 haven't been accessed in a long time, so they have probably been evicted from the cache, and reading them means going to disk. According to the comment, in their tests this caused delays of anywhere from a few milliseconds up to a second.
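You can simulate this effect yourself. The sketch below (my own illustration, assuming 4KB pages and 8-byte entries, i.e. 512 entries per page) records which page each binary-search probe lands on; comparing the results before and after the index grows onto a new page shows the probe pages shifting:

```scala
import scala.collection.mutable.LinkedHashSet

// Which cache pages does a standard binary search touch? (Illustration only;
// assumes 4KB pages and 8-byte entries, i.e. 512 entries per page.)
// We look up the entry at slot `target`, so probes compare slot indices directly.
def pagesTouched(numEntries: Int, target: Int, entriesPerPage: Int = 512): Seq[Int] = {
  val touched = LinkedHashSet.empty[Int]
  var lo = 0
  var hi = numEntries - 1
  while (lo < hi) {
    val mid = (lo + hi + 1) >>> 1
    touched += mid / entriesPerPage // record the page this probe lands on
    if (mid > target) hi = mid - 1 else lo = mid
  }
  touched.toSeq
}

// Compare pagesTouched(13 * 512, 13 * 512 - 1) with
// pagesTouched(13 * 512 + 1, 13 * 512): once the index spills onto a new
// page, the set of probed pages changes, hitting pages that may have been evicted.
```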

To address this, Kafka uses a modified binary search. The search procedure itself is untouched; instead, the index entries are divided into a hot (warm) section at the tail and a cold section, and lookups probe the hot section first.

With this improvement, queries for hot data always traverse the same few pages, avoiding page faults.

It is a little reminiscent of consistent hashing versus ordinary hashing: when a node is added, most cache accesses still land where they used to, and only a small amount of data migrates.

Okay, now let's look at the source code.
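The original shows Kafka's AbstractIndex here; a self-contained paraphrase of the warm/cold lookup logic looks roughly like this (simplified, with entries modeled as a sorted array of longs, so it is a sketch rather than the real source):

```scala
// Self-contained sketch of Kafka's warm/cold binary search
// (modeled on AbstractIndex.indexSlotRangeFor; simplified, not verbatim).
class WarmColdIndex(entries: Array[Long], entrySizeBytes: Int) {

  // Kafka treats the last 8192 bytes of the index as the "warm" section.
  private val warmEntries: Int = 8192 / entrySizeBytes

  // Returns the slot of the largest entry <= target, or -1 if none exists.
  def largestLowerBoundSlotFor(target: Long): Int = {
    if (entries.isEmpty || entries(0) > target) return -1

    def binarySearch(begin: Int, end: Int): Int = {
      var lo = begin
      var hi = end
      while (lo < hi) {
        val mid = (lo + hi + 1) >>> 1 // round up so the loop converges on the lower bound
        if (entries(mid) > target) hi = mid - 1 else lo = mid
      }
      lo
    }

    val firstWarmEntry = math.max(0, entries.length - 1 - warmEntries)
    // Most lookups hit freshly appended data, so probe the warm section first:
    // such searches only touch the last few pages, which stay in the page cache.
    if (entries(firstWarmEntry) <= target)
      binarySearch(firstWarmEntry, entries.length - 1)
    else
      binarySearch(0, firstWarmEntry) // cold section: plain binary search
  }
}
```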

The implementation isn't hard, but why do the 8192 bytes at the tail count as the hot section?

Here the source comments are, once again, very detailed:

  1. This number is small enough to guarantee all the pages of the “warm” section is touched in every warm-section lookup. So that, the entire warm section is really “warm”. When doing warm-section lookup, following 3 entries are always touched: indexEntry(end), indexEntry(end-N), and indexEntry((end*2 -N)/2). If page size >= 4096, all the warm-section pages (3 or fewer) are touched, when we touch those 3 entries. As of 2018, 4096 is the smallest page size for all the processors (x86-32, x86-64, MIPS, SPARC, Power, ARM etc.).
In other words: 4096 bytes is the smallest page size on current processors, so an 8192-byte warm section spans at most 3 pages, and every page used by a warm-section binary lookup gets touched, and therefore stays hot.
  2. This number is large enough to guarantee most of the in-sync lookups are in the warm-section. With default Kafka settings, 8KB index corresponds to about 4MB (offset index) or 2.7MB (time index) log messages.
In other words, the 8KB warm section covers about 4MB (offset index) or 2.7MB (time index) of message data, enough for most in-sync followers' lookups to land in the warm section. The numbers follow from the entry sizes: 8192 / 8 = 1024 offset-index entries, at one entry per 4KB of messages, covers 1024 × 4KB = 4MB; 8192 / 12 ≈ 682 time-index entries covers roughly 682 × 4KB ≈ 2.7MB.

That explains what _warmEntries is and why it is needed.

As you can see, applying a textbook algorithm in a real project depends on the specific scenario; it cannot be used mechanically. A thorough understanding of the algorithm matters: if you learned binary search only by rote, you would never spot the problem above. So does knowledge of the underlying system; without it, you wouldn't know the access pattern was cache-unfriendly.

From Kafka's hot/cold index sections to MySQL InnoDB's buffer pool management

How does MySQL manage its buffer pool? Instead of standard LRU eviction, MySQL splits the buffer pool's LRU list into a young (new) generation and an old generation. The default split is roughly 3:7: the tail of the list is the old generation and the front is the young generation (in InnoDB the boundary is set by innodb_old_blocks_pct, default 37%).

MySQL's buffer pool partitioning is designed to address two problems: read-ahead failure and cache pollution.

1. Read-ahead failure: a prefetched page may turn out never to be used, making the prefetch wasted work. To contain the damage, prefetched pages are inserted at the head of the old generation, and eviction happens from the old generation's tail, so the young generation's data is unaffected.

2. Cache pollution: a full table scan reads in a lot of cold data, some of it queried very rarely, so the right choice is to let it live only in the old generation and be evicted quickly. Generations alone are not enough, though, so MySQL adds a time window, 1s by default (innodb_old_blocks_time): a page in the old generation is promoted to the young generation only if it is accessed again more than 1s after it first entered. That way a scan cannot pollute the young generation's hot data, as the sketch below illustrates.
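As a concrete illustration, here is a toy midpoint-LRU in Scala (my own sketch, not MySQL's implementation; oldPct and windowMs stand in for innodb_old_blocks_pct and innodb_old_blocks_time):

```scala
import scala.collection.mutable

// Toy sketch of InnoDB-style midpoint LRU: a young generation at the head,
// an old generation at the tail, and a promotion time window. Illustration only.
class MidpointLru(capacity: Int, oldPct: Double = 0.37, windowMs: Long = 1000) {
  private case class Entry(page: Int, firstAccessMs: Long)
  private val buffer = mutable.ArrayDeque.empty[Entry]

  // Index where the old generation starts (everything before it is young).
  private def oldBoundary: Int = (buffer.size * (1 - oldPct)).toInt

  def access(page: Int, nowMs: Long): Unit =
    buffer.indexWhere(_.page == page) match {
      case -1 =>
        if (buffer.size >= capacity) buffer.removeLast() // evict from the old tail
        // New pages (prefetched or scanned) enter at the head of the old generation.
        buffer.insert(oldBoundary min buffer.size, Entry(page, nowMs))
      case i if i >= oldBoundary =>
        val e = buffer.remove(i)
        if (nowMs - e.firstAccessMs > windowMs)
          buffer.prepend(e)   // re-accessed after the window: promote to young
        else
          buffer.insert(i, e) // re-accessed too soon (e.g. a fast scan): stays old
      case i =>
        buffer.prepend(buffer.remove(i)) // already young: move to the front
    }

  def pages: Seq[Int] = buffer.map(_.page).toSeq
}
```

A full table scan touches each page once (or several times within the window), so scanned pages never leave the old generation, while genuinely hot pages get promoted and survive.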

Summary

The article started from indexes, a trade of space for time. It then showed how Kafka's index saves space by storing relative offset values, described how index entries are located via binary search, explained why Kafka's usage pattern led to an improved binary search over hot and cold sections, drew an analogy to consistent hashing, and finally connected the hot/cold idea to the LRU variant that manages MySQL's buffer pool.

Step by step, this shows how algorithms are flexibly applied and adapted in engineering. Some students think algorithms are useless and grind algorithm problems only for interviews; in fact, all kinds of middleware and low-level implementations demonstrate just how much algorithms matter.

Anyway, my head is a little cold.