It is generally known that Kafka's performance comes down to three main points:

  • Sequential writes
  • Page caching
  • Zero copy

Sequential writes

Sequential writes are only achievable when appends make up the bulk of the workload. Ever since the earliest days of GFS, storage systems have gained efficiency by taking away the right to write randomly, that is, to modify files in place. The terms we see in Kafka, such as log, segment, and compaction, all derive from a data structure called the LSM tree. The LSM tree design was popularized by Google's Bigtable, which is why HBase and Kafka share so many similarities.

An LSM tree is an on-disk index structure, the counterpart of the evergreen B+ tree used in databases. A B+ tree is essentially a balanced tree that speeds up queries by maintaining an efficient sorted structure. But it writes at disk-block granularity, and even in the best case (no page splits) its writes are random.

Before getting to sequential writes in the LSM tree, it is worth pinning down what sequential writing really is. Truly sequential reads and writes mean the disk head never has to seek to another track (we all know the elevator algorithm, but we won't go into it here); in other words, the allocated disk blocks are contiguous. That cannot be fully guaranteed. What the LSM tree does guarantee is that whenever the operating system hands it contiguous space, it writes straight through it, rather than jumping back to an earlier location for a random write. And how do you get contiguous space allocated? By making a single file as large as possible. The LSM tree uses an append-only file to record all data changes, known as a log.
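
Here is a minimal sketch of such an append-only log in Java (the newline-delimited string record format is an assumption for illustration): every change is appended at the tail, and nothing is ever modified in place.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Minimal append-only log: every change is appended, nothing is rewritten.
    public class AppendOnlyLog {
        private final FileChannel channel;
        private long nextOffset; // byte offset of the next record

        public AppendOnlyLog(Path path) throws IOException {
            channel = FileChannel.open(path, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
            nextOffset = channel.size();
        }

        // Appends one record and returns the byte offset where it starts.
        public synchronized long append(String record) throws IOException {
            byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
            long offset = nextOffset;
            channel.write(ByteBuffer.wrap(bytes)); // always at the tail: sequential write
            nextOffset += bytes.length;
            return offset;
        }
    }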

In Kafka, this file is rolled roughly every 1 GB, producing a new one. A log cannot grow without bound, and cutting it into large segments makes old data easy to reclaim. This is the compromise between sequential writes and disk space management.
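
A sketch of that rolling logic, assuming a size threshold like Kafka's log.segment.bytes default of 1 GiB and Kafka's convention of naming each segment file after its base offset:

    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Sketch: roll to a new segment once the active one would exceed a size limit.
    public class SegmentRoller {
        private static final long SEGMENT_BYTES = 1L << 30; // 1 GiB, like log.segment.bytes
        private long segmentStartOffset = 0;
        private long bytesInSegment = 0;

        // Called before each append with the offset the new record will get.
        Path segmentFor(long nextOffset, int recordSize) {
            if (bytesInSegment + recordSize > SEGMENT_BYTES) {
                segmentStartOffset = nextOffset; // new segment is named after its base offset
                bytesInSegment = 0;
            }
            bytesInSegment += recordSize;
            return Paths.get(String.format("%020d.log", segmentStartOffset));
        }
    }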

So far, the LSM tree seems pretty simple. But consider: how do you query the data? And what if you need to update it (even though Kafka doesn't have to worry about that)?

In an append-only file, updates are simply appended as new entries, much like a state machine, so a query has to find the latest state for a key in the log. This doesn't seem too hard to solve: we keep an in-memory index whose key is the key of the data and whose value is the data's offset in the file. Because logs are segmented, each segment gets its own index, and every query starts from the newest segment.
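
A sketch of that per-segment index, assuming a plain hash map from key to file offset (the class and method names here are made up for illustration):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // One hash index per segment, newest segment first.
    // A lookup walks from newest to oldest, so the latest state for a key wins.
    public class SegmentedHashIndex {
        private final Deque<Map<String, Long>> segments = new ArrayDeque<>();

        public SegmentedHashIndex() { newSegment(); }

        void newSegment() { segments.addFirst(new HashMap<>()); }

        void put(String key, long offset) { segments.peekFirst().put(key, offset); }

        Long lookup(String key) {
            for (Map<String, Long> index : segments) { // newest segment first
                Long offset = index.get(key);
                if (offset != null) return offset;     // latest state for this key
            }
            return null;
        }
    }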

The data structure of this index is worth thinking about. It can be a hash index or an ordered index, but a hash cannot serve range queries and must keep every key in memory. An ordered index can be sparse, but that requires the data to be sorted; to satisfy this, the LSM tree lays segments out as sorted structures such as SSTables. Because writes to disk are batched, the sorted structure can be built in memory before being flushed to disk. That in-memory structure is usually a skip list, because balanced trees have poor concurrency.
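
A minimal memtable sketch using Java's ConcurrentSkipListMap (the choice of String keys and values is an assumption): iterating a skip list yields entries in key order, so the flushed segment comes out sorted, SSTable-style, and a sparse index can point into it.

    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.function.BiConsumer;

    // In-memory sorted buffer (a skip list, as in LevelDB's memtable),
    // flushed to disk in key order to produce an SSTable-like segment.
    public class Memtable {
        private final ConcurrentSkipListMap<String, String> map = new ConcurrentSkipListMap<>();

        void put(String key, String value) { map.put(key, value); }

        // Iteration is in sorted key order, so the on-disk segment is sorted "for free".
        void flush(BiConsumer<String, String> writer) {
            for (Map.Entry<String, String> e : map.entrySet()) {
                writer.accept(e.getKey(), e.getValue());
            }
        }
    }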

Kafka, however, doesn't need to worry about any of this, because it neither updates nor even queries by key. A consumer determines where to read by explicitly tracking the offset of the messages it has consumed, not by looking messages up by key. So Kafka can skip the sorting step entirely: the sequence number itself is the key, and it is already in order.
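
That is why Kafka's index can stay sparse and simple. A hedged sketch of that style of lookup (a TreeMap stands in here for Kafka's actual binary-searched index file): find the greatest indexed offset at or below the target, then scan the log forward from that position.

    import java.util.Map;
    import java.util.TreeMap;

    // Sparse index: only some offsets are indexed, each mapped to a file position.
    public class SparseOffsetIndex {
        private final TreeMap<Long, Long> offsetToPosition = new TreeMap<>();

        void addEntry(long offset, long filePosition) {
            offsetToPosition.put(offset, filePosition);
        }

        // Greatest indexed offset <= target; scan the log forward from there.
        long positionFor(long targetOffset) {
            Map.Entry<Long, Long> floor = offsetToPosition.floorEntry(targetOffset);
            return floor == null ? 0L : floor.getValue();
        }
    }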

Page caching

Virtual memory and swap space may be unfamiliar concepts to computer users today. Back when memory was scarce, a machine could easily grind to a halt with too many applications open. That those applications could run at all, let alone at the same time, is thanks to virtual memory, which extends a program's address space onto disk. Address translation happens so often that a dedicated piece of hardware, the MMU, handles it (Intel CPUs have had paging-related bugs). Once pages swap in and out, you naturally have a cache, and that is the term we see most often around Kafka: the page cache.

Operating systems also provide the file system with a cache, the buffer cache. Early Linux had both a page cache and a buffer cache, which was hardly an elegant design, so the two were later merged:

What is the major difference between the buffer cache and the page cache? Why were they separate entities in older kernels? Why were they merged later on?

In a sense, we can simply regard the system-provided cache as the page cache. Kafka does something very simple, a one-size-fits-all approach: it does no caching in its own code at all and relies purely on the page cache. This fits the append-only nature of a messaging system: since the page cache uses an LRU-like page-replacement policy, recently written messages stay in the page cache waiting to be consumed. It also means that as long as a consumer does not lag too far behind, it always reads from the page cache. FileRecords (formerly FileMessageSet) is the class through which Kafka writes its logs.
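
A sketch of what that looks like at the I/O layer (a simplification, not Kafka's actual FileRecords code): write through a FileChannel, keep no application-level buffer, and leave flushing to the kernel.

    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Write through the page cache; no cache of our own, no eager fsync.
    public class PageCacheWriter {
        public static void main(String[] args) throws Exception {
            try (FileChannel ch = FileChannel.open(Paths.get("demo.log"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                    StandardOpenOption.APPEND)) {
                ch.write(ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8)));
                // Deliberately no ch.force(true): the data sits in the page cache
                // and the kernel writes it back; a lag-free consumer reads it from memory.
            }
        }
    }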

There is another term that comes up constantly alongside the page cache: memory mapping. It means a process can map a file into its own virtual address space with a system call. The mapping is also lazily loaded; a page is read in only when a page fault occurs. But what does this have to do with the page cache? Everything: the defining feature of the mmap system call is that it uses the page cache directly:

Page Cache, the Affair Between Memory and Files

The differences are as follows:

Why can mmap use the page cache directly, while an ordinary read has to copy data out of the kernel buffer? One reason is that the user buffer supplied to ordinary reads and writes can itself be swapped out (it is, after all, virtual memory), so the kernel cannot simply hand over its page-cache pages. Still, mmap is not usually the first choice for reading and writing files, unless the file's pages are already resident in the page table. In Kafka, only the index files use mmap; the log files do not.
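
In Java, that mapping is exposed through FileChannel.map. A sketch along the lines of how Kafka maps its .index files (the file name, size, and entry layout here are illustrative, not Kafka's exact format):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Map an index file into memory; reads and writes go through the page cache.
    public class MmapIndexSketch {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file =
                     new RandomAccessFile("00000000000000000000.index", "rw")) {
                file.setLength(10 * 1024 * 1024); // pre-allocate the mapped region
                MappedByteBuffer index = file.getChannel()
                        .map(FileChannel.MapMode.READ_WRITE, 0, file.length());
                // No read()/write() syscalls: these touch the page cache directly.
                index.putLong(42L); // e.g. a message offset
                index.putInt(1024); // e.g. its position in the log file
            }
        }
    }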

Zero copy

Plenty of articles have analyzed zero copy; the key is to understand why an ordinary read-then-write cycle requires four copies and four context switches.

Kafka calls methods such as FileChannel.transferTo, which on Linux boils down to the sendfile system call. With DMA support, zero copy becomes possible: data moves directly from the page cache to the NIC's buffer, without detouring through a user-space input buffer and back into an output buffer (such as the socket's).
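
A sketch of that call path on the sending side (the file name, host, and port are placeholders): transferTo hands the whole transfer to the kernel, which uses sendfile where available.

    import java.io.FileInputStream;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    // Send a log file to a socket without copying it through user space.
    public class SendfileSketch {
        public static void main(String[] args) throws Exception {
            try (FileChannel log =
                     new FileInputStream("00000000000000000000.log").getChannel();
                 SocketChannel socket =
                     SocketChannel.open(new InetSocketAddress("localhost", 9092))) {
                long position = 0, remaining = log.size();
                while (remaining > 0) {
                    // sendfile under the hood on Linux; may transfer less than asked.
                    long sent = log.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;
                }
            }
        }
    }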

Besides sendfile, there are many other zero-copy techniques, mmap among them.


Of course, Kafka has plenty of other good designs, such as batch compression and the Reactor pattern, but they are discussed far less often, so we won't go into them here.