Starting with this article, I’m going to take a deep look at Kafka’s design principles from a micro perspective, beginning with the most representative part of Kafka: its storage design.

If you are not familiar with Kafka’s storage design, you may wonder why Kafka uses logging as its primitive for storing messages rather than a database or a KV store. If you already know something about Kafka, you can probably rattle off terms like append-only writes, linear scans, sequential disk I/O, page cache, zero copy, sparse indexing, and binary search.

I plan to write two articles that will not only clear up the confusion above but also give you the context to grasp the essence of Kafka’s storage design and tie the pieces together. I also hope that, after understanding Kafka’s storage design, you will gain a deeper appreciation of append-only data structures, a classic underlying storage principle that has driven many influential storage systems to success in the industry, including HBase, Cassandra, and RocksDB.

What are the storage challenges of Kafka?

Why is storage design the essence of Kafka? As analyzed in the previous article, Kafka simplified its message model and positioned itself as a massive message storage system. Having subtracted other features, Kafka inevitably had to excel at storage in order to achieve performance that other MQs cannot match.



Figure 1: Kafka’s message model

But before we get into Kafka’s storage scheme, it’s worth asking why Kafka chose logging in the first place. What was the basis for that choice?

That’s what this series is trying to do: think rather than just remember, and ask why more often than memorize what.

Kafka’s storage selection logic, I think, resembles how we choose storage for business requirements: MySQL, Redis, or something else? It depends on the scenario.

Let’s try to analyze it from the following two dimensions:

  • 1. Functional requirements: What data is stored? At what scale? How long must it be retained? What are the CRUD scenarios?
  • 2. Non-functional requirements: What are the performance and stability requirements? Does scalability need to be considered?

Going back to Kafka, its functional requirements include at least the following:

The data stored is mainly message streams: messages can be simple text strings or complex custom formats.

For the Broker, however, it only handles message delivery and does not care about the message content itself.

The amount of data is huge: Kafka was incubated at LinkedIn to handle real-time log streams (event-tracking data from user activity, operational monitoring metrics, and so on), and LinkedIn’s business at the time already processed hundreds of billions of messages a day.

The CRUD scenarios are simple enough: the core function of a message queue is a data pipeline that provides only pass-through capability, so its CRUD operations are indeed simple.

First, messages are essentially notification events: they are only appended, with no updates to consider at all. Second, for the Consumer side, the Broker provides the ability to query messages by offset or timestamp. Third, messages that have not been consumed for a long time (say, seven days) are deleted periodically.
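To make these functional requirements concrete, here is a minimal sketch (illustrative only, nothing like Kafka’s actual implementation; all names are made up) of the three operations a broker’s storage layer must support: append, read from an offset, and time-based deletion of expired messages.

```python
import time

class SimpleLog:
    """Append-only message log with offset reads and time-based retention."""

    def __init__(self, retention_secs=7 * 24 * 3600):
        self.records = []          # list of (offset, timestamp, payload)
        self.next_offset = 0
        self.retention_secs = retention_secs

    def append(self, payload, ts=None):
        """Messages are only ever appended; there is no update operation."""
        ts = time.time() if ts is None else ts
        offset = self.next_offset
        self.records.append((offset, ts, payload))
        self.next_offset += 1
        return offset

    def read_from(self, offset, max_count=10):
        """Query messages starting at a given offset."""
        return [r for r in self.records if r[0] >= offset][:max_count]

    def delete_expired(self, now=None):
        """Periodically drop messages older than the retention window."""
        now = time.time() if now is None else now
        cutoff = now - self.retention_secs
        self.records = [r for r in self.records if r[1] >= cutoff]

log = SimpleLog()
log.append(b"event-a", ts=100.0)
log.append(b"event-b", ts=200.0)
print([r[2] for r in log.read_from(1)])   # → [b'event-b']
```

Note what is missing as much as what is present: no in-place update, no delete-by-key, no secondary queries. That narrowness is exactly what lets Kafka pick a storage structure optimized for appending.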

Next, let’s look at non-functional requirements:

Performance requirements: As mentioned in the previous article, LinkedIn initially tried ActiveMQ for data transfer, but its performance fell short of the requirements, which is why they decided to build Kafka themselves. ActiveMQ’s single-machine throughput is around ten thousand TPS, while Kafka’s is clearly an order of magnitude higher.

Stability requirements: Message persistence (ensuring that historical data is not lost after a machine restart) and how quickly a single failed Broker can recover and resume serving are also factors Kafka must consider.

Scalability requirements: Since Kafka faces massive data volumes, storage scalability must be designed in from the start.

To summarize, Kafka’s storage requirements are as follows:

  • 1. Functional requirements: simple enough: append-only writes with no updates, message queries by consumer offset or timestamp, and periodic deletion of expired messages.
  • 2. Non-functional requirements: this is where the difficulty lies, since Kafka is a highly concurrent system and is bound to face the three classic challenges of high performance, high availability, and high scalability.

Storage selection analysis of Kafka

With the above requirements sorted out, let’s move on to the analysis: why did Kafka end up using logging to store messages, rather than the more familiar relational databases or key-value stores?

2.1 Basic knowledge of the storage field

Let’s first go over some basic knowledge of the storage field, which forms the theoretical basis for the analysis that follows.

  • 1. Memory access is fast, but memory is small and expensive, making it unsuitable for long-term data storage.
  • 2. Disk access is relatively slow, but disks are cheap and persistent.
  • 3. Disk I/O time is dominated by seek time and rotational latency; the most effective way to improve disk I/O performance is to reduce random I/O and increase sequential I/O.
  • 4. Disk I/O is not necessarily slower than memory access; it depends on how we use it.

As for disk vs. memory I/O speed, many comparative benchmarks show that sequential disk writes can reach hundreds of MB/s, while random writes manage only a few hundred KB/s, a difference of roughly a thousand times. Moreover, sequential disk I/O can even outperform random memory I/O.
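If you want to get a feel for this on your own machine, here is a rough micro-benchmark sketch. Absolute numbers vary enormously with hardware, filesystem, and page cache (and on SSDs the gap is far smaller than on spinning disks), so treat the ratio, not the values, as the takeaway; the file names and sizes are arbitrary.

```python
import os
import random
import time

BLOCK = 4096   # write granularity in bytes
COUNT = 2048   # number of blocks (8 MB total)
data = os.urandom(BLOCK)

def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def sequential_write(path="seq.bin"):
    # Append blocks back to back: the disk head (or SSD controller)
    # sees one long contiguous stream.
    with open(path, "wb") as f:
        for _ in range(COUNT):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())

def random_write(path="rnd.bin"):
    # Same bytes, same file size, but written in shuffled order,
    # forcing a seek before every block.
    with open(path, "wb") as f:
        f.truncate(BLOCK * COUNT)   # pre-size the file
        offsets = list(range(COUNT))
        random.shuffle(offsets)
        for i in offsets:
            f.seek(i * BLOCK)
            f.write(data)
        f.flush()
        os.fsync(f.fileno())

seq = timed(sequential_write)
rnd = timed(random_write)
print(f"sequential: {seq:.4f}s, random: {rnd:.4f}s")
```

With the page cache absorbing small writes, a single run may understate the difference; running with larger files or with `O_DIRECT`-style tooling (e.g. `fio`) gives a more faithful picture.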



Figure 2: Disk vs. memory I/O speed comparison

In the field of data storage, there are two “extreme” directions of development:

  • 1. Speeding up reads: indexes (B+ trees, binary search trees, etc.) improve query speed, but the index must be maintained on every write, which lowers write efficiency.
  • 2. Speeding up writes: a pure log, with data appended sequentially and no index at all, makes writes extremely fast (theoretically close to raw disk write speed), but without index support, query performance is poor.

Based on these two extremes, three representative underlying index structures have been derived:

  • 1. Hash index: a hash function maps each key to the storage address of its data. It suits simple scenarios such as equality queries, but is powerless for comparison or range queries.
  • 2. B/B+ tree index: the most common index type, oriented toward read performance; it is the underlying structure of many traditional relational databases, such as MySQL and Oracle.
  • 3. LSM tree index: data is appended to log files, optimizing writes without significantly sacrificing read performance; it is the underlying structure of many NoSQL storage systems, such as BigTable, HBase, Cassandra, and RocksDB.

With these theoretical foundations in mind, we continue to think about Kafka’s storage requirements.

The characteristics of Kafka’s business scenario are:

  • 1. Write operations: very high concurrency, on the order of millions of TPS, but writes are sequential and updates never need to be considered.
  • 2. Query operations: simple requirements; messages can be queried by offset or timestamp.

Appending messages to log files is clearly the best fit for Kafka’s million-level write TPS.

All that remains is how to query efficiently. If a B-tree-style index were used, the index would need to be maintained on every write (a random I/O operation), and costly operations such as page splits could occur. These costs are too heavy for Kafka, which only needs simple queries, so B-tree-style indexes do not suit Kafka.

Hash indexes, on the other hand, look promising. To speed up reads, we only need to maintain an in-memory map from message offset to the message’s byte position in the log file: for each lookup by offset, we fetch the position from the hash table and read the file at that position. (The same idea applies to querying messages by timestamp.)
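As a sketch of that idea (names and record format invented for this example, not Kafka’s code), a hash-indexed log keeps one in-memory dictionary entry per message, mapping its offset to a byte position in the file:

```python
import struct

class HashIndexedLog:
    """Append-only log with an in-memory offset -> file-position hash index."""

    def __init__(self, path):
        self.f = open(path, "wb+")
        self.index = {}          # message offset -> byte position in file
        self.next_offset = 0

    def append(self, payload: bytes):
        pos = self.f.tell()
        self.index[self.next_offset] = pos
        # Length-prefixed record: 4-byte big-endian size + payload.
        self.f.write(struct.pack(">I", len(payload)) + payload)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read(self, offset):
        pos = self.index[offset]   # O(1) lookup, but one entry per message
        self.f.seek(pos)
        (size,) = struct.unpack(">I", self.f.read(4))
        return self.f.read(size)

log = HashIndexedLog("hash_indexed.log")
log.append(b"hello")
log.append(b"kafka")
print(log.read(1))   # → b'kafka'
```

The `O(1)` lookup is attractive, but note the cost line: `self.index[self.next_offset] = pos` runs once per message, and that is exactly what makes this approach memory-hungry at Kafka’s scale.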

However, a hash index must reside in memory, which clearly cannot cope with massive data volumes: Kafka can write millions of messages per second, and one index entry per message would overwhelm memory.

Then we notice that message offsets are naturally ordered (in fact a monotonically increasing long), so messages themselves are ordered within the log file. We can therefore divide messages into blocks and, instead of indexing every message, index only the offset of the first message in each block. A binary search over these block offsets locates the right block, and a short sequential scan within the block finds the message. This is where Kafka’s “sparse indexing” comes from.
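The block-based lookup just described can be sketched in a few lines. `BLOCK_SIZE` and the message list are made up for illustration; Kafka’s real index entries map offsets to byte positions and blocks are spaced by bytes, not message counts:

```python
import bisect

BLOCK_SIZE = 4  # messages per indexed block (simplified)

messages = [f"msg-{i}".encode() for i in range(10)]  # offsets 0..9

# Sparse index: only the first offset of each block is recorded.
index = list(range(0, len(messages), BLOCK_SIZE))    # [0, 4, 8]

def find(offset):
    # Binary search: rightmost indexed base offset <= target offset.
    block = bisect.bisect_right(index, offset) - 1
    base = index[block]
    # Sequential scan within the located block.
    for off in range(base, min(base + BLOCK_SIZE, len(messages))):
        if off == offset:
            return messages[off]
    return None

print(find(6))   # → b'msg-6'
```

The memory footprint now grows with the number of blocks rather than the number of messages, at the cost of a bounded sequential scan per lookup: the classic sparse-index trade-off.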



Figure 3: Sparse index diagram for Kafka

Finally, we find that append-only writes plus a sparse index form Kafka’s final storage scheme. Isn’t that exactly the idea behind the LSM tree?

One might argue that Kafka differs from the LSM tree in that it uses neither the tree index nor a MemTable layer. Personally, though, I think that from a design point of view Kafka can be seen as an extreme application of the LSM tree. For more on append-only data structures and LSM trees, I recommend the talk Ben Stopford (a technologist at Confluent, the company behind Kafka) gave at QCon 2017; it is excellent and worth watching.

Kafka’s storage design

After understanding the ins and outs of Kafka storage selection, we finally take a look at its specific storage structure.



Figure 4: Kafka’s storage structure

Kafka’s storage is organized as a “Topic + Partition + Segment + index” structure:

  • 1. Each Topic is divided into multiple partitions, and a partition can be physically understood as a folder. As explained in the previous article, partitioning solves the horizontal scalability problem of Kafka storage: if all messages of a Topic lived on a single Broker, that Broker would inevitably become a bottleneck. Splitting a Topic’s data into partitions and distributing them across the cluster is the natural design.
  • 2. Each partition is further divided into multiple segments, where a segment can be physically understood as a “data file + index file” pair. Why are segments needed below the partition level? Without segments, a partition would correspond to a single ever-growing file, which would eventually become too large to search and maintain conveniently. Deleting historical messages would also require truncating the front of the file. With segments, old data is removed simply by deleting old segment files, and the active segment is always written sequentially.
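Conceptually, locating the segment that holds a given offset is itself a binary search, because each segment file is named after the offset of its first message (its base offset, zero-padded in the file name). A sketch with illustrative base offsets:

```python
import bisect

# One entry per segment file in a partition directory; the values are
# each segment's base offset (the offset of its first message).
segment_base_offsets = [0, 1000, 2000, 3000]

def segment_for(offset):
    # Rightmost segment whose base offset <= the requested offset.
    i = bisect.bisect_right(segment_base_offsets, offset) - 1
    base = segment_base_offsets[i]
    return f"{base:020d}.log"   # Kafka-style zero-padded file name

print(segment_for(1500))   # → 00000000000000001000.log
```

Once the segment is chosen, the sparse index inside that segment takes over, so a full lookup is two cheap binary searches plus a short scan.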

Closing thoughts

This article walked through Kafka’s choice of a logging storage scheme step by step: requirements analysis, selection comparison, and the concrete storage structure. I hope you will actively think about Kafka’s storage trade-offs as a system design problem, rather than merely remembering that it uses log storage. One more thought: the lower the level, the more universal the knowledge, and every time you dig a little deeper you will find principles shared by many excellent open source systems.