Preface

Whether Kafka serves as a message queue or as a storage layer, it has just two core functions: the Producer stores data to the Broker, and the Consumer reads data from the Broker. Kafka's speed therefore comes down to how efficiently it reads and writes. Let's walk through the reasons Kafka is fast.

1. Use Partition to implement parallel processing

As we all know, Kafka is a pub-sub messaging system, and both publishing and subscribing are done against a specified Topic.

A Topic is just a logical concept. Each Topic contains one or more Partitions, which can be located on different nodes.

On the one hand, because different Partitions can reside on different machines, Kafka can take full advantage of a cluster and process in parallel across machines. On the other hand, a Partition physically corresponds to a folder. Even when multiple Partitions reside on the same node, they can be configured to sit on different disks of that node, enabling parallel processing across disks and making full use of multiple drives.
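
As a quick illustration, here is a minimal sketch (broker address, topic name, and counts are placeholders, not recommendations) that uses Kafka's Java AdminClient to create a topic with several partitions; each broker then spreads its partition directories across the paths configured in its log.dirs setting:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 2: work is spread across brokers
            // (and across disks on each broker, via the broker's log.dirs setting)
            NewTopic topic = new NewTopic("demo-topic", 6, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```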

If you can do something in parallel, you can do it faster: multiple workers get through more work than a single one.

So, can we write to different disks in parallel? And can we do anything about disk read/write speed?

Let's start with disk I/O.

What constrains hard disk performance? And how should we design a system around disk I/O characteristics?

The main internal components of a hard disk are the platters, the actuator arm, the read/write heads, and the spindle motor. The actual data is written on the platters, and reading and writing are done by the heads on the actuator arm. In operation, the spindle motor rotates the platters while the actuator arm positions the heads over them to read and write. The figure below shows the physical structure of a disk:

Because a single platter has limited capacity, a typical hard disk has two or more platters. Each platter has two recordable surfaces, so each platter corresponds to two heads. A platter surface is divided into many fan-shaped regions, each called a sector. Concentric circles of different radii around the platter's center are called tracks, and the set of tracks of the same radius across the different platters forms a cylinder. Tracks and cylinders thus correspond to circles of different radii, and in many contexts the two terms are used interchangeably. The vertical view of the platters is shown in the figure below:

What are the key factors affecting disk performance?

The key factor affecting disks is disk service time, i.e., the time a disk takes to complete an I/O request. It consists of seek time, rotational latency, and data transfer time.

Mechanical hard disks have good sequential read/write performance but poor random read/write performance, mainly because of the time the head needs to move to the correct track. With random access the head moves constantly, and time is wasted on seeking. The key metrics for measuring disks are IOPS and throughput.
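
To see why seek and rotation dominate, a rough back-of-the-envelope calculation helps (assuming a typical 7,200 RPM drive with an average seek time of about 9 ms; both numbers are illustrative):

```
time per revolution        = 60,000 ms / 7,200 rev ≈ 8.33 ms
average rotational latency = 8.33 ms / 2           ≈ 4.17 ms  (half a revolution on average)
random service time        ≈ 9 ms seek + 4.17 ms   ≈ 13 ms    (ignoring transfer time)
random IOPS                ≈ 1,000 ms / 13 ms      ≈ 76
```

Sequential access pays the seek and rotation cost once and then streams, which is why it can be orders of magnitude faster than random access.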

Many open source frameworks, such as Kafka and HBase, therefore use an append-only design that turns random I/O into sequential I/O, reducing seek time and rotational latency and maximizing IOPS.

Those of you who are interested can read more about disk I/O on your own.

How fast a disk reads and writes depends on how you use it, that is, whether you read and write sequentially or randomly.

2. Write disks sequentially

Kafka’s Partition

Every Partition in Kafka is an ordered, immutable sequence of messages, and new messages are always appended to the end of the Partition. This is called a sequential write.

Since disk capacity is finite, it is impossible to keep all data forever, and in fact, as a messaging system, Kafka does not need to: old data must be deleted. Thanks to sequential writes, Kafka can apply its various retention policies by deleting files rather than modifying them in "read-write" fashion. A Partition is divided into multiple Segments, each corresponding to a physical file, and data is removed from a Partition by deleting whole Segment files. This cleans up old data while also avoiding random writes to files.
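
The idea can be sketched in a few lines of Java (a toy illustration, not Kafka's actual storage code; the file name merely mimics Kafka's convention of naming a segment after its base offset):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SegmentAppendDemo {
    public static void main(String[] args) throws Exception {
        Path segment = Path.of("00000000000000000000.log");
        try (FileChannel channel = FileChannel.open(segment,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            for (int offset = 0; offset < 3; offset++) {
                byte[] record = ("record-" + offset + "\n").getBytes(StandardCharsets.UTF_8);
                channel.write(ByteBuffer.wrap(record)); // append-only: the head never seeks backwards
            }
        }
        // Retention then reduces to deleting whole old segment files,
        // e.g. Files.deleteIfExists(oldSegmentPath) -- no random rewrites needed.
    }
}
```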

3. Make full use of Page Cache

Linux introduces a Cache layer to improve disk access performance. The Cache layer keeps some of the data on disk cached in memory. When a data request arrives, if the data exists in the Cache and is up to date, it is passed directly to the user program, skipping the underlying disk entirely and improving performance. The Cache layer is one of the main reasons observed disk IOPS can exceed 200.

In Linux, the file cache is divided into two layers: the Page Cache and the Buffer Cache, with each Page Cache containing several Buffer Caches. The Page Cache mainly caches file data on the file system, in particular when a process reads from or writes to files. The Buffer Cache is mainly designed for the system to cache block data when reading from or writing to block devices.

Benefits of using Page Cache:

  • The I/O Scheduler assembles contiguous small writes into larger physical writes, improving performance
  • The I/O Scheduler tries to reorder some writes to reduce disk head movement
  • All free memory (non-JVM memory) is put to use; an application-layer cache (i.e., JVM heap memory) would instead increase the GC burden
  • Reads can be served directly from the Page Cache; if consumption keeps pace with production, data need not even touch the physical disk (it is exchanged directly through the Page Cache)
  • If the process restarts, caches inside the JVM are lost, but the Page Cache remains available

When the Broker receives data, it only writes it into the Page Cache; there is no guarantee that the data has been fully written to disk. From this point of view, data sitting in the Page Cache could be lost if the machine goes down before it is flushed. However, such loss only occurs when the operating system itself stops working, for example on power failure, and that case can be fully handled at the Kafka level by Replication. Forcibly flushing the Page Cache to disk to guarantee durability in this case would only hurt performance. For this reason, Kafka provides two parameters, flush.messages and flush.ms, to force data from the Page Cache to disk, but Kafka recommends against using them.
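
For completeness, here is a hedged sketch of how those topic-level overrides could be set with the Java AdminClient (broker address, topic name, and values are placeholders; as noted above, Kafka advises against forcing flushes):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class ForceFlushConfigDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "demo-topic");
            // Force a flush every 10,000 messages or every second -- trades performance for safety
            admin.incrementalAlterConfigs(Collections.singletonMap(topic, Arrays.asList(
                    new AlterConfigOp(new ConfigEntry("flush.messages", "10000"), AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("flush.ms", "1000"), AlterConfigOp.OpType.SET)
            ))).all().get();
        }
    }
}
```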

4. Zero-copy technology

Kafka has a large amount of network data persisting to disks (Producer to Broker) and disk files sent over the network (Broker to Consumer). The performance of this process directly affects Kafka’s overall throughput.

The core of the operating system is the kernel, which is independent of ordinary applications and has access to protected memory space as well as access to underlying hardware devices.

To prevent user processes from directly manipulating the kernel and to keep the kernel secure, the operating system divides virtual memory into two parts: kernel space and user space.

In traditional Linux systems, standard I/O interfaces (such as read and write) are based on data copying. That is, an I/O operation causes data to be copied between a buffer in the kernel address space and a buffer in the user address space; for this reason, standard I/O is also called cached I/O. The advantage is that if the requested data is already in the kernel's cache, actual I/O is avoided; the disadvantage is the CPU overhead of the copying itself.

Kafka simplifies production and consumption

Let’s simplify the production and consumption of Kafka into the following two processes [2] :

  1. Persisting network data to disks (Producer to Broker)
  2. Disk files are sent over the network (Broker to Consumer)

4.1 Persisting network data to disks (Producer to Broker)

In traditional mode, transferring data from a network to a file requires four data copies, four context switches, and two system calls.

```
// Pseudo-code: receive from the network, persist to disk
data = socket.read()       // network data -> user-space buffer
File file = new File("...")
file.write(data)           // persist to disk
file.flush()
```

Four data copies actually take place in this process:

  1. Firstly, the network data is copied to the kernel Socket Buffer through DMA copy
  2. The application then reads the kernel Buffer into the user mode (CPU copy).
  3. The user program then copies the user Buffer to the kernel (CPU copy).
  4. Finally, the data is copied to a disk file via DMA copy

Direct Memory Access (DMA) is a hardware mechanism for two-way data transfer between a peripheral and system memory without CPU involvement. Using DMA frees the CPU from the actual I/O data transfer, greatly improving system throughput.

This is accompanied by four context switches, as shown in the figure below.

Flushing data to disk is usually not done in real time, and the same goes for Kafka's producer-side persistence. Instead of writing each piece of data to the hard disk immediately, Kafka takes full advantage of the modern operating system's paged storage to use memory and improve I/O efficiency; this is exactly the Page Cache discussed in the previous section.

In Kafka, when the Producer sends data to the Broker, the Broker reads the network data from the socket buffer and can persist it directly in kernel space; there is no need to read the network data from the socket buffer into an application-process buffer first. The application process here is the Broker itself, which receives the Producer's data only in order to persist it.

In this special scenario, where the application process persists the network data received in the socket buffer without any intermediate processing, mmap memory-mapped files can be used.

Memory Mapped Files: mmap, also called MMFile, maps the address of the kernel's read buffer into user space, so that the kernel buffer and application memory are shared and the copy from the kernel read buffer to the user buffer is eliminated. It works by using the operating system's pages to map a file directly onto physical memory. Once the mapping is done, your operations on that memory are synchronized to the hard disk by the OS.

Using this approach, you get a large I/O boost, since it saves the overhead of copying from user space to kernel space.
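
As a minimal Java sketch of the mechanism (file name and mapping size are placeholders, not Kafka code): NIO's FileChannel.map gives you a MappedByteBuffer whose writes land in the page cache, and force() is the explicit flush:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MmapWriteDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("mmap-demo.log", "rw");
             FileChannel channel = file.getChannel()) {
            // Map 4 KB of the file straight into memory: no read()/write() system
            // calls and no copy between a user buffer and the kernel buffer
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.put("hello, page cache".getBytes(StandardCharsets.UTF_8));
            // Until force() (or an OS-initiated writeback), the data lives only
            // in the page cache -- this is the reliability trade-off discussed below
            buf.force();
        }
    }
}
```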

The drawback of mmap

mmap also has one obvious pitfall: it is unreliable. Data written via mmap is not actually on the hard disk yet; the operating system only writes it to disk when the program calls flush. Kafka provides a parameter, producer.type, to control whether it flushes actively: if Kafka flushes immediately after writing via mmap and only then returns to the Producer, that is called sync; if it returns to the Producer right after writing via mmap without calling flush, that is called async. The default is sync.

Zero-copy is a technique in which the CPU does not need to copy data from one region of memory to another when a computer performs an operation, thus reducing context switching and CPU copy time.

Its role is to reduce the number of data copies and system calls as packets travel from the network device to user program space, achieving zero CPU participation in the copying and eliminating that load on the CPU entirely.

At present, there are three main types of zero-copy technique [3]:

  • Direct I/O: data is transferred directly between the user address space and the I/O device, bypassing the kernel, which performs only auxiliary work such as the necessary virtual memory configuration
  • Avoiding copies between kernel space and user space: when the application does not need to touch the data, the copy from kernel space to user space can be skipped entirely; the mechanisms in this family are:
      • mmap
      • sendfile
      • splice && tee
      • sockmap
  • Copy-on-write: data is not copied in advance but only at the moment it needs to be modified

4.2 Sending disk files over the network (Broker to Consumer)

In the traditional mode, the file is first read from disk and then sent through the socket; in fact, the data is copied four times:

```
// Pseudo-code: read a file, send it over the network
buffer = file.read()      // disk -> user-space buffer
socket.send(buffer)       // user-space buffer -> network
```

This process is analogous to the production path described above:

  1. First, the file data is read into a kernel-mode Buffer through a system call (DMA copy)
  2. The application then reads the data from the kernel Buffer into a user-mode Buffer (CPU copy)
  3. Next, when sending through the Socket, the user program copies the user Buffer into the kernel's Socket Buffer (CPU copy)
  4. Finally, the data is copied from the Socket Buffer to the NIC Buffer by DMA

Linux kernels 2.4 and later provide zero copy through the sendfile system call. After the data has been copied into the kernel Buffer by DMA, it is copied directly from there to the NIC Buffer by DMA, with no CPU copy in between; this is where the term zero copy comes from. Besides reducing data copies, the whole read-file-and-send operation completes within a single sendfile call, so the entire process involves only two context switches, greatly improving performance.

Kafka's solution here is to use NIO's transferTo/transferFrom, which call the operating system's sendfile, to achieve zero copy. In total there are two kernel data copies (both DMA), two context switches, and one system call, and the CPU data copies are eliminated.
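
A minimal sketch of the same mechanism in plain Java NIO (file name, host, and port are placeholders): FileChannel.transferTo delegates to sendfile(2) on Linux, so the file bytes never enter user space:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySendDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Path.of("segment-0.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long size = file.size();
            // Each call may transfer fewer bytes than requested, so loop until done
            while (position < size) {
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}
```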

5. Batching

In many cases, the system bottleneck is not CPU or disk, but network IO.

Thus, in addition to the low-level batching provided by the operating system, Kafka's clients and brokers accumulate multiple records into a batch (for both reads and writes) before sending data over the network. Batching records amortizes the overhead of network round trips across many records and uses larger packets, improving bandwidth utilization.

6. Data compression

The Producer can compress data before sending it to the Broker, reducing the cost of network transmission. The compression algorithms currently supported are Snappy, Gzip, and LZ4. Data compression is generally used together with batching as an optimization.
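
Both of these knobs live on the producer. A minimal sketch configuring batching and compression together (broker address, topic, and values are placeholders, not recommendations):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);       // accumulate up to 64 KB per partition
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);           // wait up to 10 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // compress whole batches, not single records
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("demo-topic", Integer.toString(i), "message-" + i));
            }
        }
    }
}
```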

Conclusion

So if an interviewer asks me again, "Why is Kafka so fast?", I would answer:

  • Partition parallel processing

  • Write sequentially to disks to take full advantage of disk features

  • Modern operating systems use Page Cache for paging storage to utilize memory for I/O efficiency

  • Zero copy technology is adopted

  • The data produced by the Producer is persisted to the Broker using mmap file mapping, achieving fast sequential writes

  • The Consumer reads data from the Broker using sendfile: disk files are read into the OS kernel buffer and then handed directly to the socket buffer for network sending, reducing CPU consumption
