Why is Kafka so fast? | How Kafka reads and writes data efficiently

Whether it serves as a message queue (MQ) or as a storage layer, Kafka has two simple functions: a Producer stores data in the broker, and a Consumer reads data from the broker. Kafka is fast at both reading and writing. Let’s talk about why.

1. Implement parallel processing using Partition

We all know that Kafka is a pub-sub messaging system, and whether you publish or subscribe, you specify a Topic.

Topic is just a logical concept. Each Topic contains one or more partitions, which can be located on different nodes.

On the one hand, since different partitions can be located on different machines, clustering can be fully utilized to achieve parallel processing between machines. On the other hand, partitions correspond to a folder physically. Even if multiple partitions reside on the same node, you can configure different partitions on the same node to reside on different disks. In this way, parallel processing between disks can be implemented to make full use of the advantages of multiple disks.

With parallel processing, speed certainly improves: multiple workers are faster than a single worker.
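As a rough illustration of how records spread across partitions, here is a minimal sketch that maps a record key to a partition number. The class and method names are hypothetical, and the hash is a stand-in: Kafka's own default partitioner uses a murmur2 hash, not the one shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not Kafka's real partitioner (which uses murmur2 hashing).
final class PartitionSketch {
    // Map a record key to one of numPartitions partitions.
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"order-1", "order-2", "order-3"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }
    }
}
```

Records with the same key always land in the same partition, while different keys spread across partitions that may live on different nodes or disks.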

Can you write to different disks in parallel? Can the disk read and write speed be controlled?

Let’s start with disk I/O.

What are the constraints on disk performance? How to design a system based on disk I/O features?

The main internal components of a hard disk are the platters, the actuator arm, the read/write heads, and the spindle motor. The actual data is stored on the platters, and reading and writing are performed by the read/write heads mounted on the actuator arm. In operation, the spindle motor rotates the platters, and the arm swings across them so that the heads can read and write. The following figure shows the physical structure of the disk:

Because the capacity of a single platter is limited, a hard disk generally has two or more platters. Each platter has two surfaces, both of which can record information, so each platter corresponds to two heads. Each surface is divided into a number of fan-shaped areas, each called a sector. Concentric circles of different radii, centered on the center of the platter, are called tracks, and the set of tracks with the same radius across all platters is called a cylinder. Since a track and a cylinder both identify a given radius, the two terms are often used interchangeably. A top-down view of a platter is shown below:

Image credit: Commons.wikimedia.org

The key factor affecting the disk is disk service time, which is the time it takes for the disk to complete an I/O request. It consists of three parts: seek time, rotation delay, and data transfer time.

Mechanical hard disks have good sequential read/write performance but poor random read/write performance. This is mainly because it takes time for the head to move to the correct track: in random reads and writes the head is constantly moving, and time is wasted on seeking. The major indicators used to measure disks are IOPS and throughput.
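To make the service-time breakdown concrete, the sketch below adds up seek time, rotational delay, and transfer time for a single random request and derives a rough IOPS figure. All input numbers (9 ms seek, 7200 RPM, a 4 KiB request, 150 MB/s transfer rate) are assumptions chosen for illustration, not measurements.

```java
// Illustrative arithmetic only; every input figure is an assumption.
final class DiskServiceTime {
    // Average rotational delay is half a revolution, in milliseconds.
    static double rotationalDelayMs(double rpm) {
        return 0.5 * 60_000.0 / rpm;
    }

    // Time to transfer `bytes` at `bytesPerSecond`, in milliseconds.
    static double transferMs(double bytes, double bytesPerSecond) {
        return bytes / bytesPerSecond * 1000.0;
    }

    // Service time = seek time + rotational delay + data transfer time.
    static double serviceTimeMs(double seekMs, double rpm, double bytes, double bytesPerSecond) {
        return seekMs + rotationalDelayMs(rpm) + transferMs(bytes, bytesPerSecond);
    }

    public static void main(String[] args) {
        double ms = serviceTimeMs(9.0, 7200, 4096, 150e6);
        System.out.printf("service time: %.2f ms, random IOPS: %.0f%n", ms, 1000.0 / ms);
    }
}
```

With these assumed figures, seek time and rotational delay dominate and the transfer itself is almost negligible, which is why avoiding head movement matters so much.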

Many open source frameworks, such as Kafka and HBase, use append-only writes to convert random I/O into sequential I/O as much as possible, which reduces seek time and rotational delay and maximizes IOPS.

For those of you who are interested, take a look at disk I/O

The speed of disk reads and writes depends on how you use it, that is, sequential or random reads and writes.

2. Sequential disk writes

Image credit: kafka.apache.org

In Kafka, each partition is an ordered, immutable sequence of messages. New messages are continuously appended to the end of the partition.
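The append-only idea can be sketched as follows. This is a simplified illustration, not Kafka's log implementation: the class name, the length-prefixed record layout, and the in-memory offset counter are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative append-only log; the class and its record layout are hypothetical.
final class AppendOnlyLog implements AutoCloseable {
    private final FileChannel channel;
    private long nextOffset = 0; // logical offset of the next message

    AppendOnlyLog(Path file) throws IOException {
        // APPEND makes every write land at the current end of the file.
        channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    // Appends one length-prefixed message and returns its logical offset.
    long append(String message) throws IOException {
        byte[] payload = message.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length).put(payload).flip();
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        return nextOffset++;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("partition-0-", ".log");
        try (AppendOnlyLog log = new AppendOnlyLog(file)) {
            System.out.println("offset " + log.append("first"));
            System.out.println("offset " + log.append("second"));
        }
        System.out.println("bytes on disk: " + Files.size(file));
        Files.delete(file);
    }
}
```

Existing bytes are never rewritten, so the disk only ever sees writes at the tail of the file.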

A benchmark published a long time ago reported about 2 million writes per second on three cheap machines: http://ifeve.com/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines/

Because disk capacity is limited, it is impossible to store all data, and in fact Kafka as a messaging system does not need to store all data. Old data needs to be deleted. Instead of modifying a file in place with read-write operations, Kafka divides each Partition into multiple Segments, each corresponding to a physical file, and deletes data from a Partition by deleting entire Segment files. This removes old data while avoiding random writes to files.
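Deleting whole segment files can be sketched like this. The example is a simplification under stated assumptions: the class name is hypothetical, segments are tracked in memory by base offset, and retention is a plain "keep at most N segments" rule, whereas real retention is typically driven by time or size limits.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative retention sketch; names and the count-based policy are assumptions.
final class SegmentRetention {
    // Segments keyed by base offset, so the first entry is always the oldest.
    private final TreeMap<Long, Path> segments = new TreeMap<>();

    void addSegment(long baseOffset, Path file) {
        segments.put(baseOffset, file);
    }

    // Deletes the oldest segment files until at most maxSegments remain.
    List<Long> enforce(int maxSegments) throws IOException {
        List<Long> deleted = new ArrayList<>();
        while (segments.size() > maxSegments) {
            Map.Entry<Long, Path> oldest = segments.pollFirstEntry();
            Files.deleteIfExists(oldest.getValue()); // drop the whole file, no in-place edits
            deleted.add(oldest.getKey());
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("partition-0-");
        SegmentRetention retention = new SegmentRetention();
        for (long base : new long[] {0, 1000, 2000}) {
            retention.addSegment(base, Files.createFile(dir.resolve(base + ".log")));
        }
        System.out.println("deleted base offsets: " + retention.enforce(2));
    }
}
```

Removing a file is a metadata operation for the file system, so old data disappears without any random write into the middle of a log.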

3. Make full use of Page Cache

The Cache layer is introduced to improve the performance of the Linux operating system on disk access. The Cache layer caches part of the data on disk in memory. When a request for data arrives, if the data exists in the Cache and is up-to-date, the data is directly passed to the user program, eliminating operations on the underlying disks and improving performance. The Cache layer is one of the main reasons why disk IOPS exceeds 200.

In Linux, the file Cache is divided into two layers, the Page Cache and the Buffer Cache, and each page in the Page Cache contains several buffers. The Page Cache caches file data at the file-system level, especially when a process performs read/write operations on a file. The Buffer Cache caches block data when the system reads from or writes to block devices.

Advantages of using Page Cache:

  • The I/O Scheduler assembles sequential small blocks of writes into large physical writes to improve performance

  • The I/O Scheduler tries to reorder some writes to reduce disk head movement time

  • Make full use of all free memory (non-JVM memory). If you use application-layer Cache (that is, JVM heap memory), you add to the GC burden

  • Read operations can be performed directly in the Page Cache. If the consumption and production rates are comparable, data does not even need to be exchanged through physical disks (directly through the Page Cache)

  • If the process is restarted, the Cache in the JVM is invalidated, but the Page Cache is still available

When the Broker receives data, it only writes the data to the Page Cache; there is no guarantee that the data has been written to disk. From this point of view, if the machine goes down before the data in the Page Cache has been written to disk, data can be lost. However, this loss only occurs when the operating system itself stops working, for example because of a power outage, and this case is covered by the Replication mechanism at the Kafka level. Forcing data from the Page Cache to disk to rule out loss in this case would degrade performance. Because of this, Kafka does not recommend using the flush.messages and flush.ms parameters to forcibly flush data from the Page Cache to disk.
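The trade-off between leaving data in the Page Cache and forcing it to disk can be seen with plain Java NIO. This is only a sketch with hypothetical names; it shows the difference between a write that returns once the bytes are in the Page Cache and one followed by FileChannel.force, and it does not reproduce how the broker itself schedules flushes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only; the class and method names are hypothetical.
final class FlushSketch {
    // Appends the payload and returns the file size; syncs to the device only on request.
    static long append(Path file, byte[] payload, boolean forceToDisk) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(payload);
            while (buf.hasRemaining()) {
                ch.write(buf); // returns once the bytes are in the Page Cache
            }
            if (forceToDisk) {
                ch.force(true); // blocks until data and metadata reach the storage device
            }
            return ch.size();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("flush-", ".log");
        System.out.println(append(file, new byte[] {1, 2, 3}, false)); // fast path
        System.out.println(append(file, new byte[] {4, 5, 6}, true));  // durable but slower
        Files.delete(file);
    }
}
```

Both calls leave the same bytes in the file; the second simply waits for the device, which is the cost that forced flushing adds to every write.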

4. Zero copy technology

In Kafka, large amounts of network data are persisted from the Producer to the Broker, and large amounts of disk data are sent from the Broker to the Consumer over the network. The performance of these two paths directly affects Kafka’s overall throughput.

The core of the operating system is the kernel, which is independent of ordinary applications and has access to the protected memory space as well as access to the underlying hardware devices.

To prevent User processes from directly operating the Kernel and ensure Kernel security, the operating system divides the virtual memory into two parts: kernel-space and user-space.

In traditional Linux systems, standard I/O interfaces (such as read and write) are based on data copy operations. That is, I/O operations cause data to be copied between the buffer of the kernel address space and the buffer of the user address space. Therefore, standard I/O is also called cache I/O. This has the advantage of reducing the actual I/O operations if the requested data is already in the kernel’s cache, but has the disadvantage of causing CPU overhead in the process of copying the data.

We simplify the production and consumption of Kafka into two processes:

  1. Persisting network data from Producer to Broker
  2. Sending disk files over the network (Broker to Consumer)

4.1 Persisting Network Data to Disks (Producer to Broker)

In traditional mode, transferring data from the network to a file requires four data copies, four context switches, and two system calls.

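byte[] data = socket.getInputStream().readAllBytes(); // Read network data into a user-space buffer
try (FileOutputStream file = new FileOutputStream(path)) { // path: destination file (placeholder)
    file.write(data);    // Persist: copy the buffer back into the kernel
    file.getFD().sync(); // Force the data to disk
}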

This process actually involves four data copies:

  1. The network data is first copied into the kernel-space Socket Buffer via DMA copy
  2. The application then reads the kernel-space Buffer into a user-space Buffer (CPU copy)
  3. The user program then copies the user-space Buffer back into kernel space (CPU copy)
  4. Finally, the data is copied to the disk file via DMA copy

DMA (Direct Memory Access) is a hardware mechanism that allows two-way data transfer between peripherals and system memory without the CPU’s involvement. Using DMA can greatly improve system throughput by freeing the CPU from the actual I/O data transfer.

This is accompanied by four context switches, as shown in the figure below

Writing data to disk is usually not done in real time, and Kafka’s persistence of producer data is no exception. Instead of writing data to disk in real time, Kafka takes advantage of the paged memory management of modern operating systems and uses memory to improve I/O efficiency: this is the Page Cache described in the previous section.

For Kafka, the data produced by the Producer is stored on the broker. The broker reads the network data from the socket buffer, and the data can in fact be persisted directly from kernel space, with no need to copy it from the socket buffer into the application process’s buffer. The application process here is the broker itself, which receives the producer’s data and persists it.

In this special scenario, the application process receives network data from the socket buffer and persists it directly, without any intermediate processing, so mmap (memory-mapped files) can be used.

The purpose of mmap is to map the kernel’s read buffer into a user buffer in user space. In this way, the kernel buffer is shared with application memory, eliminating the need to copy data from the kernel read buffer to the user buffer. It works by using the operating system’s pages to map a file directly onto physical memory. Once the mapping is established, writes to that memory are synchronized to the hard disk by the operating system at an appropriate time.

This approach gives a big I/O boost by eliminating the overhead of copying between user space and kernel space.
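In Java, a memory mapping is obtained through FileChannel.map, which returns a MappedByteBuffer. The sketch below writes through such a mapping; the class and method names are hypothetical, and the explicit force() call is included only to show where a program can ask for dirty pages to be written back.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only; the class and method names are hypothetical.
final class MmapSketch {
    // Maps the first `size` bytes of the file and writes the payload through the mapping.
    static void writeMapped(Path file, byte[] payload, int size) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping READ_WRITE beyond the current end grows the file to `size` bytes.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            map.put(payload); // a plain memory write, no write() system call
            map.force();      // optional: ask the OS to write the dirty pages back now
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mmap-", ".idx");
        file.toFile().deleteOnExit();
        writeMapped(file, "hello".getBytes(StandardCharsets.UTF_8), 16);
        byte[] onDisk = Files.readAllBytes(file);
        System.out.println(new String(onDisk, 0, 5, StandardCharsets.UTF_8));
        System.out.println("file size: " + onDisk.length);
    }
}
```

Without the force() call the write would still become visible to other readers of the file through the Page Cache; only its arrival on the physical disk would be left to the operating system.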

Mmap also has an obvious drawback: it is unreliable. Data written through mmap is not immediately on disk; it reaches the disk only when the operating system writes back the dirty pages or the program explicitly calls flush. Kafka’s legacy producer provides a parameter, producer.type, with the values sync and async, and the default is sync. Strictly speaking, this parameter controls whether the producer sends messages synchronously or hands them to a background thread, not whether the broker flushes; broker-side flushing is governed by the flush parameters described in the Page Cache section.

Zero-copy technology means that the CPU does not have to copy data from one memory region to another when performing an operation on the computer, thus reducing context switching and CPU copy time.

Its purpose is to reduce the number of data copies and system calls while data moves between network devices and user program space, so that the CPU does not take part in the copying and its load in this respect is eliminated.

There are three main types of zero-copy technology:

  • Direct I/O: Data passes directly between the user’s address space and the I/O device, bypassing the kernel buffers. The kernel only performs necessary auxiliary tasks such as virtual memory configuration.
  • Avoiding data copies between kernel and user space: when the application does not need to access the data, copying it from kernel space to user space can be avoided. Examples:
      • mmap
      • sendfile
      • splice && tee
      • sockmap
  • Copy on Write: data is not copied in advance; it is copied, and only partially, when it needs to be modified.

4.2 Disk Files sent over the Network (Broker to Consumer)

The traditional implementation reads the disk first and then sends the data through the socket, which actually involves four copies:

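byte[] buffer = Files.readAllBytes(path); // Read the file into a user-space buffer (path is a placeholder)
socket.getOutputStream().write(buffer);   // Copy the buffer into the kernel socket buffer and send it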

As with the produce path above, this process can be broken down as follows:

  1. The file data is first read into a kernel-space Buffer via a system call (DMA copy)
  2. The application then reads the kernel-space Buffer into a user-space Buffer (CPU copy)
  3. When sending data over the Socket, the user program copies the user-space Buffer into a kernel-space Buffer (CPU copy)
  4. Finally, the data is copied to the NIC Buffer via DMA copy

The Linux 2.4+ kernel provides zero copy via the sendfile system call. After the data is DMA copied to the kernel Buffer, it is DMA copied directly to the NIC Buffer without any CPU copy. This is where the term zero copy comes from. In addition to reducing data copying, performance improves greatly because the entire read-file-and-send-over-network operation is done by a single sendfile call with only two context switches.

Kafka’s solution here is to use NIO’s transferTo/transferFrom, which calls the operating system’s sendfile to achieve zero copy. In total there are two kernel data copies, two context switches, and one system call, and the CPU data copies are eliminated.
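The transferTo call mentioned above can be tried with a few lines of Java NIO. The sketch below is not Kafka's code: the class and method names are hypothetical, and the demo writes into an in-memory channel so that it runs anywhere. Whether the JVM actually uses sendfile depends on the platform and on the target being a socket channel; for other targets it falls back to an ordinary copy, but the calling code stays the same.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only; the class and method names are hypothetical.
final class TransferToSketch {
    // Sends the whole file to `target` without reading it into an application byte[].
    static long sendFile(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long position = 0;
            long size = source.size();
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                position += source.transferTo(position, size - position, target);
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment-", ".log");
        Files.write(file, "hello kafka".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for a socket
        long sent = sendFile(file, Channels.newChannel(sink));
        System.out.println(sent + " bytes sent: " + sink.toString("UTF-8"));
        Files.delete(file);
    }
}
```

In a real broker-to-consumer path the target would be a SocketChannel, which is the case where the kernel can move data from the Page Cache to the NIC without passing through user space.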

5. Batching

In many cases, the bottleneck is not CPU or disk, but network IO.

Thus, in addition to the low-level batching provided by the operating system, Kafka’s clients and brokers accumulate multiple records in a batch, for both reads and writes, before sending data over the network. Record batching amortizes the cost of network round trips, uses larger packets, and improves bandwidth utilization.
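A minimal sketch of record batching is shown below. The accumulator class is hypothetical and flushes only when a record count is reached; Kafka's real producer batches by bytes and also flushes after a configurable delay (linger.ms), which this example leaves out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical accumulator; the count-based trigger is a simplification.
final class BatchAccumulator {
    private final int batchSize;
    private final Consumer<List<String>> sender; // stands in for one network request
    private List<String> current = new ArrayList<>();

    BatchAccumulator(int batchSize, Consumer<List<String>> sender) {
        this.batchSize = batchSize;
        this.sender = sender;
    }

    // Buffers the record and sends the batch once it is full.
    void add(String record) {
        current.add(record);
        if (current.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever has accumulated so far as one request.
    void flush() {
        if (current.isEmpty()) {
            return;
        }
        sender.accept(current);
        current = new ArrayList<>();
    }

    public static void main(String[] args) {
        BatchAccumulator acc = new BatchAccumulator(2,
                batch -> System.out.println("sending " + batch));
        for (int i = 0; i < 5; i++) {
            acc.add("record-" + i);
        }
        acc.flush(); // send the partial final batch
    }
}
```

Five records go out in three requests instead of five, and the saving grows with the batch size.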

6. Data compression

Producer compresses data and sends it to the broker to reduce the cost of network transmission. Currently, the supported compression algorithms include Snappy, Gzip, and LZ4. Data compression is often used in conjunction with batch processing as an optimization tool.
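The effect of compressing a whole batch can be illustrated with the JDK's built-in GZIP streams. This is a sketch with hypothetical names and made-up sample records; in Kafka itself compression is selected through producer configuration, and the codec is applied by the client library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative only; the class name and sample records are hypothetical.
final class BatchCompression {
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        // A batch of similar records compresses far better than each record alone.
        StringBuilder batch = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            batch.append("{\"user\":\"u").append(i).append("\",\"event\":\"click\"}\n");
        }
        byte[] raw = batch.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println("raw bytes: " + raw.length + ", compressed bytes: " + packed.length);
    }
}
```

Because records in one batch tend to share field names and structure, compressing the batch as a unit is what makes the two optimizations work well together.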

Summary | The next time an interviewer asks me why Kafka is fast, this is what I will say

  • Partition parallel processing
  • Write data to disks in sequence to make full use of disk features
  • Make full use of the Page Cache provided by the paged memory management of modern operating systems to improve I/O efficiency
  • Zero copy technology is used
  • Data produced by the Producer is persisted to the broker using mmap file mapping to achieve fast sequential writes
  • The Consumer reads data from the broker using sendfile: the disk file is read into the OS kernel buffer and then sent to the NIC buffer for network transmission, reducing CPU consumption