Kafka Production Environment Practice and Tuning: Advanced Series

  • Kafka Production Environment in Action – Kafka production environment planning
  • Kafka Production Environment in Action – Kafka producer and consumer throughput tests
  • Kafka Production Environment in Action – Kafka producer parameter settings and tuning suggestions

This series of posts summarizes and shares cases drawn from real production environments, and gives advice on tuning Kafka applications and planning cluster capacity. Please continue to follow this series. The copyright of this Kafka tuning series belongs to the author (Qin Kaixin); reprinting is prohibited, but you are welcome to learn from it.

Kafka production environment deployment plan

1. Operating system selection

The Kafka server code is written in Scala, so Kafka is a JVM-based big data framework. The three operating systems it is most commonly deployed on are Linux, OS X, and Windows, with Linux being by far the most common. Why? Two reasons: the I/O model, and the efficiency of network data transfer.

  • I/O model: Kafka’s new clients use Java NIO selectors for the underlying network library, which on Linux is implemented with epoll. Kafka runs more efficiently on Linux because epoll replaces polling with an event-notification (callback) mechanism, avoiding wasted CPU time when the number of underlying sockets is large.
  • Network transfer efficiency: Kafka moves data between disk and network. Most of this goes through Java’s FileChannel.transferTo method, which on Linux calls sendfile, i.e. zero copy: redundant copying of data between kernel address space and user space is avoided.
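Kafka itself does this in Java/Scala via FileChannel.transferTo; purely as an illustration of the same zero-copy primitive, Python exposes it as os.sendfile. A minimal Linux-specific sketch (the function name and paths are my own, not Kafka's):

```python
import os

def zero_copy_send(src_path: str, dst_fd: int) -> int:
    """Push a file to a destination descriptor via sendfile().

    The kernel moves the bytes straight from the page cache to the
    destination descriptor; the data never enters user space.
    """
    sent_total = 0
    with open(src_path, "rb") as src:
        size = os.fstat(src.fileno()).st_size
        while sent_total < size:
            sent = os.sendfile(dst_fd, src.fileno(), sent_total,
                               size - sent_total)
            if sent == 0:          # destination closed early
                break
            sent_total += sent
    return sent_total
```

On Linux, os.sendfile maps to the same sendfile(2) system call that FileChannel.transferTo uses under the hood.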

2. Plan the disk type

  • Mechanical disks (HDDs) generally have millisecond seek times, and latency rises sharply under heavy random I/O. However, Kafka reads and writes sequentially, so HDD performance is not a weakness, and the lower cost makes them worth considering.
  • Solid-state drives (SSDs) offer excellent read and write speeds, but at a higher cost.
  • A JBOD (Just a Bunch Of Disks) setup is cost-effective and can be used when the data-safety requirements are not very high. You are advised to configure multiple log paths on each broker, each mounted on a different disk, to greatly improve concurrent log-write throughput.
  • A common RAID array is RAID 10, i.e. RAID 1+0, which combines disk mirroring with disk striping to protect data. Because of the mirroring, usable capacity is only 50%. What is the downside? If the Kafka replication factor is set to 3, the data is actually stored 6 times over, which is very poor utilization. For this reason, LinkedIn has been moving from RAID to JBOD.
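For the JBOD option, the multiple log paths are set via the broker's log.dirs property; a sketch in server.properties (the mount points here are placeholders for your own disks):

```properties
# One path per physical disk; Kafka spreads partitions across
# all of them, so log writes proceed in parallel.
log.dirs=/mnt/disk1/kafka-logs,/mnt/disk2/kafka-logs,/mnt/disk3/kafka-logs
```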

3. Plan the disk capacity

Our company’s IoT platform generates about 100 million messages per day. Suppose the replication factor is 2 (we actually set it to 3), the retention time is 1 week, and each reported event message averages about 1 KB. The total volume generated per day is then 100 million × 2 × 1 KB ÷ 1000 ÷ 1000 ≈ 200 GB. Reserving 10% extra disk space brings that to about 220 GB per day, or roughly 1.5 TB per week. With an average compression ratio of 0.5, the overall disk capacity needed is about 0.77 TB. The relevant factors are:

  • Number of new messages per day
  • Replication factor
  • Whether compression is enabled
  • Average message size
  • Message retention time
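Putting these factors together, the estimate above can be sketched as a small calculation (decimal units, 1 GB = 10^6 KB; the 10% headroom figure comes from the text):

```python
def disk_capacity_tb(msgs_per_day: int, replicas: int, msg_kb: float,
                     retention_days: int, compression_ratio: float,
                     headroom: float = 0.10) -> float:
    """Estimate total disk capacity in TB for a Kafka cluster."""
    daily_gb = msgs_per_day * replicas * msg_kb / 1_000_000  # KB -> GB
    daily_gb *= 1 + headroom                                 # spare space
    return daily_gb * retention_days * compression_ratio / 1000  # GB -> TB

# The article's scenario: 100M msgs/day, 2 replicas, ~1 KB each,
# 1 week retention, average compression ratio 0.5.
print(disk_capacity_tb(100_000_000, 2, 1, 7, 0.5))  # ≈ 0.77 TB
```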

4. Plan memory capacity

Kafka depends less on JVM heap memory and more on the operating system’s page cache. If a consumer read hits the page cache, no physical disk I/O is needed. Java heap objects in Kafka are generally short-lived and quickly garbage-collected, so heap usage rarely needs to exceed 6 GB. On a 16 GB machine, that leaves 10–14 GB for the file-system page cache.

  • If log segments are 10 GB, the page cache should be at least 10 GB.
  • Heap memory is best kept at 6 GB or less.
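The sizing rule above (assuming the 6 GB heap cap and the 10 GB log segment from the text) can be expressed as:

```python
def page_cache_gb(total_ram_gb: int, heap_gb: int = 6,
                  min_segment_gb: int = 10) -> int:
    """RAM left for the OS page cache after reserving the JVM heap."""
    cache = total_ram_gb - heap_gb
    # The cache should hold at least one full log segment.
    assert cache >= min_segment_gb, "not enough RAM for the page cache"
    return cache

print(page_cache_gb(16))  # a 16 GB machine leaves 10 GB of page cache
```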

5. Plan CPU selection

Kafka is not a computationally intensive system, so the number of CPU cores matters more than clock speed; choose a machine with at least eight cores.

6. Network bandwidth determines the number of brokers

Common bandwidths are 1 Gb/s and 10 Gb/s, i.e. gigabit and 10-gigabit networks. An example: our IoT system processes 1 TB of data every hour throughout the day, and we choose 1 Gb/s bandwidth. How many machines do we need?

  • Assume the network bandwidth is dedicated to Kafka and that 70% of it is allocated to the Kafka process, giving each broker about 700 Mb/s. To avoid saturating the NIC during traffic bursts, plan for only about a third of that, roughly 240 Mb/s per broker. Processing 1 TB of data per hour means about 292 MB per second; since 1 MB = 8 Mb, that is about 2,336 Mb per second, so at least 2336 / 240 ≈ 10 brokers are needed. With redundancy designed in, this can be finalized at 20 machines.
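A sketch of that broker-count arithmetic (decimal units here, so intermediate figures differ slightly from the text's 1024-based numbers, but the result is the same):

```python
import math

def brokers_needed(data_tb_per_hour: float, nic_gbps: float = 1.0,
                   kafka_share: float = 0.7,
                   burst_factor: float = 1 / 3) -> int:
    """Brokers needed so each NIC stays well under saturation."""
    # Usable bandwidth per broker, in Mb/s (~233 Mb/s for a 1 Gb/s NIC).
    per_broker_mbps = nic_gbps * 1000 * kafka_share * burst_factor
    # Required throughput: TB/h -> MB/s -> Mb/s.
    needed_mbps = data_tb_per_hour * 1_000_000 / 3600 * 8
    return math.ceil(needed_mbps / per_broker_mbps)

print(brokers_needed(1))  # -> 10 brokers before redundancy
```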

Typical recommended configuration

  • CPU: 32 cores
  • Memory: 32 GB
  • Disk: three 3 TB 7200 RPM SAS disks
  • Bandwidth: 1 Gb/s

Conclusion

Qin Kaixin, Shenzhen, October 27, 2018