The role of messaging systems
This should already be clear to most readers; take a mailbox as an example.
The messaging system is the "warehouse" in the figure above: it acts as a buffer in the middle of the process and provides decoupling.
To introduce a scenario: the log processing of China Mobile, China Unicom, and China Telecom is outsourced for big-data analysis. Suppose their log processing is now handed over to a system you build for user-profile analysis.
Given the role of the messaging system described above, we can see that it behaves like a cache, but it is not a true cache: the data is stored on disk rather than in memory.
Topic
Kafka borrowed from the design of databases: it introduces topics, which are similar to tables in a relational database.
Now, if I need China Mobile's data, I can simply subscribe to TopicA.
Kafka also has the concept of partitions. A topic contains multiple partitions, and each partition appears on a broker as a directory; the partitions of a topic are spread across different servers, in other words, across directories on different hosts. The messages themselves are stored in .log files inside those directories. Like partitions in a database, this design exists to improve performance.
Topic and partition are similar to table and region in HBase: a table is only a logical concept, while regions actually store the data and are distributed across servers. Likewise, in Kafka a topic is a logical concept, while partitions are the distributed storage units. This design is the basis for handling massive amounts of data. For example, if HDFS had no block design, a 100 TB file could only be placed on a single server, occupying it entirely; with blocks, large files can be distributed across different servers. Note:
1. A partition on its own is a single point of failure, so we set two replicas for each partition. Partition numbers start from 0.
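The idea of spreading a topic's messages across its partitions can be sketched as follows. This is a minimal illustration, not Kafka's real partitioner (the default partitioner hashes the key's bytes with murmur2); Python's built-in `hash` stands in for it, and the key name is made up:

```python
# Minimal sketch, not Kafka's real partitioner (which hashes the key's
# bytes with murmur2); Python's built-in hash stands in for it here.
def partition_for(key: str, num_partitions: int) -> int:
    return hash(key) % num_partitions

# Messages with the same key always land in the same partition,
# which preserves per-key ordering.
p = partition_for("mobile-user-42", 3)
assert p == partition_for("mobile-user-42", 3)
assert 0 <= p < 3
```

Because the mapping is deterministic, all messages for one key stay in one partition, which is what makes per-key ordering possible despite the topic being split across servers.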
Producer
The producer is the component that sends data into the messaging system.
Consumer
It’s the consumer who reads the data from Kafka
Message
The data we process in Kafka is called a message.
We create a topic named TopicA with three partitions stored on different servers, that is, under different brokers. Topic is a logical concept, so it cannot be drawn in the diagram as a concrete unit the way partitions can.
Note: Kafka had no replicas prior to version 0.8, so data could be lost if a server went down. Avoid using Kafka releases older than 0.8.
In Kafka, each partition can have multiple replicas to ensure data safety.
Here, we set three replicas each for partitions 0, 1, and 2.
In fact, the replicas play different roles: one replica is elected leader, and the rest are followers. When a producer sends data, it writes directly to the leader partition; the followers then synchronize the data from the leader, and consumers also consume from the leader.
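The leader/follower roles described above can be sketched in a few lines. This class is a simplified illustration, not Kafka's actual replication protocol (no ISR, no acks, no fetch requests):

```python
# Simplified sketch of leader/follower replicas (an illustration, not
# Kafka's actual replication protocol).
class Partition:
    def __init__(self, replicas):
        self.leader = replicas[0]        # one replica acts as leader
        self.followers = replicas[1:]    # the rest replicate from it
        self.log = []                    # the leader's log
        self.synced = {f: [] for f in self.followers}

    def produce(self, msg):
        self.log.append(msg)             # producers write to the leader only
        self._replicate()

    def _replicate(self):
        # followers pull and sync the leader's log (greatly simplified)
        for f in self.followers:
            self.synced[f] = list(self.log)

    def consume(self, offset):
        return self.log[offset]          # consumers also read from the leader

p0 = Partition(["broker0", "broker1", "broker2"])
p0.produce("hello")
```

The key point the sketch captures is that both the write path and the read path go through the leader; followers exist only to keep a synchronized copy for failover.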
Consumer Group
When we consume data, we specify a group.id in the code; this ID is the name of the consumer group.
Some messaging systems we are familiar with are designed so that once one consumer has consumed a message, no other consumer can consume it.
Kafka, however, does not work this way. For example, suppose ConsumerA is currently consuming TopicA's data. If ConsumerB, in the same group, also tries to consume TopicA, it cannot; but if we specify a different group.id for ConsumerC, then ConsumerC can consume TopicA's data.
ConsumerD, again in an existing group, likewise cannot consume it. So in Kafka, consumers in different groups can each consume the data of the same topic.
Consumer groups therefore exist to allow multiple consumers to consume messages in parallel, while consumers in the same group never consume the same message.
As shown below, ConsumerA, B, and C do not interfere with one another:
As shown in the figure, since consumers connect directly to leaders (as mentioned above), the three consumers consume from the three leaders respectively. A partition will not be consumed by multiple consumers within the same group; however, as long as the consumers are not saturated, one consumer can consume data from multiple partitions.
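The assignment rule above — one partition per consumer within a group, possibly several partitions per consumer — can be sketched with a simple round-robin. Kafka's real assignors (range, round-robin, sticky) are richer, so treat this as an illustration:

```python
# Round-robin sketch of partition assignment within one consumer group;
# Kafka's real assignors (range, round-robin, sticky) are more elaborate.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        # each partition goes to exactly one consumer in the group
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

result = assign(["p0", "p1", "p2"], ["ConsumerA", "ConsumerB"])
# ConsumerA owns two partitions, ConsumerB one; no partition is shared
```

With two consumers and three partitions, one consumer necessarily owns two partitions; with four consumers, one would sit idle, which is why it rarely makes sense to have more consumers in a group than partitions.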
Controller
Roughly 95% of big-data distributed systems use a master-slave architecture, while a few are peer-to-peer, such as Elasticsearch. Kafka is also master-slave: the master node is called the Controller, and the remaining nodes are slaves. The Controller cooperates with ZooKeeper to manage the entire Kafka cluster. So how do Kafka and ZooKeeper work together?
Kafka relies heavily on a ZooKeeper cluster (so the earlier ZooKeeper article was somewhat useful). All brokers register with ZooKeeper at startup in order to elect a controller. The election process is simple and direct; no sophisticated algorithm is involved.

The controller listens on several directories in ZooKeeper. For example, there is a directory /brokers/ under which the other nodes register themselves, typically naming their znodes after their own IDs, e.g. /brokers/0, /brokers/1, /brokers/2; during registration, each node must expose its host name, port, and other information. The controller reads the data registered by the slave nodes (via the watch mechanism), generates the cluster's metadata, and distributes it to the other servers, so that every server is aware of the other members of the cluster.

To simulate a scenario: when we create a topic, Kafka writes the partition scheme into a directory; the controller, listening for the change, synchronizes that directory's metadata and pushes it down to the slave nodes, so the whole cluster learns the partition scheme. Each slave node then creates a directory in preparation for its partition replica. This is the management mechanism for the whole cluster.
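The "simple and direct" election can be pictured as a race to create the same node: the first broker to succeed becomes the controller. The `FakeZooKeeper` class below is a stand-in dictionary, not the real ZooKeeper client API:

```python
# A race to create the same node: the first broker to succeed becomes
# controller. FakeZooKeeper is a stand-in, not a real ZooKeeper client.
class FakeZooKeeper:
    def __init__(self):
        self.nodes = {}

    def create_ephemeral(self, path, broker_id):
        if path in self.nodes:
            return False              # another broker registered first
        self.nodes[path] = broker_id
        return True

zk = FakeZooKeeper()
# all brokers race to create /controller; exactly one wins
winners = [b for b in (0, 1, 2) if zk.create_ephemeral("/controller", b)]
```

In real ZooKeeper the node is ephemeral, so if the controller dies its session expires, the node disappears, and the remaining brokers race again — which is how failover works without any extra election algorithm.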
Bonus content
1. What makes Kafka's performance great?
1) Sequential writes
Each time the operating system reads or writes data on disk, it must first seek to the data's physical location before the actual read or write can happen; for mechanical disks, this addressing is time-consuming. Kafka is designed to store data on disk, even though, generally speaking, performance is best when data is kept in memory. However, sequential disk writes are very fast: with a sufficient number of disks spinning at a sufficient speed, sequential write throughput is basically comparable to memory. Random writes, which modify data at an arbitrary position in a file, perform poorly.

2) Zero copy
Let’s look at the non-zero copy case
Kafka uses Linux's sendfile mechanism (exposed through NIO), which avoids the extra context switches and the extra copy of the data, yielding better performance.
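On Linux, the same sendfile syscall Kafka relies on is exposed in Python as `os.sendfile`, so the zero-copy transfer can be demonstrated directly. This sketch copies between two regular files (which requires Linux); the file names are arbitrary:

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "source.log")
dst_path = os.path.join(tmpdir, "copy.log")

with open(src_path, "wb") as f:
    f.write(b"kafka message data")

src = os.open(src_path, os.O_RDONLY)
dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT)
# os.sendfile(out_fd, in_fd, offset, count): the kernel moves the bytes
# directly between the two descriptors, with no user-space buffer involved
sent = os.sendfile(dst, src, 0, os.path.getsize(src_path))
os.close(src)
os.close(dst)
```

In the non-zero-copy path the data would be read into a user-space buffer and written back out, costing two extra copies and two context switches per transfer; sendfile skips all of that.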
2. Segmented log storage
Kafka limits the size of a .log file in a partition to 1 GB; this limit makes it easier to load .log files into memory.
A number such as 9936472 in a file name represents the starting offset contained in that log segment file, which also tells us that nearly 10 million messages have been written to this partition.
A Kafka broker has a parameter, log.segment.bytes, that limits each log segment file to a maximum of 1 GB.
When a log segment file is full, a new one is automatically created for subsequent writes; this process is called log rolling, and the segment currently being written to is called the active log segment.
If you are familiar with HDFS, you will notice that the NameNode's edits log has a similar limit; these frameworks all take such issues into account.
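The naming rule above can be sketched in one function: each segment file is named after the base offset of its first message, zero-padded to 20 digits (the accompanying .index file follows the same rule):

```python
# Each segment file is named after the base offset of its first message,
# zero-padded to 20 digits.
def segment_name(base_offset: int) -> str:
    return f"{base_offset:020d}.log"

# the very first segment of a partition starts at offset 0
first = segment_name(0)         # "00000000000000000000.log"
# the example from the text: a segment starting near 10 million messages
later = segment_name(9936472)   # "00000000000009936472.log"
```

Because the names sort lexicographically by offset, the broker can binary-search the directory listing to find which segment holds a given offset.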
3.Kafka network design
Kafka's network design matters for Kafka tuning and is the reason it supports high concurrency.
The client first sends every request to an Acceptor. The broker has three threads (three by default) called processors. The Acceptor does not process client requests itself; it wraps each connection into a SocketChannel and hands the channels to the processors in round-robin order: the first to processor 1, the next to processor 2, then processor 3, and then back to processor 1. The processors put these channels into a queue; each item taken off the queue is a request, with its accompanying data.

A thread pool (eight threads by default) then handles these requests: it parses each request, writes to disk if it is a write request, or returns the result if it is a read. The processor reads the response data from the response queue and sends it back to the client.

This is Kafka's three-layer network architecture, so if we need to tune Kafka we can add more processors and more handler threads in the thread pool. The request and response queues act as a buffer, for the case where the processors generate requests faster than the handler threads can process them. It is an enhanced version of the Reactor network threading model.
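The three layers described above can be sketched with plain queues. This single-threaded toy keeps only the routing logic (round-robin from acceptor to processors, then a shared request queue); the real broker runs each piece on its own threads:

```python
from queue import Queue

NUM_PROCESSORS = 3       # the broker default mentioned above
processors = [Queue() for _ in range(NUM_PROCESSORS)]
request_queue = Queue()  # shared queue drained by the handler thread pool

def acceptor(connections):
    # hand the incoming "socket channels" to processors in round-robin order
    for i, conn in enumerate(connections):
        processors[i % NUM_PROCESSORS].put(conn)

def run_processor(q):
    # each processor turns its channels into requests on the shared queue
    while not q.empty():
        request_queue.put(("request", q.get()))

acceptor(["conn0", "conn1", "conn2", "conn3"])
for q in processors:
    run_processor(q)
# all four connections end up as requests awaiting the handler pool
```

Scaling the real broker means raising the number of processor threads and handler threads, which in this sketch corresponds to more processor queues and more workers draining `request_queue`.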