Getting to know Kafka in plain language
Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the user action streams of a consumer-scale website.
Kafka basics
The role of the message system
As most of you know, buying oil from a bulk dispenser is a familiar example: the seller fills a storage tank, and buyers draw from it at their own pace. The message system is that storage tank: it acts as a buffer in the middle of the process and decouples producers from consumers.
Consider a scenario. We know that log processing for China Mobile, China Unicom, and China Telecom is outsourced for big-data analysis. Suppose their logs are now handed to your system for user-profile analysis.
Following on from the message system above: a message system behaves like a cache, but it is not a true cache, because the data is still stored on disk rather than in memory.
Topic
Kafka borrowed a page from database design and introduced the topic, which is similar to a table in a relational database.
Now, if I need China Mobile's data, I can simply subscribe to the topic that carries it, say TopicA.
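As a concrete sketch, here is what subscribing to TopicA looks like with the Kafka Java consumer client. The topic name, group id, and broker address are illustrative assumptions, not anything fixed by Kafka itself:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MobileLogConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("group.id", "user-profile-analysis");   // assumed group name
        props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("TopicA")); // listen to TopicA
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}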
Partition
Kafka also has a concept called a partition. A partition is, at bottom, a directory on a server. A topic has multiple partitions, and those partitions live on different servers; in other words, different directories are created on different hosts. A partition's data is kept mainly in its .log files. This is similar to partitioning in a database, and the purpose is the same: better performance.
As for why performance improves, it's simple: multiple partitions mean multiple threads, and multiple threads working in parallel will certainly do much better than a single thread.
Topic and partition are similar to the concepts of table and region in HBase: a table is only a logical concept, while regions are what actually store the data. Likewise, a topic is a logical concept, and a partition is the distributed storage unit. This design is the foundation for handling massive amounts of data. For comparison, if HDFS had no block design, a 100 TB file could only be placed on a single server, immediately consuming it; once blocks are introduced, a large file can be stored across different servers.
Note:
A single partition is a single point of failure, so we configure a number of replicas for each partition. Partition numbering starts at 0.
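To tie partitions and replicas together, here is a minimal sketch using Kafka's Java AdminClient to create a topic with three partitions (numbered 0, 1, 2), each with two replicas. The topic name and broker address are assumptions for illustration:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicA {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions (numbered starting at 0), replication factor 2
            // so a single partition is no longer a single point of failure
            NewTopic topic = new NewTopic("TopicA", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}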
Producer
The producer sends data to the messaging system.
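A minimal producer sketch, again with an assumed broker address and using the TopicA name from earlier; the key shown here is only illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send one log line to TopicA; the key influences which partition it lands in
            producer.send(new ProducerRecord<>("TopicA", "china-mobile", "a log line"));
            producer.flush();
        }
    }
}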
Consumer
The consumer reads data out of Kafka.
Message
The data we process in Kafka is called a message.
Kafka’s cluster architecture
Suppose we create a topic, TopicA, whose three partitions are stored on different servers (brokers). Since a topic is only a logical concept, there is no unit for the topic itself that we can draw directly in the diagram.
Note: Kafka did not have replicas prior to version 0.8, so a server outage meant data loss; avoid Kafka versions older than this.
Replica
Partitions in Kafka can be configured with multiple replicas to keep data safe.
Here we give each of partitions 0, 1, and 2 three replicas (in practice, two replicas are usually enough).
In fact, each replica has its own role: one replica is elected leader and the others become followers. When a producer sends data, it writes directly to the leader partition, and the follower partitions then synchronize the data from the leader. When a consumer reads data, it likewise reads from the leader.
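One practical consequence of this leader/follower synchronization is the producer's acks setting, which controls how much replication we wait for before a write counts as done. A small sketch of the relevant configuration (the broker address is an assumption):

import java.util.Properties;

public class ProducerAcksConfig {
    public static Properties durableProducerProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        // acks=all: the leader confirms a write only after the in-sync
        // follower replicas have also copied it, trading latency for safety;
        // acks=1 would return as soon as the leader itself has the record
        props.setProperty("acks", "all");
        return props;
    }
}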
Consumer Group
When we consume data, we specify a group id in the code. This id names the consumer group, and a group id applies even if you never set one yourself.
conf.setProperty("group.id", "tellYourDream");
Many well-known messaging systems are designed so that once one consumer has consumed a message, no other consumer can consume it again. Kafka is not like that. For example, suppose consumerA consumes TopicA's data:
consumerA:
group.id = a
consumerB:
group.id = a
consumerC:
group.id = b
consumerD:
group.id = b
Can consumerB also consume TopicA's data? No: it shares group id a with consumerA, so it cannot. But if we specify a different group id for consumerC, then consumerC can consume TopicA's data, while consumerD, being in the same group as consumerC, again cannot. So in Kafka, consumers in different groups can each consume the same topic's data, while within one group only a single consumer gets each message.
The consumer group, then, exists to let multiple consumers consume messages in parallel without ever consuming the same message twice. In the group below, consumerA, B, and C do not interfere with one another:
consumer group:a
consumerA
consumerB
consumerC
As shown in the figure, and as mentioned above, consumers connect directly to the leaders, so the three consumers each consume from one of the three leader partitions. A partition is never consumed by more than one consumer within the same consumer group; but when there are fewer consumers than partitions, a single consumer can consume data from multiple partitions.
Controller
A rule of thumb: about 95% of big-data distributed systems use a master-slave architecture, while some, such as Elasticsearch, use a peer-to-peer architecture.
Kafka also has a master-slave architecture. The master node is called the controller and the rest are slave nodes. The controller manages the entire Kafka cluster in coordination with ZooKeeper.
How do Kafka and ZooKeeper work together
Kafka relies heavily on its ZooKeeper cluster. All brokers register with ZooKeeper at startup in order to elect a controller. The election is straightforward, with no special algorithm involved: in effect, whichever broker registers first becomes the controller.
The controller listens on several directories in ZooKeeper. For example, there is a directory /brokers/ids under which the other nodes register themselves by id, producing entries such as /brokers/ids/0, 1, 2.
When registering, each node exposes its host name, port number, and other information. The controller reads this data from the newly registered nodes (through the watch mechanism), builds up the cluster metadata, and then distributes it to the other servers so that every server can perceive the other members of the cluster.
Example scenario: when we create a topic, Kafka writes the partition scheme into a directory in ZooKeeper. The controller listens for this change, synchronizes the metadata in that directory, and pushes it down to its slave nodes so that the whole cluster learns of the partition scheme; the slave nodes then create their own directories and wait for the partition replicas to be created. This is the management mechanism of the entire cluster.
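To make the watch mechanism concrete, here is a hedged sketch using the plain ZooKeeper Java client to read the registered broker ids under /brokers/ids and watch for membership changes. This mimics what the controller does; it is not Kafka's internal code, and the connection string is an assumption:

import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class BrokerWatcher {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });

        Watcher membershipWatcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                System.out.println("cluster membership changed: " + event);
                // a real controller would re-read /brokers/ids here and
                // push updated metadata to the other brokers
            }
        };

        // each broker registers an ephemeral node /brokers/ids/<id> at startup
        List<String> brokerIds = zk.getChildren("/brokers/ids", membershipWatcher);
        System.out.println("live brokers: " + brokerIds);

        Thread.sleep(Long.MAX_VALUE); // keep the watch alive
    }
}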
Why is Kafka's performance so good?
Sequential writes
Each time the operating system reads or writes data on a disk, it must first address the data, that is, find its physical location on the disk, before the read or write can happen. On a mechanical disk, this addressing takes a long time.
Kafka is designed so that data is stored on disk. Generally, data performs better when kept in memory, but Kafka writes sequentially: data is appended to the end of the file. Sequential disk writes are extremely fast; for a given number of disks at a given rotation speed, they are basically on par with memory.
Random writes, by contrast, modify data at some specific position in the file, and their performance is poor.
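A toy illustration of the difference (not Kafka code; the file name is made up): an append always writes at the end of the file, while a random write first seeks to a position, which is exactly the addressing cost Kafka avoids.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class WritePatterns {
    public static void main(String[] args) throws IOException {
        Path log = Path.of("demo.log");

        // sequential write: every record goes to the end of the file,
        // so a mechanical disk's head barely has to move
        Files.write(log, "one more record\n".getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);

        // random write: seek to an arbitrary position first, paying the
        // addressing cost before any byte is written
        try (RandomAccessFile raf = new RandomAccessFile(log.toFile(), "rw")) {
            raf.seek(4);
            raf.write('X');
        }
    }
}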
Zero copy
Let’s look at the non-zero copy case first.
As can be seen, data is copied from the OS cache into the Kafka server process and then copied again into the socket cache; the whole process costs a lot of time. Kafka instead uses Linux's sendfile technology (exposed through NIO), which eliminates the process switches and one data copy, so performance is much better.
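In Java, the sendfile system call is exposed through FileChannel.transferTo, which is essentially what Kafka's zero-copy path builds on. A hedged sketch, with the file name and socket address as assumptions:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Path.of("00000000000000000000.log"),
                     StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("localhost", 9092))) {
            long position = 0;
            long remaining = file.size();
            while (remaining > 0) {
                // transferTo lets the kernel move bytes disk -> socket directly,
                // skipping the copy into user space
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}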
Segmented log storage
Kafka caps each .log file in a partition at one gigabyte; this limit exists to make it convenient to load .log files into memory. A partition directory looks like this:
00000000000000000000.index
00000000000000000000.log
00000000000000000000.timeindex
00000000000005367851.index
00000000000005367851.log
00000000000005367851.timeindex
00000000000009936472.index
00000000000009936472.log
00000000000009936472.timeindex
A number such as 9936472 is the start offset of that log segment file, which tells us nearly 10 million messages have been written to this partition. A Kafka broker parameter, log.segment.bytes, limits each log segment file to 1 GB. When a segment file fills up, a new one is automatically opened for writing. This process is called log rolling, and the segment file currently being written to is the active log segment.
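The base-offset naming makes locating a message cheap: the segment holding a given offset is the one with the greatest base offset that is still less than or equal to it. A small sketch using the segment names listed above (the target offset is made up):

import java.util.TreeMap;

public class SegmentLookup {
    public static void main(String[] args) {
        // base offset -> segment file name, from the directory listing above
        TreeMap<Long, String> segments = new TreeMap<>();
        segments.put(0L, "00000000000000000000.log");
        segments.put(5367851L, "00000000000005367851.log");
        segments.put(9936472L, "00000000000009936472.log");

        long target = 7000000L;
        // floorEntry finds the greatest base offset <= target
        System.out.println("offset " + target + " lives in "
                + segments.floorEntry(target).getValue());
        // prints: offset 7000000 lives in 00000000000005367851.log
    }
}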
If you read the previous two articles on HDFS, you will recall that the NameNode's edit log is also size-limited; these frameworks all take such issues into account.
Kafka's network design
Kafka's network design is closely tied to tuning Kafka, and it is why Kafka supports high concurrency.
Client requests first reach an Acceptor. The broker runs several Processor threads (three by default). The Acceptor does not process client requests itself; it wraps each connection in a SocketChannel and hands it to the Processors' queues in polling (round-robin) fashion: to the first Processor, then the second, then the third, then back to the first. When a consuming thread later reads from a SocketChannel, it obtains a Request, with the data attached.
By default, the handler thread pool holds 8 threads. These threads process the requests: they parse each request and, if it is a write request, write it to disk; if it is a read, they return the result.
The Processor then reads the response data from the response queue and sends it back to the client. This is Kafka's three-tier network architecture.
So if we need to scale Kafka up, we can increase the number of Processors and the number of handler threads in the thread pool. The request and response queues in the middle also act as buffers, covering the case where the Processors generate requests faster than the handler threads can process them.
So this is an enhanced version of the Reactor network threading model.
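A heavily simplified sketch of the Acceptor-plus-Processors idea follows. This illustrates the pattern only, not Kafka's actual networking code; the thread counts mirror the defaults mentioned above, and the port is an assumption:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MiniReactor {
    public static void main(String[] args) throws IOException {
        int numProcessors = 3; // Kafka's default number of Processor threads
        List<BlockingQueue<SocketChannel>> queues = new ArrayList<>();
        for (int i = 0; i < numProcessors; i++) {
            BlockingQueue<SocketChannel> queue = new LinkedBlockingQueue<>();
            queues.add(queue);
            int id = i;
            new Thread(() -> {
                try {
                    while (true) {
                        SocketChannel channel = queue.take();
                        // a real Processor would register the channel with a
                        // Selector, read requests off it, and enqueue them for
                        // the handler thread pool (8 threads by default)
                        System.out.println("processor " + id + " got " + channel);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "processor-" + id).start();
        }

        // the Acceptor: accept connections and hand them out round-robin
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9092)); // assumed port
        int next = 0;
        while (true) {
            SocketChannel channel = server.accept();
            queues.get(next).add(channel);
            next = (next + 1) % numProcessors;
        }
    }
}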