Kafka is a distributed publish/subscribe message queue for real-time processing of big data.
1.2 Message queues 1.2.1 Traditional message queue mode vs. the new message queue mode
Take user registration as an example: in the traditional mode, the registration data is written directly to the database and success is returned to the user synchronously; in the new mode, the request is placed on a message queue, success is returned to the user immediately, and the downstream processing happens asynchronously.
1.2.2 Benefits of using message queues A. Decoupling
B. Recoverability
C. Buffering
D. Flexibility & peak processing capability
E. Asynchronous communication
1.2.3 Modes of message queues A. Point-to-point mode
The producer sends a message to the queue, and the consumer retrieves and consumes it from the queue. Once a message has been consumed it is no longer stored in the queue, so a consumer cannot consume a message that has already been consumed. A queue can have multiple consumers, but each message can be consumed by only one of them; to deliver a message to multiple consumers, the producer must send it multiple times.
B. Publish/subscribe mode (one-to-many; messages are not cleared after consumers consume them)
A producer publishes a message to a topic, and multiple consumers (subscribers) consume it. Unlike point-to-point, a message published to a topic is consumed by all subscribers. However, the data is only retained for a period of time (7 days by default), because Kafka is not a storage system. There are two delivery styles: in one, consumers actively pull messages rather than having producers push them to consumers, which is the model Kafka uses; in the other, producers actively push messages to consumers, similar to a WeChat official account.
1.3 Kafka's architecture
A broker is responsible for buffering messages. Topics are created within brokers, and each topic has the concepts of partitions and replicas.
A consumer group is responsible for processing messages, and consumers in the same consumer group cannot consume data from the same partition. Consumer groups exist mainly to increase consumption throughput: for example, where one consumer previously processed 100 records, two consumers can now split those 100 records between them. For this reason, the number of consumers in a group should not exceed the number of partitions; otherwise some consumers will have no partition to consume, which wastes resources.
Note: consumers in different consumer groups, however, can consume data from the same partition.
For Kafka to run as a cluster, the brokers only need to register with the same ZooKeeper ensemble, which also keeps track of message consumption progress, i.e. the offset.
Prior to version 0.9, offsets were stored in ZK.
From version 0.9 on, offsets are stored in Kafka itself: Kafka defines a dedicated topic for storing offsets.
Why the change? Mainly because frequent offset updates put heavy write pressure on ZooKeeper, and they also complicated Kafka itself.
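That dedicated topic is the internal topic __consumer_offsets. As a quick check, it appears in the output of the topic list command once consumers have committed offsets; a sketch, assuming ZooKeeper runs at es1:2181 as in the installation below:
bin/kafka-topics.sh --zookeeper es1:2181 --list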
Kafka installation A. Installing Kafka only requires unpacking the archive:
tar -zxvf kafka_2.11-2.1.1.tgz -C /usr/local/
B. View the configuration files
[root@es1 config]# pwd
/usr/local/kafka/config
[root@es1 config]# ll
total 84
-rw-r--r--. 1 root root  906 Feb  8  2019 connect-console-sink.properties
-rw-r--r--. 1 root root  909 Feb  8  2019 connect-console-source.properties
-rw-r--r--. 1 root root 5321 Feb  8  2019 connect-distributed.properties
-rw-r--r--. 1 root root  883 Feb  8  2019 connect-file-sink.properties
-rw-r--r--. 1 root root  881 Feb  8  2019 connect-file-source.properties
-rw-r--r--. 1 root root 1111 Feb  8  2019 connect-log4j.properties
-rw-r--r--. 1 root root 2262 Feb  8  2019 connect-standalone.properties
-rw-r--r--. 1 root root 1221 Feb  8  2019 consumer.properties
-rw-r--r--. 1 root root 4727 Feb  8  2019 log4j.properties
-rw-r--r--. 1 root root 1925 Feb  8  2019 producer.properties
-rw-r--r--. 1 root root 6865 Jan 16 22:00 server-1.properties
-rw-r--r--. 1 root root 6865 Jan 16 22:00 server-2.properties
-rw-r--r--. 1 root root 6873 Jan 16 03:57 server.properties
-rw-r--r--. 1 root root 1032 Feb  8  2019 tools-log4j.properties
-rw-r--r--. 1 root root 1169 Feb  8  2019 trogdor.conf
-rw-r--r--. 1 root root 1023 Feb  8  2019 zookeeper.properties
C. Modify the configuration file server.properties
D. Set broker.id; this is the unique identifier that distinguishes each node in the Kafka cluster
E. Set whether Kafka topics can be deleted (in older Kafka versions, topic deletion was disabled by default)
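As a sketch, the corresponding lines in server.properties might look like this (the broker id value 0 is illustrative; each node in the cluster needs a distinct one):
# unique identifier of this node in the Kafka cluster
broker.id=0
# allow topics to be deleted
delete.topic.enable=true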
Create a topic and specify the number of partitions and replicas
Replication-factor: indicates the number of replicas
Partitions: indicates the number of partitions
Topic: the topic name
If the current Kafka cluster has only three broker nodes, then the maximum replication-factor is 3; in the following example, setting replication-factor to 4 causes an error.
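A minimal sketch of the create command, assuming ZooKeeper is at es1:2181 and using first as the topic name (both are assumptions):
bin/kafka-topics.sh --create --zookeeper es1:2181 --replication-factor 3 --partitions 3 --topic first
Running the same command with --replication-factor 4 on a three-broker cluster fails with an InvalidReplicationFactorException.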
Note: by default, each consumer belongs to a different consumer group if no consumer group configuration file is specified.
C. Send a message; you can see that each consumer receives it.
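A sketch of this experiment, assuming a broker at es1:9092 and a topic named first: start two console consumers in separate terminals with no group configured (so each defaults to its own consumer group), then send a message from a console producer; both consumers receive it.
bin/kafka-console-consumer.sh --bootstrap-server es1:9092 --topic first
bin/kafka-console-consumer.sh --bootstrap-server es1:9092 --topic first
bin/kafka-console-producer.sh --broker-list es1:9092 --topic first
>hello kafka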
Messages in Kafka are classified by topic. Messages produced by producers and messages consumed by consumers are both topic-oriented.
Each partition has the concept of replicas.
Each partition corresponds to a log file, which stores the data produced by the producer. Data produced by the producer is continuously appended to the end of the log file, and every record has its own offset. Each consumer records in real time which offset it has consumed, so that after a failure it can resume consuming from the last position.
Kafka's offsets are ordered within a partition, but not across partitions; Kafka does not guarantee global ordering.
Since messages produced by the producer are constantly appended to the end of the log file, Kafka adopts a sharding and indexing mechanism to prevent the log file from becoming so large that locating data is inefficient: each partition is divided into multiple segments, and each segment corresponds to two files, an .index file and a .log file. Both files sit in the same folder, whose name follows the rule topic name + partition number.
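A sketch of what the directory for partition 0 of a topic named first might contain (the file names are each segment's base offset; the numbers are illustrative):
first-0/
    00000000000000000000.index
    00000000000000000000.log
    00000000000000170410.index
    00000000000000170410.log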
How does Kafka locate data for consumption so quickly?
Suppose you want to consume the record whose offset is 3. First, a binary search over the segment files (whose names are their base offsets) determines which .index file covers that offset; then the entry in the .index file gives the record's position in the .log file. In this way data can be located and consumed quickly.
So although Kafka stores data on disk, reading it is still very fast.
3. Kafka producers and consumers 3.1 Kafka partitioning
The main reason for Kafka's partitioning is to provide concurrency and improve performance, because reads and writes are done at the partition level.
To which partition does the producer send the message?
A. Specify a partition explicitly on the client
B. Round-robin (recommended): message 1 to P1, message 2 to P2, message 3 to P3, message 4 to P1, message 5 to P2, message 6 to P3, and so on
3.2 How does Kafka ensure data reliability? To ensure that data sent by the producer reliably reaches the specified topic, each partition of the topic must send an ACK (acknowledgment) back to the producer after receiving the data. If the producer receives the ACK, it proceeds to the next round of sending; otherwise, it resends the data.
When should the leader send the ACK? Only once the followers are in sync with the leader should the leader send the ACK; this guarantees that no data is lost if the leader dies and a new leader is elected from the followers.
Then, how many followers must finish synchronizing before the ACK is sent?
Option 1: send the ACK once half of the followers have completed synchronization.
Option 2: send the ACK only after all followers have completed synchronization.
After adopting the second option, imagine the following scenario: the leader receives the data and all the followers begin synchronizing it, but one follower cannot finish due to some fault. The leader would have to wait for it before sending the ACK, which badly hurts efficiency. How is this solved? The leader maintains a dynamic ISR (in-sync replica set): only the followers currently in the ISR need to finish synchronizing before the ACK is sent, and a follower that lags too far behind is removed from the set.
How are nodes chosen for the ISR?
First, by communication time: a follower must be able to exchange data with the leader quickly; the default threshold is 10 s (the replica.lag.time.max.ms setting).
Second, by the data gap behind the leader, which defaulted to 10000 messages (this criterion was removed in later versions).
Why was it removed? Because Kafka sends messages in batches, the leader receives a whole batch at once while the followers have not yet pulled it, so followers were frequently kicked out of the ISR and then re-added. Since ISR membership is stored in ZooKeeper and in memory, this meant ZooKeeper and memory were updated constantly.
However, for some less important data, where reliability requirements are not very high and a small amount of loss can be tolerated, there is no need to wait until all the followers in the ISR have received the data successfully.
So Kafka provides the user with three reliability levels, which the user can choose between to trade off reliability against latency. This is configured on the producer, via the acks parameter.
A. acks = 0
The producer does not wait for an ACK and simply hands the data to the topic, so there is a high probability of data loss.
B. acks = 1
The leader returns an ACK as soon as it has written the data to its own disk, without waiting for the followers; this can lose data. If the leader fails before the followers have synchronized, the data is lost.
C. acks = -1 (all)
An ACK is returned only after both the leader and the followers in the ISR have written the data to disk; this can duplicate data. If a failure occurs after the leader and the ISR followers have written the data but before the ACK reaches the producer, the producer resends and the data is duplicated. There is also an edge case in which data can still be lost: if communication between the followers and the leader is very slow, the ISR may shrink until it contains only the leader; the leader then writes to disk and returns the ACK on its own, and if the leader fails at that moment, the data is lost.
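As a sketch, the level is chosen in the producer configuration, e.g. in producer.properties (the broker address is an assumption; retries matters here because resending on a missing ACK is what produces the duplicates described above):
bootstrap.servers=es1:9092
# 0, 1, or all (equivalent to -1)
acks=all
# resend when no ACK arrives; with acks=all this yields at-least-once
retries=3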
3.3 How does Kafka ensure consistency of the data consumers see? This is guaranteed by the HW.
HW (high watermark): the largest offset visible to consumers, equal to the smallest LEO (log end offset) in the ISR. In the example here, that means consumers can only see records 1 through 6; the data beyond that can be neither seen nor consumed.
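A small worked example of that rule (the numbers are illustrative): if the leader's LEO is 10 and the two followers' LEOs are 8 and 6, then HW = min(10, 8, 6) = 6, so consumers can read only up to offset 6 even though the leader already holds records 7 through 10.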
The point is to avoid problems when the leader fails. For example, suppose consumers have consumed up to record 8 when the leader fails, and F2 becomes the new leader but does not have record 9 at all; consumers would then get errors. The HW is designed to expose only the data that every ISR replica holds, which avoids this problem.
3.3.1 How the HW ensures consistency of stored data A. Follower failure
After a follower recovers, it reads the last HW recorded on its local disk, truncates the part of its log file above the HW, and synchronizes with the leader starting from the HW. Once the follower's LEO is greater than or equal to the partition's HW, that is, once the follower has caught up with the leader, it can rejoin the ISR.
B. Leader failure
If the leader fails, a new leader is elected from the ISR. Then, to ensure data consistency across the replicas, the remaining followers truncate the parts of their log files above the HW (the new leader does not truncate its own log) and synchronize data from the new leader.
Note: this only guarantees that data storage is consistent across replicas; it does not guarantee that data is neither lost nor duplicated.
If acks is set to -1, data is not lost but may be duplicated (at least once).
If acks is set to 0, data is not duplicated but may be lost (at most once).
But what if you want to have your cake and eat it too? That is where Exactly Once comes in.
In version 0.11, idempotence was introduced to address data duplication inside the Kafka cluster; before 0.11, consumers had to handle deduplication themselves.
If idempotence is enabled, acks defaults to -1. Kafka assigns every producer a PID and every message a SeqNumber. If the <PID, Partition, SeqNumber> triple of a message matches one already seen, Kafka treats the data as a duplicate and does not persist it. However, if the producer restarts it receives a new PID, so duplication can still occur; idempotence therefore solves duplication within a single partition in a single session, but not across partitions or across sessions.
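Enabling it is a single producer setting; a minimal sketch of the relevant producer.properties lines (acks is shown explicitly even though idempotence already forces it to all):
enable.idempotence=true
acks=all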
3.4.1 Consumption modes. There are two ways to consume messages from a message queue: push (as with a WeChat official account) and pull (Kafka). The push mode has difficulty adapting to consumers with different consumption rates, because the delivery rate is determined by the broker, whose goal is to deliver messages as fast as possible; this easily overwhelms consumers, typically manifesting as denial of service and network congestion. The pull mode lets consumers fetch messages at a rate suited to their own processing capacity.
The drawback of pull is that when Kafka has no data, the consumer may spin in a loop, repeatedly getting back empty responses. To handle this, Kafka's consumer passes a timeout parameter when pulling data: if no data is available, the consumer waits up to the timeout before returning.
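The console consumer exposes a similar idea as a flag; a sketch, assuming a broker at es1:9092 and a topic named first:
bin/kafka-console-consumer.sh --bootstrap-server es1:9092 --topic first --timeout-ms 5000
Here the consumer exits if no message becomes available within 5 seconds, instead of looping on empty responses forever.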
3.4.2 Partition allocation policy. A consumer group has multiple consumers and a topic has multiple partitions, so partition allocation is inevitably involved, that is, deciding which partition is consumed by which consumer.
Kafka provides two strategies: RoundRobin, which allocates across all of the group's subscribed topics together, and Range, which allocates topic by topic.
RoundRobin: the prerequisite is that all consumers in the group subscribe to the same topics; otherwise there will be problems. It is not the default mode.
Consumers in the same consumer group cannot consume the same partition at the same time
For example, three consumers consume the nine partitions of one topic.
When a group subscribes to two topics, RoundRobin treats the two topics as one: all of their topic-partition pairs are sorted by hash value, and then dealt out in turn to the two consumers in the consumer group.
What if you subscribe in the following way?
For example, there are 3 topics, each with 3 partitions, and 2 consumers in a consumer group; consumer 1 subscribes to topic1 and topic2, and consumer 2 subscribes to topic2 and topic3. In such a scenario, RoundRobin assignment causes problems: a consumer can be handed partitions of a topic it never subscribed to.
What if you subscribe in the following way? For example, there are two topics, each with three partitions, and a consumer group with two consumers; consumer 1 subscribes to topic1 and consumer 2 subscribes to topic2. Subscribing this way under RoundRobin also causes problems.
That is why we keep emphasizing that the prerequisite for using RoundRobin is that all consumers in the group subscribe to the same topics;
and that is why RoundRobin is not Kafka's default.
Range: divides partitions topic by topic; this is the default allocation strategy.
Range's problem is that it can leave consumers with unbalanced loads.
For example, suppose a consumer group subscribes to 2 topics, each with three partitions. Within each topic, Range splits the three partitions 2/1 between the two consumers, so consumer 1 ends up consuming 4 partitions while the other consumer consumes only 2.
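If the RoundRobin behavior is preferred over the Range default, the strategy can be switched in the consumer configuration; a minimal sketch:
partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor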
3.4.3 Maintaining offsets. A consumer may suffer failures such as power outages while consuming. After recovering, it needs to continue from the position it had reached before the failure, so the consumer must record which offset it has consumed in order to resume there.
Offsets can be stored in two places: ZK and Kafka.
Let's first look at storing offsets in ZK.
A unique offset is determined by three elements: consumer group, topic, and partition.
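A sketch of the ZK node layout that reflects those three elements (the group and topic names are placeholders); the committed offset is stored at:
/consumers/<group.id>/offsets/<topic>/<partition>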
So even if a consumer in the group dies, the other consumers can still retrieve the offset and carry on.
3.4.5 Consumer group example. Modify the consumer group ID.
Then start another consumer that belongs to a different consumer group.
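A sketch of the steps, assuming a broker at es1:9092, a topic named first, and a group named g1 (all assumptions): put the group ID in a consumer properties file, then start a console consumer with that file.
# config/consumer.properties
group.id=g1
bin/kafka-console-consumer.sh --bootstrap-server es1:9092 --topic first --consumer.config config/consumer.properties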
The Kafka producer writes data to a log file by continually appending to the end of the file in sequence. Sequential writes to the same disk can reach about 600 MB/s, while random writes reach only about 100 KB/s. This comes down to the mechanical structure of the disk: sequential writing is fast because it saves a great deal of head-seek time.
4.3 Zero-copy technology. Normally, data is first read into kernel space, then copied from kernel space into user space, then handed back to kernel space through the operating system's I/O interface, and finally written out to the device. Zero copy skips the round trip through user space: the data is transferred directly within kernel space (via the operating system's sendfile mechanism), eliminating the redundant copies.
A single broker in the Kafka cluster is elected as the controller, which is responsible for managing brokers coming online and going offline, assigning partition replicas for all topics, and leader election.