Today, Kafka is no longer just a message queue system. Kafka is a distributed stream processing platform that is being used by more and more companies. Kafka can be used for high performance data pipeline, stream processing analysis, data integration and other scenarios. This article has summarized some of the most common Kafka interview questions that you might find helpful. It mainly includes the following contents:
- How does Kafka protect against data loss?
- How to solve the Kafka data loss problem?
- Does Kafka completely guarantee that data is never lost?
- How do I guarantee that messages in Kafka are ordered?
- How do I determine the number of partitions for a Kafka topic?
- How do I adjust the number of Kafka topic partitions in production?
- How do I rebalance a Kafka cluster?
- How to check whether there is lagging consumption in the consumer group?
Q1: How does Kafka protect data from loss?
This question has become a ritual in Kafka interviews and, like Java's HashMap, is one of the most frequently asked. So what is it really asking? The question is how Kafka protects against data loss, that is, what mechanisms Kafka's broker provides to ensure that data is not lost.
In fact, for Kafka Broker, Kafka’s replication mechanism and partition multi-copy architecture are the core of Kafka reliability assurance. Writing messages to multiple copies allows Kafka to maintain message persistence in the event of a crash.
Now that you know the heart of the question, here is how to answer it. There are three main aspects:
1. Set the topic replication factor: replication.factor >= 3
2. Set the minimum number of in-sync replicas: min.insync.replicas = 2
3. Disable unclean leader election: unclean.leader.election.enable = false
The above three configurations will be analyzed step by step:
- Replication factor
Kafka’s topics are divided into partitions, and each partition can be configured with multiple replicas via the replication.factor parameter. Partition replicas in Kafka come in two types: the leader replica and follower replicas. When a partition is created, one replica is elected as the leader, and the remaining replicas automatically become followers. In Kafka, follower replicas do not serve client traffic: they respond to neither consumer nor producer read/write requests. All requests must be handled by the leader replica, so all reads and writes go to the broker hosting the leader replica. A follower replica’s only task is to stay in sync with the leader by asynchronously pulling messages from it and appending them to its own log.
In general, a replication factor of 3 suffices for most scenarios, though some deployments (banks, for example) use 5. With a replication factor of N, data can still be read from and written to a topic even if N-1 brokers fail, so a higher replication factor yields higher availability and reliability. On the other hand, a replication factor of N requires at least N brokers and stores N copies of the data, taking up N times the disk space. In a real production environment, there is a trade-off between availability and storage cost.
In addition, the placement of replicas also affects availability. By default, Kafka ensures that the replicas of a partition are distributed across different brokers, but if those brokers all sit in the same rack, the partition becomes unavailable when the rack’s switch fails. It is therefore recommended to spread brokers across racks; the broker.rack parameter configures the name of the rack a broker resides on.
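The replication-factor setting above is passed at topic creation time. A minimal sketch, assuming a reachable broker at `localhost:9092` and a hypothetical topic name `payments` (newer Kafka versions use `--bootstrap-server`; older ones used `--zookeeper`):

```shell
# Create a topic with 6 partitions and 3 replicas per partition
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic payments \
  --partitions 6 \
  --replication-factor 3

# In each broker's server.properties, tag the broker with its rack so that
# replicas of a partition are spread across racks, e.g.:
#   broker.rack=rack-1
```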
- In-sync replica (ISR) list
ISR stands for In-Sync Replicas: the replicas in the ISR are all synchronized with the leader, so a follower that is not in the list is considered out of sync with the leader. What counts as an in-sync replica? The first thing to make clear is that the leader replica is always in the ISR. Whether a follower replica is in the ISR depends on whether it is “synchronized” with the leader.
The Kafka broker has a parameter **replica.lag.time.max.ms**, which specifies the maximum allowed lag interval between a follower and the leader; the default is 10 seconds. As long as a follower has not fallen behind the leader for more than 10 seconds, it is considered in sync. So even if a follower is currently a few messages behind the leader, it will not be kicked out of the ISR as long as it catches up within 10 seconds.
As you can see, the ISR is dynamic: even if a partition is configured with three replicas, the in-sync replica list may shrink to a single replica (the others having been removed from the ISR because they could not keep up with the leader). If that last in-sync replica becomes unavailable, we must choose between availability and consistency (the CAP theorem).
According to Kafka’s definition of reliability, a message is not considered committed until it has been written to all in-sync replicas. But if “all in-sync replicas” means only one replica, the data is lost when that replica becomes unavailable. To ensure that committed data is written to more than one replica, set the minimum number of in-sync replicas to a higher value. For a topic partition with three replicas, min.insync.replicas=2 requires at least two in-sync replicas before data can be written to the partition.
With this setting, at least two replicas must remain in the ISR. If the number of replicas in the ISR drops below two, the broker stops accepting produce requests: producers attempting to send data receive a NotEnoughReplicasException, while consumers can still read the existing data.
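min.insync.replicas can also be set on an existing topic without restarting anything. A sketch, assuming the same hypothetical `payments` topic and broker address as above:

```shell
# Raise the minimum in-sync replica count for one topic
bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name payments \
  --add-config min.insync.replicas=2
```

Setting it per topic like this overrides the broker-wide default in server.properties.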
- Disable Unclean Election
Electing the new leader from the in-sync replica list is called a clean leader election. Distinguish this from electing the leader from among out-of-sync replicas, which is called an unclean leader election. Since the ISR is adjusted dynamically, it can end up empty. Out-of-sync replicas generally lag far behind the leader, so electing one of them as the new leader may lose data: after all, the messages they hold fall well behind those in the old leader.

In Kafka, the broker-side parameter **unclean.leader.election.enable** controls whether unclean leader election is allowed. Enabling it may result in data loss, but the upside is that the partition keeps a leader and continues serving clients, which improves availability. Disabling it preserves data consistency and avoids message loss, at the expense of availability. This is exactly the trade-off described by the CAP theorem for distributed systems.
Unfortunately, leader election may still lead to data inconsistency, because even “in-sync” replicas are not exactly synchronous: replication is asynchronous, so there is no guarantee that a follower has the very latest messages. For example, if the leader partition’s last message has offset 100, a follower’s offset may not have reached 100 yet; this lag window is affected by two parameters:
- replica.lag.time.max.ms: the maximum allowed lag between a follower replica and the leader replica
- zookeeper.session.timeout.ms: the ZooKeeper session timeout
In short, if we allow unsynchronized replicas to become leaders, we risk data loss and data inconsistency. If we do not allow them to become leaders, we accept lower availability, because we must wait for the original leader to become available again.
For unclean election, the right configuration depends on the scenario. Systems with strict requirements on data quality and consistency disable unclean leader election (a bank, for example). Systems with strict availability requirements, such as real-time clickstream analysis, generally leave it enabled.
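The three broker-side settings discussed in Q1 can be summarized in a server.properties sketch (the values reflect the consistency-first configuration described above, not universal defaults):

```properties
# server.properties (broker side, sketch)
# Broker-wide default replication factor for auto-created topics
default.replication.factor=3
# Require at least 2 in-sync replicas before a write is accepted
min.insync.replicas=2
# Prefer consistency over availability: never elect an out-of-sync leader
unclean.leader.election.enable=false
```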
Q2: How to solve Kafka data loss problem?
You might be asking: how is this different from Q1? In interviews the two are usually treated as one question, but they are separated here because the solutions differ. Q1 looks at data loss from the perspective of Kafka’s broker, while Q2 looks at it from the perspective of Kafka’s producers and consumers.
Here’s how to answer this question: There are two main aspects:
Producer
retries=Long.MAX_VALUE
Set retries to a large value. retries is a producer parameter that makes the producer retry automatically. When transient network jitter occurs, a send may fail; a producer configured with retries > 0 can automatically retry the send and avoid losing the message.
acks=all
Set acks = all. acks is a producer parameter that defines what counts as a “committed” message. Setting it to all means all in-sync replicas must receive the message before it is considered committed. This is the strictest definition of “committed.”
max.in.flight.requests.per.connection=1
This parameter specifies how many requests a producer can have outstanding before receiving a server response. A higher value consumes more memory but improves throughput. Setting it to 1 ensures that messages are written to the broker in the order they were sent, even when retries occur.
Use the producer API with callback notifications: instead of producer.send(msg), use producer.send(msg, callback).
Other error handling
Most errors can be handled without message loss by the producer’s built-in retry mechanism, but other types of errors, such as message-size errors and serialization errors, still need to be handled in application code.
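Putting the producer-side settings above together, a minimal producer configuration sketch might look like this (the values are illustrative; Long.MAX_VALUE is written out as its numeric value):

```properties
# producer.properties (sketch)
acks=all
retries=2147483647
max.in.flight.requests.per.connection=1
```

With retries this high, a failed send is retried until it succeeds or the delivery timeout expires, and the in-flight limit of 1 keeps retried batches from reordering.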
Consumer
Disable automatic offset commits: enable.auto.commit = false
The consumer commits the offset only after the message has been processed
Configure auto.offset.reset
This parameter specifies what the consumer does when there is no committed offset (for example, when the consumer starts for the first time) or when the requested offset no longer exists on the broker (for example, because the data has been deleted).
This parameter has two settings. With earliest, the consumer reads from the start of the partition whenever the offset is invalid, which minimizes data loss at the cost of reprocessing a large amount of duplicate data. With latest (the default), the consumer reads from the end of the partition, which reduces the number of reprocessed messages but makes it likely that some will be missed.
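The consumer-side settings above can likewise be collected in a small configuration sketch (loss-averse values; not the Kafka defaults):

```properties
# consumer.properties (sketch)
enable.auto.commit=false
auto.offset.reset=earliest
```

With auto commit disabled, the application is responsible for committing offsets itself (for example via the Java consumer's commitSync() or commitAsync()) only after each message has been fully processed.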
Q3: Does Kafka completely guarantee that data is never lost?
The measures analyzed above can avoid data loss to a certain extent. Note, however, that Kafka guarantees only limited persistence, and only for committed messages. Kafka therefore cannot completely rule out data loss; trade-offs have to be made.
First, understand what a committed message is. When Kafka’s brokers successfully receive a message and write it to the log file, they tell the producer that the message was successfully committed; at that point the message officially becomes “committed” in Kafka’s eyes. Whether acks=all or acks=1, Kafka’s persistence guarantee applies only to committed messages.
Second, understand that the persistence guarantee is conditional, meaning that Kafka cannot guarantee no message loss under every circumstance: the brokers must be available. In other words, if a message is stored on N brokers, at least one of those N brokers must be alive. As long as this condition holds, Kafka guarantees that your message will never be lost.
To summarize, Kafka does not lose messages, but the messages must be committed and certain conditions must hold.
Q4: How do I guarantee that messages in Kafka are ordered?
First, to be clear: Kafka’s topics are partitioned. If a topic has multiple partitions, the producer routes each message to a partition according to its key, so for a given key, the corresponding records are ordered within that partition.
Kafka guarantees that messages within a partition are ordered: the broker writes messages to the partition in the order the producer sent them, and consumers consume them in that same order.
In some scenarios, the order of messages is very important. For example, depositing and then withdrawing money gives a very different outcome from withdrawing and then depositing.
The parameter max.in.flight.requests.per.connection=1 mentioned above is what guarantees ordering when the retry count is greater than or equal to 1. If this parameter is greater than 1, then when the first batch fails and the second batch succeeds, the broker retries the first batch; if that retry succeeds, the order of the two batches is reversed.
In general, if you care about message order, then to avoid losing data you first need to set retries > 0, and at the same time set max.in.flight.requests.per.connection to 1, so that while the producer retries the first batch of messages, no other messages are sent to the broker. This hurts throughput but ensures the order of messages.
In addition, single-partitioned topics can be used, but with a significant throughput impact.
Q5: How do I determine the appropriate number of Kafka topic partitions?
Choosing the proper number of partitions achieves highly parallel reads and writes along with load balancing; balancing load across partitions is the key to throughput. Estimate the number based on the expected throughput of producers and consumers per partition.
For example, if the expected read rate (throughput) is 1GB/Sec and the read rate of a consumer is 50MB/Sec, at least 20 partitions and 20 consumers (one consumer group) are required. Similarly, if you expect to produce data at a rate of 1GB/Sec, and each producer produces at a rate of 100MB/Sec, you need 10 partitions. In this case, if you set up 20 partitions, you can guarantee both 1GB/Sec production rate and consumer throughput. It is often necessary to adjust the number of partitions to the number of consumers or producers so that both producer and consumer throughput can be achieved.
A simple calculation formula is: number of partitions = Max (number of producers, number of consumers)
- Number of producers = total production throughput/maximum production throughput of each producer for a single partition
- Number of consumers = total consumption throughput/maximum throughput consumed per consumer from a single partition
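Reworking the sizing example above as shell arithmetic (all throughputs in MB/s; the figures are the article's example numbers, not measurements):

```shell
target=1000        # desired total throughput: 1 GB/s
per_consumer=50    # each consumer can read 50 MB/s
per_producer=100   # each producer can write 100 MB/s

consumers=$(( target / per_consumer ))   # consumers needed: 1000/50 = 20
producers=$(( target / per_producer ))   # producers needed: 1000/100 = 10

# partitions = max(number of producers, number of consumers)
if [ "$consumers" -gt "$producers" ]; then
  partitions=$consumers
else
  partitions=$producers
fi
echo "partitions=$partitions"   # prints partitions=20
```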
Q6: How do I adjust the number of Kafka topic partitions in production?
Note that when we increase the number of partitions of a topic, the mapping from keys to partitions changes, breaking the guarantee that the same key always lands in the same partition. One workaround is to create a new topic with more partitions, suspend the producers, copy data from the old topic into the new one, and then switch consumers and producers over to the new topic; this is tricky to do.
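If the key-ordering caveat above is acceptable, the partition count can be raised in place. A sketch, assuming a hypothetical topic `payments` currently at 6 partitions and a broker at `localhost:9092`:

```shell
# Increase the partition count of the topic to 12
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic payments --partitions 12
```

Kafka can only increase a topic's partition count, never decrease it, and keyed messages may hash to a different partition after the change.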
Q7: How do I rebalance Kafka clusters?
The cluster needs to be rebalanced when:
- The uneven distribution of topic partitions across the cluster results in an unbalanced cluster load.
- A broker goes offline, causing partitions to be out of sync.
- Newly added brokers need to get loads from the cluster.
Use the kafka-reassign-partitions.sh command to rebalance.
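A reassignment is typically a three-step generate/execute/verify workflow. A sketch, assuming a hypothetical topic `payments`, brokers with IDs 0-3, and a broker address of `localhost:9092` (older Kafka versions take `--zookeeper` instead of `--bootstrap-server`):

```shell
# Describe which topics to move
cat > topics.json <<'EOF'
{ "version": 1, "topics": [ { "topic": "payments" } ] }
EOF

# 1. Generate a candidate plan spreading the partitions over brokers 0-3
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json \
  --broker-list "0,1,2,3" --generate

# 2. Save the proposed plan printed above into reassignment.json, then run it
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute

# 3. Check whether the data movement has completed
bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --verify
```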
Q8: How to check whether lagging consumption exists in the consumer group?
We can use the kafka-consumer-groups.sh command to check this, for example:
$ bin/kafka-consumer-groups.sh --bootstrap-server cdh02:9092 --describe --group my-group
The command displays the following columns: TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG, CONSUMER-ID, HOST, CLIENT-ID. CURRENT-OFFSET is the consumer’s current committed offset for the partition, LOG-END-OFFSET is the partition’s latest offset (the LEO), and LAG is the number of messages the consumer is behind.
Normally, if consumption keeps up, the value of CURRENT-OFFSET will be very close to LOG-END-OFFSET. You can use this command to see which partitions are lagging in consumption.
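The LAG column is simply LOG-END-OFFSET minus CURRENT-OFFSET. With the made-up numbers below, the consumer is 50 messages behind on this partition:

```shell
log_end_offset=1200   # partition's latest offset (LEO)
current_offset=1150   # consumer's committed offset
lag=$(( log_end_offset - current_offset ))
echo "lag=$lag"   # prints lag=50
```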
Conclusion
This article focuses on eight common Kafka interview questions, with answers for each question. With these questions, I believe you will have a deeper understanding of Kafka.
If you found this article helpful, please share, like, and retweet.