Background

On February 10 at 1:30 PM, we received feedback from a user: partition 34 of Kafka cluster A could not elect a leader, and as a result messages sent to this partition failed:

```
In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.
```

Broker0 then disappeared from Kafka-Manager and appeared to be in a zombie state, although its process was still alive. A restart attempt hung for a long time without responding, so the process was killed with the kill -9 command; the subsequent restart then failed, causing the following problem:

The leader replica of partition 34 of topic A was on broker0, and the other replica had been kicked out of the ISR because it could not keep up with the leader. In Kafka 0.11, the broker parameter unclean.leader.election.enable defaults to false, which means a replica outside the ISR cannot be elected leader. As a result, messages sent to topic A kept failing with "the leader of partition 34 does not exist", and the messages not yet consumed from that partition could not be consumed.

Kafka log analysis

The server log file (server.log) contains the following entries generated while Kafka was restarting:

We found a large number of warnings about corrupted index files being detected and rebuilt. Locating the source of the message:

kafka.log.OffsetIndex#sanityCheck

Here is my description of it as I understand it:

On startup, Kafka checks each directory in ${log.dirs} for a .kafka_cleanshutdown marker file to determine whether the previous shutdown was clean. If it was not, Kafka performs log recovery, during which the sanityCheck() method is called on the index file of each log segment to verify its integrity. The check involves three fields:

  • entries: Kafka’s index file is a sparse index; instead of storing the position of every message in the .index file, it records one entry per batch of messages. So the number of entries in an index file = mmap.position / entrySize;
  • lastOffset: the offset of the last entry, that is, lastOffset = lastEntry.offset;
  • baseOffset: the base offset of the index file, which is the number in the index file’s name.
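As a small sketch of the entries calculation above (the 8-byte entry size is an assumption based on Kafka's OffsetIndex, which stores a 4-byte relative offset plus a 4-byte file position per entry; the class below is illustrative, not Kafka's actual code):

```java
// Sketch: how a sparse .index file's entry count can be derived.
public class IndexEntries {
    // Assumed entry size: 4-byte relative offset + 4-byte position.
    static final int ENTRY_SIZE = 8;

    // entries = mmap.position / entrySize
    static int entries(int mmapPosition) {
        return mmapPosition / ENTRY_SIZE;
    }

    public static void main(String[] args) {
        // An index whose mmap has advanced 80 bytes holds 10 entries.
        System.out.println(IndexEntries.entries(80));
    }
}
```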

The mapping between index files and log files is as follows:

An index file is judged to be corrupted by the following condition:

```
(_entries == 0 || _lastOffset > baseOffset) == false  // corrupted
(_entries == 0 || _lastOffset > baseOffset) == true   // intact
```

My understanding of this judgment logic is as follows:

If _entries is zero, the index has no content and the file is considered uncorrupted. If _entries is non-zero, the check verifies that the index file's last offset is greater than its base offset; if it is not, the index file is corrupted and needs to be rebuilt.
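The judgment above can be sketched as follows (class and method names are illustrative, not Kafka's actual API):

```java
// Sketch of the sanity check: an index is considered intact when it is
// empty, or when its last offset lies beyond its base offset.
public class IndexSanity {
    static boolean isIntact(int entries, long lastOffset, long baseOffset) {
        return entries == 0 || lastOffset > baseOffset;
    }

    public static void main(String[] args) {
        System.out.println(IndexSanity.isIntact(0, 0L, 100L));   // empty index: intact
        System.out.println(IndexSanity.isIntact(5, 180L, 100L)); // lastOffset > baseOffset: intact
        System.out.println(IndexSanity.isIntact(5, 100L, 100L)); // corrupted: rebuild needed
    }
}
```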

So why does this happen?

I seem to have found some answers in the related issues:

Issues.apache.org/jira/browse…

Issues.apache.org/jira/browse…

In short, this seems likely to happen on abnormal exit in older versions.

During recovery, Kafka deletes the corrupted index files and rebuilds them. Let's continue with the error message that caused the restart to fail:

It is during this delete-and-rebuild of the index that the problem occurred. There are many descriptions of this bug on issues.apache.org, and I am posting two of them here:

Issues.apache.org/jira/browse…

Issues.apache.org/jira/browse…

These bugs are subtle and very difficult to reproduce, and since they no longer exist in later versions of Kafka, the most important fix is to upgrade the Kafka version. I will continue to study the source code once I am familiar with Scala, and present the details at the source-code level.

Solution analysis

For the two issues described in the background, the root cause of the inconsistency was broker0's failure to restart, so we need to bring broker0 up successfully to restore partition 34 of topic A.

Since only the log and index files were damaged, we just need to delete the damaged log and index files and restart.

However, if the log and index files of partition 34 are also damaged, the unconsumed data in this partition will be lost, for the following reason:

At this point the leader of partition 34 is still on broker0. Since broker0 is down and partition 34's ISR contains only the leader, the partition is unavailable. Now suppose we wipe the leader's data on broker0. After restarting, Kafka will still use the replica on broker0 as the leader, so the leader's offset is authoritative; because the leader's log is now empty, the follower must truncate its data down to zero so that it is not ahead of the leader's offset.
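A minimal sketch of why the follower's data would be lost, assuming a simplified model in which a follower truncates its log to the leader's log end offset (real replication also involves leader epochs; names are illustrative):

```java
// Sketch: a follower never keeps records beyond the leader's log end
// offset (LEO), so an emptied leader drags the follower down to zero.
public class TruncationDemo {
    static long truncateTo(long followerLogEndOffset, long leaderLogEndOffset) {
        return Math.min(followerLogEndOffset, leaderLogEndOffset);
    }

    public static void main(String[] args) {
        // Leader restarted with an empty log (LEO = 0); the follower held 500 records.
        System.out.println(TruncationDemo.truncateTo(500L, 0L)); // follower truncates everything
    }
}
```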

This seems unreasonable. Could an operation like the following be provided here:

When a partition is unavailable, allow manually designating any replica in the partition as the leader?

I will analyze this issue in a separate article later.

Subsequent cluster optimization

  1. Develop an upgrade plan to upgrade the cluster to version 2.2;
  2. Increase the default systemd stop timeout on each server to 600 seconds. I found that on the day of the failure, ops shut down the broker on node 33, got no response for a long time, and force-killed it with kill -9. As I understand it, shutting down a Kafka broker involves a lot of work and can take quite a while, whereas systemd's default stop timeout of 90 seconds then kills the process, which amounts to an abnormal exit;
  3. Set the broker parameter unclean.leader.election.enable to true (allow a replica outside the ISR to be elected leader, trading some consistency for availability);
  4. Set the broker parameter default.replication.factor to 3 (more highly available, but increases the cluster's storage pressure; open to further discussion);
  5. Set the broker parameter min.insync.replicas to 2 (this ensures at least two in-sync replicas, but is the performance cost necessary, given that we have already set unclean.leader.election.enable to true?);
  6. Have producers send with acks=1 (ensure the leader replica has written the message successfully before acknowledging; but is this necessary, given the possible performance penalty?).
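Under the plan above, items 2 through 5 might look like the following configuration fragments (file paths are illustrative assumptions; note that acks=1 is a producer-side setting, not a broker one):

```
# server.properties (broker side)
unclean.leader.election.enable=true   # allow electing a leader from outside the ISR
default.replication.factor=3          # three replicas for newly created topics
min.insync.replicas=2                 # require at least two in-sync replicas
```

```
# /etc/systemd/system/kafka.service.d/override.conf (illustrative path)
[Service]
TimeoutStopSec=600
```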

Author’s brief introduction

Zhang Chenghui currently works in the Technology Platform Department of the Zhongtong Technology Information Center, where he is mainly responsible for developing the Zhongtong message platform and the full-link stress-testing project. He loves sharing technology, writes the WeChat official account "Back-end Progress", blogs at objcoding.com, and is a Seata Contributor. GitHub ID: objcoding.