1. Preface



Hello, everyone.

Kafka is a distributed, multi-partition, multi-replica message streaming platform coordinated by ZooKeeper. It is also an open-source publish-subscribe message engine.

My team uses Kafka as its messaging middleware, and we recently ran into a bizarre bug that took a lot of effort to track down.

2. Origin of the bug

One feature of the module I am responsible for is changing the host machine's network IP.

The feature's page looks like this:

After an IPv4 address is entered, the host address of the service is updated.

First, some background: all services in the test environment are deployed on a single machine. After the IP is changed, the configuration in the Nacos configuration center must be updated accordingly, along with the system environment variables that each middleware component reads. The machine then restarts, and after the restart the Docker containers for each business application and middleware component start automatically.
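As a minimal sketch of the shell side of that flow (the function name, config path, and `HOST_IP` variable name are all assumptions for illustration, not the real script):

```shell
#!/bin/sh
# update_host_ip: rewrite the HOST_IP entry in an env-style config file.
# The variable name HOST_IP and the file layout are hypothetical.
update_host_ip() {
  conf="$1"
  new_ip="$2"
  # replace the existing HOST_IP line with the new address in place
  sed -i "s/^HOST_IP=.*/HOST_IP=${new_ip}/" "$conf"
}
```

The Java program would invoke a script like this for each config file that embeds the host IP, then trigger the reboot.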

From a requirements standpoint, the logic is fairly simple: the Java program calls shell scripts to perform some operations on the host, then restarts the machine.

I banged out the feature code in no time.

After writing it, I tested: `ifconfig` showed the IP had been changed, `docker ps` showed the containers had started normally, and a quick smoke test of the front-end pages found everything working.

I am a code genius

Then the tester told me that the system's operation records were no longer being generated. Operation records are captured by an interceptor and sent to Kafka, and a dedicated logging module consumes them into ES.

Suspicious! That feature was also developed recently and had just gone through a round of bug fixes; my self-testing found no problems. Looking at the code, the last commit to that module was also mine.

Strange.

A look at the backend logs showed a WARN message being printed frantically:

```
1 partitions have leader brokers without a matching listener...
```

I went into Kafka's container to check the service's consumption:

```shell
kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --group baiyan-topic --describe
```

Consumption was lagging behind, so I checked what data the topic was stuck on:

```shell
kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 --topic baiyan-topic --offset 373159 --partition 0
```

The data looked like this:

```json
{omitted business fields, "createTime":"2021-10-20 16:27:39.447","updateTime":"2021-10-20 16:27:39.447"}
```

What?

Why is this timestamp so close to the time I changed the test environment's IP?

From that point in time on, there was a huge backlog of messages that consumers could not process!

3. Finding the problem

3.1. Googling the error

To confirm that this bug was caused by the IP change, I had the tester set up an identical environment on another machine. I changed the IP, and sure enough, the problem appeared again.

OK. At this point, we have two leads, or entry points, for solving the problem:

1. The error message: `1 partitions have leader brokers without a matching listener`

2. The IP change: after the IP is modified, Kafka messages pile up and cannot be consumed; according to the error message, the partition has no leader node.

Ok, bug solving begins, step one, Google, ha ha ha ~

Google search: 1 partitions have leader brokers without a matching listener

The answers online generally fall into these categories:

  1. The Kafka server is down and rebalancing failed.
  2. The firewall is not turned off, blocking network connectivity.
  3. The advertised (proxy) IP address is misconfigured.

Let's go through them one by one.

First point: the Kafka container started normally, and a newly created topic could be produced to and consumed from normally. Excluded!

Second point: the firewall is not enabled, and even if it were, the port is reachable. Excluded!

Third point: according to Google, this looked most likely to be the answer.

Take a look at the contents of the configuration file:

```shell
cat /opt/kafka/config/server.properties
```

The IP-related parameters were normal and had already been updated to the target IP, and the configurations suggested by the search results were commented out.

We were using the older configuration option:

```properties
advertised.host.name=<target IP>
```

With nothing else to try, I still modified the configuration as the internet suggested, in a let's-see spirit, and restarted.

As expected, it didn't work.

3.2. Starting from the IP change

Another round of Googling turned up the same answers: Kafka cluster configuration and network problems, basically identical to what I found in 3.1.

Dead end.

3.3. Starting from first principles

Having failed to solve the problem with the clues from 3.1 and 3.2, I felt a bit stuck and didn't know where to start.

On second thought: Kafka interacts directly with consumers, while the Kafka cluster information and other synchronization state are maintained in ZooKeeper.

All right, let's check them one by one.

Let's look at the topic first:

```shell
bash-4.4# kafka-topics.sh --zookeeper 127.0.0.1:2181 --topic baiyan-topic --describe
Topic: baiyan-topic  PartitionCount: 1  ReplicationFactor: 1  Configs:
    Topic: baiyan-topic  Partition: 0  Leader: none  Replicas: 1001  Isr: 1001
```

The leader is none. No wonder consumption was stuck.

Go into the ZooKeeper container and take a look at Kafka's registration information.

Enter the bin directory and start the ZooKeeper CLI.

1. Check the partition information:

```shell
ls /brokers/topics/baiyan-topic/partitions
[0]
```

2. Check the broker ids:

```shell
ls /brokers/ids
[1002]
```

Notice anything off yet?

The broker id registered in ZooKeeper has changed to 1002, but the replica in Kafka's topic information and the node in the Isr are still 1001.

Kafka needs to get broker node information from ZooKeeper to build the cluster. The broker 1001 that the partition's replicas point to is no longer registered in ZooKeeper (only 1002 is), so the partition's leader is none and it cannot be served.

Ok, so now we know what causes this bug ~

3.4. Cause analysis

In 3.3 we figured out what causes the messages to block. So what causes the inconsistency between the broker information in ZK and in Kafka?

The breakthrough point is clear: the broker ids are inconsistent, so let's look at what determines how the brokerId is generated.

Looking at the Kafka parameter `broker.id`: when it is set to -1, Kafka auto-generates an id, starting above `reserved.broker.max.id` (default 1000).

As you can see from the above, the brokerId is determined mainly by two files: server.properties and meta.properties.
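A quick way to compare the two on the box is to pull `broker.id` out of each file. This is a hedged sketch; the paths in the usage comment are assumptions for this environment:

```shell
#!/bin/sh
# broker_id_of: print the broker.id value from a properties file
# (works for both server.properties and meta.properties).
broker_id_of() {
  grep '^broker.id=' "$1" | cut -d= -f2
}

# Example usage (paths are assumptions):
#   broker_id_of /opt/kafka/config/server.properties   # the configured id (-1 here)
#   broker_id_of /kafka/kafka-logs/meta.properties     # the generated id
```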

Take a look at the server.properties configuration first.

The default configuration is `broker.id=-1`, so auto-generated ids start from 1001 — which matches the 1001 in Kafka's topic information.

Take another look at meta.properties

It lives in the directory configured by `log.dirs` in server.properties. View its contents:

```properties
cluster.id=uHTKS_74RhW2_wKwbuwHxQ
version=0
broker.id=1002
```

Good guy, found the problem!

After the IP is changed, the script modifies `log.dirs`, so a new data directory is generated, while the brokerId recorded in the original topic metadata stays unchanged at 1001. On the ZK side, as soon as the old Kafka instance went offline, its 1001 registration was removed. When Kafka restarted, it created the new `log.dirs` data directory and, because `broker.id=-1` in server.properties, auto-incremented the id from 1001 to 1002, wrote it into the new meta.properties, and registered node 1002 in ZK. The end result: Kafka's topic metadata and ZK are inconsistent.

4. Solving the problem

Once you know the cause, the solution is clear: when the IP is changed, ensure that the brokerId in the newly generated data directory matches the brokerId in the topic metadata.

Method 1:

In the new log.dirs directory, change the `broker.id` in meta.properties back to 1001 and restart Kafka; the broker will then re-register as 1001 in ZK.
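Concretely, the edit might look like this (the meta.properties path is an assumption — substitute your own `log.dirs`):

```shell
#!/bin/sh
# reset_broker_id: force the broker.id in a meta.properties file back to a
# given value, so the restarted broker re-registers under the old id.
reset_broker_id() {
  meta="$1"
  id="$2"
  sed -i "s/^broker.id=.*/broker.id=${id}/" "$meta"
}

# e.g. reset_broker_id /kafka/kafka-logs/meta.properties 1001  (path assumed)
# Then restart Kafka so it registers the restored id in ZooKeeper.
```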

But this treats the symptom rather than the root cause, and it is clumsy: in cluster mode we do not know each node's id offhand, so manual intervention is needed every time.

Method 2:

Have each Kafka node explicitly specify the value of `broker.id` in server.properties instead of generating it dynamically.
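For example (the value 1001 here simply matches this post's environment; any fixed, per-node unique id works):

```properties
# server.properties: pin the broker id explicitly instead of leaving it as -1
broker.id=1001
```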

5. Summary

This article analyzed a problem where Kafka consumers could not consume data after the host IP was changed, walking through the bug investigation from the surface down to the root. The bug was finally traced to an inconsistent broker.id.

The ids were generated dynamically because the test environment was a standalone machine and broker.id was never specified.

In fact, the final fix was relatively simple: change the configuration and restart. The troubleshooting process was the hard part.

Looking back, the Kafka fundamentals that so many interviewers like to ask about are not unreasonable after all.

I strongly recommend that when using Kafka, whether a single node or a cluster, you explicitly specify each node's broker.id to avoid bizarre bugs like this one.

6. Contact me

If anything in this article is incorrect, corrections are welcome. Writing is not easy, so a like would be appreciated~

DingTalk: louyanfeng25

WeChat: baiyan_lou

WeChat official account: Uncle Baiyan