For daily operations and problem troubleshooting, how can you do without Didi's open-source LogiKM one-stop Kafka monitoring and control platform?
This article came out of a question from a group member; answering it turned into this post.
To join the group, add me on WeChat: jjdlmn_
The Kafka column index is available here
Terminology: OriginBroker is the Broker holding the replica before migration; TargetBroker is the Broker that receives the replica after migration.
Prerequisites
I encourage you to read the following articles first (if a link doesn't work, I haven't published that piece yet):
[[Kafka source] ReassignPartitionsCommand source code analysis (scaling, data migration, replica reassignment, cross-path migration)]()
[[Kafka operations] Scaling up and down, data migration, replica reassignment, cross-path migration]()
Kafka's soul mate Logi-KafkaManager (4): cluster operations (data migration and online cluster upgrade)
If you don't want to read all of that, just study the diagram I've drawn below; from it you can work out for yourself what might go wrong and how to handle it.
All possible failures
1. TargetBroker is not online: the migration script fails to execute
If the TargetBroker is not online, the reassignment task fails validation as soon as the script runs.
Scenario demonstration
BrokerId | Role | State | Replicas |
---|---|---|---|
0 | Ordinary Broker | Normal | test-0 |
1 | Ordinary Broker | Down | None |
Now migrate partition test-0 from Broker 0 to Broker 1:
sh bin/kafka-reassign-partitions.sh --zookeeper xxxxxx:2181/kafka3 --reassignment-json-file config/reassignment-json-file.json --execute --throttle 1000000
Execution fails:
Partitions reassignment failed due to The proposed assignment contains non-existent brokerIDs: 1
kafka.common.AdminCommandFailedException: The proposed assignment contains non-existent brokerIDs: 1
at kafka.admin.ReassignPartitionsCommand$.parseAndValidate(ReassignPartitionsCommand.scala:348)
at kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:209)
at kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:205)
at kafka.admin.ReassignPartitionsCommand$.main(ReassignPartitionsCommand.scala:65)
at kafka.admin.ReassignPartitionsCommand.main(ReassignPartitionsCommand.scala)
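Before submitting a task, you can check which Brokers are actually registered (alive) in ZooKeeper. A minimal sketch using the ZooKeeper shell that ships with Kafka, reusing the address and chroot /kafka3 from the command above:

sh bin/zookeeper-shell.sh xxxxxx:2181/kafka3
# inside the shell: list the ids of the currently registered brokers
ls /brokers/ids

If Broker 1 does not show up in the output, the reassignment above is guaranteed to fail validation.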
2. TargetBroker goes down during the migration, leaving the task stuck in progress
This usually happens after the reassignment data has been written to the node /admin/reassign_partitions, when one (or more) of the TargetBrokers goes down; that Broker can then neither create the new replica nor sync from the Leader, so the task cannot move forward.
Scenario demonstration
To simulate this situation, we can manually write reassignment data to the node /admin/reassign_partitions:
- Create the node with the data below, in which Broker 1 is not online; this simulates a crash in the middle of the reassignment:
{"version":1,"partitions":[{"topic":"test","partition":0,"replicas":[1]}]}
- Check the data in the node /brokers/topics/{topicName}; it now records the pending reassignment.
- Next, a LeaderAndIsr request should be sent to Broker 1 so that it creates the replica and starts syncing from the Leader; but Broker 1 is offline, so the task stays in progress forever. If you try to launch another reassignment at this point, you get the prompt:
There is an existing assignment running.
The solution
Once you know what is going on, the fix is obvious: just restart the Broker that went down.
3. The replica being migrated has no Leader, so the TargetReplica can never sync
If the Leader of the partition being migrated is down and no new Leader has been elected, the new replica has nowhere to sync from.
This is similar to case 2; the difference is which Broker has failed.
Scenario demonstration
BrokerId | Role | State | Replicas |
---|---|---|---|
0 | Ordinary Broker | Normal | None |
1 | Ordinary Broker | Down | test-0 |
Now migrate partition test-0 from Broker 1 to Broker 0:
{"version":1,"partitions":[{"topic":"test","partition":0,"replicas":[0],"log_dirs":["any"]}]}
As the figure above shows, the TargetReplica receives the LeaderAndIsr request, creates the replica, and the TargetBroker is written into the partition's AR in ZK.
It then starts syncing from the Leader. Who is the Leader at this point? It is test-0 on Broker 1, the partition's only replica. But the OriginBroker (Broker 1) is offline, so there is nothing to sync from: the TargetReplica gets created but never receives any data, as shown below.
The TargetReplica exists but is empty; and because the OriginBroker (Broker 1) is offline, the old replica is never deleted either (in the listing below, kafka-logs-30 is Broker 0 and kafka-logs-31 is Broker 1).
Because the whole reassignment task never completes, /admin/reassign_partitions is not deleted and still contains:
{"version":1,"partitions":[{"topic":"test","partition":0,"replicas":[0]}]}
- The node /brokers/topics/{topicName} is updated as shown below; the AR and the adding/removing replica lists have not been cleared yet.
- The node /brokers/topics/test/partitions/0/state shows leader = -1, and the TargetBroker has not been added to the isr.
As long as the sync cannot succeed, the whole reassignment stays in progress.
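To confirm the stuck state by hand, a sketch from the ZooKeeper shell (topic test, partition 0 as in this scenario):

sh bin/zookeeper-shell.sh xxxxxx:2181/kafka3
# expect leader to be -1 here, and the TargetBroker to be absent from the isr
get /brokers/topics/test/partitions/0/state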
The solution
If the OriginBroker goes down but the partition has other replicas, one of them takes over as Leader and sync can continue. The exception only occurs when there is a single replica; in that case, simply restart the OriginBroker.
4. Throttling keeps the reassignment from ever completing
When we run a partition replica reassignment task, we generally add a throttle with --throttle: the transfer rate between Brokers during the migration, in bytes/sec.
Note that this limit applies to replication traffic between the Brokers as a whole, not just to the partition replicas being migrated; it also covers the normal follower sync of other topics. So if you set the throttle very low, lower even than the normal sync rate, or if your sync rate is lower than the rate at which new messages are produced, the task will never complete!
Scenario demonstration
- Create a reassignment task with a throttle value of 1:
sh bin/kafka-reassign-partitions.sh --zookeeper xxxx:2181/kafka3 --reassignment-json-file config/reassignment-json-file.json --execute --throttle 1
- At this rate the task will basically never finish; the node /admin/reassign_partitions stays there forever, and the throttle configuration remains visible in ZK.
The solution
Rerun the script above with a higher throttle value:
sh bin/kafka-reassign-partitions.sh --zookeeper xxxx:2181/kafka3 --reassignment-json-file config/reassignment-json-file.json --execute --throttle 100000000
Be sure to verify after the task finishes so that the throttle is removed! Otherwise it stays in effect forever.
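The standard way to clear it is the same script's --verify mode, which checks whether the reassignment has completed and removes the throttle configuration along the way:

sh bin/kafka-reassign-partitions.sh --zookeeper xxxx:2181/kafka3 --reassignment-json-file config/reassignment-json-file.json --verify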
5. The data volume is too large and sync is painfully slow
This situation is very common, and strictly speaking it is not an exception; there is not much you can do about raw performance. But when we migrate data we often overlook one thing: a lot of the data is stale, and migrating it is pointless.
As my earlier post Kafka's soul mate Logi-KafkaManager (4): cluster operations (data migration and online cluster upgrade) describes, migrating only the data that is still valid can greatly improve migration efficiency.
The solution
**Reduce the amount of data to migrate.** If the Topic to migrate holds a lot of data (the default retention is 7 days), you can temporarily lower retention.ms before the migration to shrink the data set; see the sketch below.
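A minimal sketch, assuming topic test and the same ZK address as the commands above (the one-hour value is just an example):

# temporarily shrink retention before the migration
sh bin/kafka-configs.sh --zookeeper xxxx:2181/kafka3 --alter --entity-type topics --entity-name test --add-config retention.ms=3600000
# restore the topic's default retention after the migration completes
sh bin/kafka-configs.sh --zookeeper xxxx:2181/kafka3 --alter --entity-type topics --entity-name test --delete-config retention.ms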
Of course, doing this by hand is tedious, and you can be smarter about it: Kafka's soul mate Logi-KafkaManager (4): cluster operations (data migration and online cluster upgrade) provides visualized data migration and replica reassignment that sets the throttle, reduces the amount of data to migrate, and automatically cleans up the throttle configuration after the migration.
Troubleshooting approach
I have listed above all the failure modes I can think of. So when you actually hit a reassignment that stays in progress forever (There is an existing assignment running.), how do you locate and resolve it quickly?
1. Look at the data in /admin/reassign_partitions
Suppose a task is as follows: partition test-0 is being reassigned to Brokers [0,1], and partition test-1 to Brokers [0,2]:
{"version":1,"partitions":[{"topic":"test","partition":0,"replicas":[0,1]},{"topic":"test","partition":1,"replicas":[0,2]}]}
Broker 1 is down, so test-0 cannot complete while test-1 completes normally. The /admin/reassign_partitions node then shrinks to:
{"version":1,"partitions":[{"topic":"test","partition":0,"replicas":[0,1]}]}
<font color=red> > <font color=red> > <font color=red> > < / font > < / font > </font>
So: the test-0 partition is the one that has not finished, and the Brokers involved are [0,1].
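A sketch of reading the node from the ZooKeeper shell:

sh bin/zookeeper-shell.sh xxxx:2181/kafka3
get /admin/reassign_partitions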
2. Look at /brokers/topics/{topicName}/partitions/{partition}/state
Step 1 told us that test-0 is the problem, so we read the node /brokers/topics/test/partitions/0/state directly. There are two possible situations:
- The Leader exists but the ISR is incomplete, as below:
{"controller_epoch":28,"leader":0,"version":1,"leader_epoch":2,"isr":[0]}
The isr [0] contains only Broker 0, while by the assignment above it should be [0,1]. So the replica on Broker 1 has not joined the ISR; the next question to investigate is why.
- The Leader is -1, as below:
{"controller_epoch":28,"leader":-1,"version":1,"leader_epoch":2,"isr":[0]}
leader: -1 means the partition currently has no Leader at all, so the newly added replica has nowhere to sync from. The next thing to check is whether all the other replicas of this TopicPartition are down. How do you know which Brokers those are? Check whether the AR is normal; the AR can be read from /brokers/topics/{topicName}.
Of course, you can use the Didi LogiKM one-stop Kafka monitoring and control platform to check this step more easily, as shown below.
3. From step 2, determine whether the corresponding Broker is down
If a down Broker is found, restart it and the task will complete.
4. Check the throttle
If step 3 did not resolve the problem and no Broker is down, it is time to look at the throttle configuration.
- First check the node /config/brokers/{brokerId} to see whether throttle rates are configured;
- then check the node /config/topics/{topicName} to see which replicas the throttle covers (see the sketch after the command below);
- if the replica that has not joined the ISR falls under these throttle settings, the slow sync may well be the throttle at work;
- if the throttle value you find is fairly small, increase it appropriately and rerun:
sh bin/kafka-reassign-partitions.sh --zookeeper xxxx:2181/kafka3 --reassignment-json-file config/reassignment-json-file.json --execute --throttle 100000000
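A sketch of inspecting the throttle configuration from the ZooKeeper shell; the property names in the comments are the standard Kafka throttle configs, which I am assuming is what the reassignment script wrote:

# broker level: look for leader.replication.throttled.rate / follower.replication.throttled.rate
get /config/brokers/0
# topic level: look for leader.replication.throttled.replicas / follower.replication.throttled.replicas
get /config/topics/test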
5. Resubmit the reassignment task (stopping the previous one)
If none of the above solves the problem, you may simply have too many replicas and too much data to move, or the TargetBroker's network may be poor; the transfer has hit its limit. That is a performance bottleneck: perhaps you should reconsider the reassignment plan, or run it in the dead of night.
Scenario demonstration
- Partition test-0, previously only on Broker [0], is reassigned to [0,1], using --throttle 1 to simulate a slow network, a performance bottleneck, and so on.
- The node stays there and the task stays in progress; adding_replicas keeps showing [1].
- You can also see that Broker 1 is alive,
- but it has not joined the ISR.
- We judge that the sync rate is too low, that the TargetBroker's network may be poor, or that it is already under heavy load, and decide to change the TargetBroker.
- Delete the node /admin/reassign_partitions directly, then resubmit the reassignment task, this time to [0,2]:
{"version":1,"partitions":[{"topic":"test","partition":0,"replicas":[0,2]}]}
You can see that the new assignment has been written to ZK, but the AR and the adding/removing replica lists in the Topic node do not change.
This is because the Controller, although it was notified of the new data in /admin/reassign_partitions, still holds the previous reassignment task in memory; during validation it decides that the previous task has not finished properly, so it does not proceed with the new one.
- Re-elect the Controller so that /admin/reassign_partitions is reloaded. As I analyzed in [[Kafka source] Controller startup and election process source analysis](), a re-elected Controller reloads the /admin/reassign_partitions node and resumes executing the task. After the switch, the data changes as expected.
<font color=red > </font> </font> </font> </font> </font
There is, of course, a simpler way: in the Didi LogiKM one-stop Kafka monitoring and control platform, shown below, designating a few idle Brokers as candidate Controllers and switching over immediately is a wise choice.
The solution
- If the data volume is too large because a lot of stale data was never cleaned up before reassigning, the answer is simply to reassign again. Only one task can run at a time, though, so you have to force-delete the node /admin/reassign_partitions and then resubmit.
Note that when you resubmit, be sure to set a temporary retention time to reduce the amount of data to migrate, and also make the Controller switch.
- To sum up:
①. Delete the node /admin/reassign_partitions
②. Resubmit the reassignment task
③. Re-elect the Controller
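A sketch of the three steps, assuming that forcing the re-election by deleting the /controller znode is acceptable in your cluster (it is an ephemeral node, recreated by whichever Broker wins the new election):

# ① force-delete the stuck task node (ZooKeeper shell)
sh bin/zookeeper-shell.sh xxxx:2181/kafka3
delete /admin/reassign_partitions

# ② from a normal shell, resubmit the reassignment with a temporary throttle
sh bin/kafka-reassign-partitions.sh --zookeeper xxxx:2181/kafka3 --reassignment-json-file config/reassignment-json-file.json --execute --throttle 100000000

# ③ back in the ZooKeeper shell, force a Controller re-election so the new task is loaded
delete /controller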
Troubleshooting tools + thoughts
Even after analyzing all the problems above, investigating one by hand is still quite troublesome: check this node, look at that metric, and so on.
Since the troubleshooting steps are clear, turning them into a visual, automated tool is not difficult.
So I plan to raise an ISSUE on the Didi LogiKM one-stop Kafka monitoring and control platform to implement exactly such a feature. I will complete it when I find some free time; if you are interested, you are welcome to build it together!
Real Case Analysis
On Friday, just as I was about to leave work, a fellow engineer asked me the question below, and I replied.
Later, to analyze the specifics, we pulled together a small group to look for clues.
(To join the group, add me on WeChat: jjdlmn_)
His partition reassignment had been in progress for a very long time. After searching Baidu he was told to delete the reassignment task node in ZK, and he deleted the node right away; only then did we find that one of the TargetBrokers had gone down. Once it was restarted, the reassignment task carried on, which means the TargetBroker could then complete its replica assignment normally.
Problem analysis
This is exactly case 2 analyzed above: the TargetBroker went down during the migration, leaving the task stuck in progress.
The task could not complete because the TargetBroker was down; restarting the TargetBroker is all that is needed.
Deleting the node /admin/reassign_partitions directly, as they did, is not a big problem in itself; the catch is that the Controller still holds the previous task in memory, so the next reassignment you submit will not be executed. If you re-elect the Controller, though, everything simply carries on.
In this case they deleted the node and then submitted the next assignment, but because the TargetBroker was restarted, the original task ran to completion. So even without switching the Controller, the next reassignment is unaffected (as far as the Controller is concerned, the task it was notified of earlier finished successfully).
If you run into other exceptions not covered here, or have other questions about Kafka, ES, Agent, and so on, feel free to contact me and I will add them to this article.
Come give a Star and help build the Didi LogiKM one-stop Kafka monitoring and control platform!