A Kafka troubleshooting exercise that illustrates the architectural thinking an architect needs

This is the fourth article in the Kafka series. Starting from a real problem, it walks through hands-on cluster partition migration, the principles underneath it, and the operational issues that need to be considered.


1. Problem description

Kafka IOException (Too many open files)

The problem occurred in our company's development environment. To avoid leaking sensitive information, I reproduced it locally, which does not affect the analysis or the lessons drawn from it.
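Before analyzing the cluster layout, it helps to confirm the symptom on the affected broker. A rough sketch of the checks (how the broker process is located here is an assumption about the environment, not from the original article):

ulimit -n                                    # per-process open-file limit for the current user/shell
lsof -p "$(pgrep -f kafka.Kafka)" | wc -l    # approximate number of handles held by the broker process

If the second number is close to the first, the "Too many open files" errors follow naturally.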

2. Problem analysis

(Screenshot: the kafka-manager topic overview page)

In kafka-manager, the topic list displays the following statistics for each topic:

  • Topic: the topic name.

  • Partitions: the number of partitions.

  • Brokers: the number of brokers hosting this topic's queues (partitions).

    Brokers Spread %: how widely the topic's queues are spread across the brokers in the cluster. For example, if the cluster has five brokers but the topic's queues are created on only four of them, the spread is 80%.

    Brokers Skew %: the skew rate of the topic's queues across brokers. Suppose a cluster has five broker nodes and a topic has 4 partitions with a replication factor of 2, i.e. 8 queues in total, but those queues are distributed over only four brokers; that topic's Brokers Spread is then 80%.

    As we all know, the point of introducing multiple nodes is load balancing, so the queues are naturally expected to be spread across brokers as evenly as possible: with 8 queues (replication factor 2) over four brokers, each broker is expected to hold 2 queues, which would mean no skew. The skew rate is calculated as the number of brokers holding more than the average number of queues, divided by the total number of brokers hosting the topic; for instance, one overloaded broker out of three gives a Brokers Skew of 1/3 ≈ 33%.

    Brokers Leader Skew %: the skew rate of the topic's Leader partitions. In Kafka only a partition's Leader replica serves reads and writes, so what really determines read/write performance is whether the Leader partitions are balanced. Imagine a topic with six partitions whose Leaders all sit on only one or two broker nodes: that topic's read and write throughput will be constrained. It is recommended to keep this value at 0%.

    Replicas: the number of replicas stored for each partition, including the Leader replica.

    Under Replicated %: in Kafka's replication model, the Leader replica handles reads and writes while the other replicas in the replication group synchronize data from it. A replica that cannot keep up with the Leader's progress is removed from the ISR, and replicas removed from the ISR are not eligible to be elected Leader. If this value stays above 0 for a long time, or rises frequently, the cluster almost certainly has a problem.

    Producer Message/Sec: the topic's real-time TPS, collected via JMX. For this column to be populated, JMX polling must be enabled in kafka-manager and each broker must expose a JMX port; a sketch of the broker-side setup follows this list.

  • Summed Recent Offsets: the sum of the current maximum message offsets across the topic's partitions.
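As referenced above, the TPS statistics depend on JMX. A minimal sketch of exposing JMX on a broker so kafka-manager can poll it (the port number and paths are illustrative assumptions, not taken from the original environment):

# Expose a JMX port before starting the broker; 9999 is an arbitrary example value.
export JMX_PORT=9999
bin/kafka-server-start.sh -daemon config/server.properties

On the kafka-manager side, the corresponding switch is the "Enable JMX Polling" checkbox in the cluster configuration.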

Browsing the topic list, I found that a large number of topics in the development environment have only one queue, all located on the first node. The screenshot is as follows:

(Screenshot: the topic list, showing many single-partition topics placed on the first broker)

The Brokers Spread metric on this page shows that broker utilization is only 1/3. Picking several topics with large data volumes and checking their routing information confirmed that they are all placed on the first broker node, which is what caused that node to report the flood of "Too many open files" errors mentioned at the beginning of this article.

3. Solutions

3.1 Expanding Partitions

The problem has been located: a large number of topics were created with only one queue, and those queues are all clustered on the first node, leaving broker utilization severely unbalanced.

For this situation, the first solution that comes to mind is to expand the number of partitions.

3.1.1 Via kafka-manager

Step1: in the kafka-manager topic list, click the target topic to enter its details page, then click [Add Partitions], as shown below:

Step2: click Add Partitions, and the following dialog box pops up:

The fields are as follows:

  • Partitions: the total number of partitions after the expansion, not the number of partitions to add this time.

  • Brokers: the brokers over which the partitions will be distributed. You are advised to select all brokers so that the performance of the entire cluster is fully utilized.

3.1.2 O&M Commands

Topic partition counts can also be modified with the kafka-topics.sh command that ships with Kafka, for example:
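A minimal sketch using the topic from this article, assuming a local ZooKeeper; on newer Kafka versions (2.2+), replace --zookeeper localhost:2181 with --bootstrap-server localhost:9092:

# Grow the topic to 3 partitions; note that the partition count can only be increased, never decreased.
bin/kafka-topics.sh --zookeeper localhost:2181 \
  --alter --topic dw_test_kafka_040802 --partitions 3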

Tips: it does not matter if you are not familiar with these O&M commands; they basically all provide --help.

3.2 Partition Migration

Since a large number of topics have only one partition, and those partitions all sit on the first node, could we instead move some topic partitions to the other nodes?

Here is how to perform a partition migration.

3.2.1 Via kafka-manager

Step1: enter the topic details page and click [Generate Partition Assignments], as shown below:

Step2: on the page that opens, select the brokers you want to migrate to and, if needed, adjust the topic's replication factor, then click [Generate Partition Assignments], as shown below:

Step3: after clicking the button above, only the partition migration plan has been generated; nothing has actually been executed. To run the migration, you need to click the [Reassign Partitions] button.

3.2.2 O&M Commands

Step1: first, prepare the list of topics to migrate. For example, save the following content in a file named dw_test_kafka_040802-topics-to-move.json:

{"topics":
    [
        {"topic":"dw_test_kafka_040802"}
    ],
    "version": 1
}


Step2: run the kafka-reassign-partitions.sh script provided by Kafka to generate an execution plan.

The key parameters are as follows:

  • --broker-list: the brokers over which the partitions should be distributed. Multiple broker ids are written as a quoted, comma-separated list, for example "0,1,2".

  • --topics-to-move-json-file: the JSON file listing the topics to migrate.

  • --generate: only generate an execution plan; nothing is actually executed.

After the command runs successfully, it prints the current partition assignment along with the proposed new assignment. You should save the current assignment in a backup directory (in case a rollback is needed) and the proposed assignment in a file for the next step.
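Putting the parameters together, a minimal sketch of the generate step (the ZooKeeper address and the output file name are illustrative assumptions):

bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --topics-to-move-json-file dw_test_kafka_040802-topics-to-move.json \
  --broker-list "0,1,2" \
  --generate
# Save the printed "Proposed partition reassignment configuration" JSON
# to a file, e.g. new-assignment.json, for use in Step3.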

Step3: run the kafka-reassign-partitions.sh script provided by Kafka to perform the partition migration.

The key parameter is as follows:

  • --reassignment-json-file: the file containing the execution plan generated in the previous step.
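A sketch of the execute step, continuing the assumptions above (new-assignment.json is the hypothetical file saved from the generate step):

bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file new-assignment.json \
  --execute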

After the command is submitted, a message containing "Successfully" is displayed. However, repartitioning is a complex process, and the command returning does not mean the migration has completed.
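One way to check whether the migration has actually finished is the script's --verify option, sketched here with the same assumed file and address:

bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file new-assignment.json \
  --verify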

4. Advanced Topics and Architectural Thinking

Will migrating partitions with kafka-reassign-partitions.sh affect normal use by the business side? That is, does it affect message production and consumption?

As an architect, you must consider the scope of business impact whenever you make a change, especially a change to middleware, because that scope directly determines the complexity of the implementation plan.

To answer this, we need to dig into how partition migration is implemented. This article does not analyze the source code in detail; it only illustrates the mechanism of partition migration.

Scenario: partition p0 of TopicA currently has replicas on brokers 1, 2, and 3, and needs to be migrated to brokers 4, 5, and 6.
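Expressed as a reassignment plan file of the kind used in section 3.2.2, this scenario would look roughly as follows (TopicA and the broker ids come from the example above; the layout is Kafka's standard reassignment JSON):

{
    "version": 1,
    "partitions": [
        {"topic": "TopicA", "partition": 0, "replicas": [4, 5, 6]}
    ]
}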

Before introducing the migration process, let’s define three variables:

  • OAR: the partition's replica distribution before the migration (Original Assigned Replicas).

  • RAR: the replica distribution after the migration (Reassigned Replicas).

  • AR: the replica distribution at the current point while the migration is in progress (Assigned Replicas).

Combining this with the example above, the whole migration proceeds through the following steps:

AR              Leader  ISR             Description
{1,2,3}         1       {1,2,3}         Initial state before the migration.
{1,2,3,4,5,6}   1       {1,2,3}         Replicas are first created on the brokers in RAR (the migration targets) and begin synchronizing data from the Leader.
{1,2,3,4,5,6}   1       {1,2,3,4,5,6}   The newly created replicas catch up with the Leader and join the ISR.
{1,2,3,4,5,6}   4       {1,2,3,4,5,6}   Because the current Leader is not in RAR, an election is initiated to move the Leader to one of the brokers in RAR.
{1,2,3,4,5,6}   4       {4,5,6}         The replicas in OAR are set to OfflineReplica, which removes them from the ISR.
{4,5,6}         4       {4,5,6}         The offline replicas are deleted, completing the migration.

Message production and consumption are affected only during the Leader election, and ZooKeeper-coordinated Leader elections complete within seconds. Combined with the buffer queue and retry mechanism on the Kafka producer side, the process should in theory have no impact on the business.

Well, that is all for this article. A follow, like, or comment is the biggest encouragement to me; you are also welcome to add the author on WeChat (DINGwPMz) to exchange ideas and discuss.

Finally, I would like to share a RocketMQ ebook with you, which distills operational experience from message platforms handling billions of messages.

How to obtain it: follow the official account "Middleware Interest Circle" and reply RMQPDF.

Middleware interest circle

The account is maintained by the author of "RocketMQ Technology Insider" and focuses on systematic analysis of the architecture and design principles of mainstream Java middleware, helping you build a complete Internet-scale distributed architecture knowledge system and break through career bottlenecks.

About the author

Here’s some advice from a 10-year IT veteran for new employees

"I" was flattered by Alibaba

How can programmers increase influence

How to read source code efficiently

Another way for me to get involved in the RocketMQ open source community