This is the fourth day of my participation in the August Text Challenge. More challenges in August.

Rebalance (rebalancing) is the process by which all Consumer instances in a Consumer Group reach agreement on how to divide up all partitions of the subscribed topics. During a Rebalance, all Consumer instances, with the help of the Coordinator component, jointly allocate the subscribed topic partitions. However, no instance can consume any messages while the process runs, so Rebalance has a significant impact on Consumer TPS.

The downsides of rebalancing

  1. Rebalance hurts Consumer TPS. During a Rebalance, every Consumer stops working and does nothing.
  2. Rebalance is slow. If your Group has a large number of members, the process can take a long time.
  3. Rebalance is inefficient. Kafka's current design requires every member of the Group to participate in each Rebalance. Locality is not considered, even though it is particularly important for system performance.

Unfortunately, there is no way to completely avoid or eliminate Rebalance, but we can avoid unnecessary Rebalances and reduce how often they occur.

How to avoid rebalancing

When rebalancing occurs

  1. The number of group members has changed
  2. The number of subscribed topics has changed
  3. The number of partitions subscribed to the topic changed

Because the latter two are usually deliberate operational actions, most of the Rebalances they cause are unavoidable (and they rarely happen in production). We will focus on how to avoid Rebalances caused by changes in the number of group members.

Group number changes rebalance

First, let's look at the Coordinator.

In Kafka, the Coordinator is the component that performs Rebalance for a Consumer Group and provides offset management and group membership management for the Group. A Coordinator component is created and started on every Broker at startup. A Consumer Group determines which Broker hosts its Coordinator via Kafka's internal offsets topic __consumer_offsets.

Kafka’s algorithm for determining the Broker where a Coordinator resides for a Consumer Group has two steps.

  1. Determine which partition of the offsets topic stores the Group's data: partitionId = Math.abs(groupId.hashCode() % offsetsTopicPartitionCount).
  2. Find the Broker hosting the Leader replica of that partition. That Broker is the Coordinator.
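Step 1 above can be sketched in plain Java. The group id and the partition count here are example values of my own choosing (the default for offsets.topic.num.partitions is 50), not figures from this article:

```java
// Minimal sketch of how a Consumer Group maps to a partition of
// __consumer_offsets. The Broker holding that partition's Leader
// replica is the Group's Coordinator.
public class CoordinatorPartition {
    static int partitionFor(String groupId, int offsetsTopicPartitionCount) {
        // Same formula as in the text: hash the group id, take the
        // remainder, and make it non-negative.
        return Math.abs(groupId.hashCode() % offsetsTopicPartitionCount);
    }

    public static void main(String[] args) {
        // "my-consumer-group" is a hypothetical group id.
        int p = partitionFor("my-consumer-group", 50);
        System.out.println("__consumer_offsets partition: " + p);
    }
}
```

Because the mapping is a pure function of the group id, every Broker can compute the same answer independently, which is how clients locate their Coordinator without extra coordination.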
  1. Adding a Consumer instance

When we start a Consumer program configured with the same group.id value, we are actually adding a new Consumer instance to the group. The Coordinator accepts the new instance, adds it to the group, and reassigns partitions. Adding a Consumer instance is usually planned, perhaps for higher TPS or better scalability, so it is not one of the unnecessary Rebalances we want to avoid.

  2. Reducing Consumer instances

Deliberately stopping some Consumer instances also triggers a Rebalance, of course. The real issue is that in some cases the Coordinator may mistakenly believe a Consumer instance has "stopped" and kick it out of the Group. If this is what causes the Rebalance, we need to take steps to avoid it.

Abnormal withdrawal of Consumer

  1. Failed to send heartbeat in time, causing the Consumer to be “kicked out” of the Group

    Each Consumer instance periodically sends heartbeat requests to the Coordinator to show that it is still alive. If a Consumer instance fails to send heartbeat requests in a timely way, the Coordinator considers that Consumer "dead", removes it from the Group, and then starts a new round of Rebalance.

    • The Consumer side has a parameter called session.timeout.ms, whose default value is 10 seconds. If the Coordinator does not receive a heartbeat from a Consumer instance in the Group within 10 seconds, it considers that Consumer instance dead.
    • In addition, the Consumer provides a parameter that controls how often heartbeat requests are sent: heartbeat.interval.ms. The smaller this value, the more frequently the Consumer instance sends heartbeat requests. Frequent heartbeats also let the Coordinator tell each Consumer instance sooner that a Rebalance is underway, because the Coordinator signals this by encapsulating a REBALANCE_NEEDED flag in the response body of the heartbeat request.

We can, for example:

  • Set session.timeout.ms to 6s.
  • Set heartbeat.interval.ms to 2s.

This ensures that a Consumer instance can send at least 3 rounds of heartbeat requests before it is judged "dead".
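A minimal sketch of these settings, using plain string keys so it stays self-contained (in a real application these Properties would be passed to a KafkaConsumer; the exact millisecond values are just the recommendation above):

```java
import java.util.Properties;

public class HeartbeatConfig {
    static Properties consumerProps() {
        Properties props = new Properties();
        // Judge a consumer dead only after 6s without any heartbeat.
        props.put("session.timeout.ms", "6000");
        // Send a heartbeat every 2s, so at least 3 heartbeats fit
        // into one session-timeout window (6000 / 2000 = 3).
        props.put("heartbeat.interval.ms", "2000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps());
    }
}
```

The design intent: session.timeout.ms should be a small multiple of heartbeat.interval.ms, so that one lost or delayed heartbeat does not immediately get the instance kicked out of the Group.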

  2. The Consumer takes too long to consume

    The Consumer side has a parameter that controls how the Consumer's actual processing time affects Rebalance: max.poll.interval.ms. It limits the maximum interval between two calls to the poll method by the Consumer application. The default value is 5 minutes: if your Consumer cannot finish processing the polled messages within 5 minutes, it will initiate a "leave group" request, and the Coordinator will start a new round of Rebalance.

If processing our business messages genuinely takes a long time, we can increase this parameter appropriately. If processing is not slow, yet offsets are committed only after a long delay, we should check whether the process is suffering frequent Full GCs; if so, the JVM needs tuning. For JVM tuning solutions, see the nanny-level JVM tuning exercise.
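A hypothetical back-of-the-envelope sketch (the method name and the numbers are my own, not from this article): given the average time to process one record, how many records can a single poll batch contain without exceeding max.poll.interval.ms?

```java
public class PollSizing {
    // How many records fit into one poll interval if each record
    // takes avgMsPerRecord milliseconds to process.
    static long maxRecordsPerPoll(long maxPollIntervalMs, long avgMsPerRecord) {
        return maxPollIntervalMs / avgMsPerRecord;
    }

    public static void main(String[] args) {
        long maxPollIntervalMs = 300_000; // Kafka default: 5 minutes
        long avgMsPerRecord = 200;        // assumed per-record processing cost
        // If max.poll.records exceeds this bound, the consumer risks
        // missing the poll deadline and triggering a Rebalance.
        System.out.println(maxRecordsPerPoll(maxPollIntervalMs, avgMsPerRecord));
    }
}
```

In other words, tune either side of the inequality: raise max.poll.interval.ms, lower max.poll.records, or make per-record processing faster.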

Conclusion

  1. Rebalancing affects the throughput of our program
  2. Identify the causes of rebalancing so as to avoid abnormal Rebalances on the Consumer side
  3. Avoid unnecessary rebalancing by setting parameters on the Consumer side or by tuning the JVM:
  • session.timeout.ms
  • heartbeat.interval.ms
  • max.poll.interval.ms
  • The GC parameter