Consumer API clarification

Before Kafka 0.9.0, the Consumer had two sets of APIs: the high-level Consumer API and the low-level Consumer API

However, Kafka 0.9.0 introduced a new Consumer API, and many things have changed since that release. When learning, be careful to identify and avoid outdated, obsolete material

The link above is an example: although its summary is very good, it describes the old APIs. Its publication date of August 2019 makes it look like fresh knowledge, and I was indeed misled at first. Fortunately, I paid attention to version-related issues while learning, so I soon discovered that its content was outdated

Consumer

A Consumer is the consuming side of Kafka. It consumes messages from a Topic, more specifically from a partition of that Topic, and more specifically still from the leader replica of that partition

A Topic consists of multiple partitions, and a partition consists of one leader replica and multiple follower replicas

The Consumer works by issuing FETCH requests to the broker hosting the partition it wants to consume. In each request, the Consumer specifies the offset in the log from which it wants to consume, and it receives back a chunk of the log starting at that position. The Consumer therefore has substantial control over its position and can rewind and re-consume messages if needed
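The offset-driven fetch above can be sketched with a toy in-memory log. This is an illustrative model only, not the real Kafka client API; the class and method names are made up:

```python
# Toy model of Kafka's pull-based consumption: the "log" is an append-only
# list, and a fetch simply returns records starting at a consumer-chosen
# offset. The consumer, not the server, decides where to read from.

class PartitionLog:
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def fetch(self, offset, max_records=100):
        """Return up to max_records records starting at `offset`."""
        return self._records[offset:offset + max_records]

log = PartitionLog()
for msg in ["a", "b", "c", "d"]:
    log.append(msg)

# The consumer controls its own position ...
print(log.fetch(offset=2))                 # ['c', 'd']
# ... and can rewind and re-consume earlier records at any time.
print(log.fetch(offset=0, max_records=2))  # ['a', 'b']
```

Because the broker only serves a position in the log, re-consumption is just fetching from a smaller offset again.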

Push or pull

Generally, messages in message-oriented middleware can be consumed in one of two ways

  • Push: the server proactively pushes data to the consumer
  • Pull: the consumer pulls messages from the server

Both methods have advantages and disadvantages. The biggest problem with the Push method is that the server does not know how fast the consumer can process messages, so it may push data faster than the consumer can handle, overwhelming the consumer side.

The Pull approach solves this: the consumer pulls data according to its own state, and the server simply serves whatever data it has. The disadvantage is that the consumer may end up polling continuously when the server has no data available

The Kafka Consumer uses the Pull approach; by default, messages are pulled every 100 ms. To solve the problem of constant polling when the server has no data, Kafka allows a Consumer request to block in a "long poll", waiting until data arrives (and optionally until a given number of bytes is available, to ensure a minimum transfer size).
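The long-poll idea can be sketched as follows. This is an illustrative simulation, not the real broker or client code; the `Broker` class and its method names are invented for the sketch:

```python
import time
from collections import deque

# Sketch of "long polling": a fetch request blocks on the broker until at
# least min_records are available or max_wait seconds elapse, instead of
# the consumer busy-polling an empty partition.

class Broker:
    def __init__(self):
        self.buffer = deque()

    def produce(self, record):
        self.buffer.append(record)

    def long_poll(self, min_records=1, max_wait=0.2, interval=0.01):
        deadline = time.monotonic() + max_wait
        # The wait happens on the broker side, holding the request open,
        # so the consumer is not spinning in a tight request loop.
        while len(self.buffer) < min_records and time.monotonic() < deadline:
            time.sleep(interval)
        batch = list(self.buffer)
        self.buffer.clear()
        return batch

broker = Broker()
print(broker.long_poll(max_wait=0.05))  # [] after waiting ~50 ms

broker.produce("order-1")
print(broker.long_poll())  # ['order-1'], returned without the full wait
```

In the real client, this behavior is controlled by the fetch.min.bytes and fetch.max.wait.ms consumer settings.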

Consumer Group

What is a Consumer Group

A Consumer Group is a group of Consumers. Starting with Kafka 2.2.1, a Consumer must explicitly specify a Consumer Group ID; otherwise it cannot subscribe to a Topic or commit offsets

Why do we need Consumer Groups

There are two reasons:

  1. Organizing multiple consumers to concurrently consume the same Topic, improving consumption throughput
  2. Implementing message broadcast

Concurrent consumption

To start with, we know that a Topic in Kafka consists of multiple partitions. If there is only one Consumer, that Consumer consumes all of the Topic's partitions.

What if messages are produced so fast that one Consumer cannot keep up? Messages will pile up in the queue, and if the backlog is not drained for a long time, the queue may be overwhelmed. If one Consumer cannot consume fast enough, we simply use several Consumers to consume the Topic together.

This is also the answer to the interview question of what to do when messages pile up: add consumers

For example, if Topic 1 has three partitions, we can start three Consumers to consume it at the same time, with each Consumer consuming exactly one partition. This form of organization, in which multiple Consumers jointly consume a Topic, is called a Consumer Group

Note that more Consumers is not always better. Within a Consumer Group, if the number of Consumers exceeds the number of partitions of the Topic, the extra Consumers will be idle and unable to consume any data

Kafka greatly improves Consumer client performance through the Consumer Group’s concurrent consumption design
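A quick sketch makes the idle-consumer effect concrete. The helper below is hypothetical, not the real client API; it simply models the rule that each partition is consumed by exactly one consumer in the group:

```python
# Partitions are the unit of parallelism: each partition is assigned to
# exactly one consumer in the group, so partitions cap the useful group size.

def assign_round_robin(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = ["topic1-0", "topic1-1", "topic1-2"]

# Three consumers, three partitions: one partition each.
print(assign_round_robin(partitions, ["C0", "C1", "C2"]))

# Four consumers, three partitions: C3 receives nothing and sits idle.
print(assign_round_robin(partitions, ["C0", "C1", "C2", "C3"]))
```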

Message broadcast

Sometimes a message from a Topic needs to be consumed by more than one service. For example, for an order Topic, an advertising service may need to analyze users' preferences to achieve accurate targeting, while an anti-fraud service may need to determine whether each order is fraudulent

Of course, the above requirements could be met by putting all of that logic into a single consumer process, but Kafka provides a much more elegant mechanism: message broadcast. When a message arrives, Kafka delivers it to every Consumer Group that subscribes to the Topic

We can define multiple Consumer Groups. Consumer Groups are isolated from each other and do not affect one another. Each Consumer Group can consume the complete Topic, and consumption progress is tracked per group: one Consumer Group may consume faster while another consumes more slowly
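The per-group isolation can be modeled with a toy topic that tracks one offset per group. All names here are illustrative, not the real Kafka API:

```python
# Toy model of broadcast across groups: every Consumer Group subscribed to
# a topic sees the full message stream, and each group's committed offset
# advances independently of the others.

class Topic:
    def __init__(self):
        self.log = []
        self.group_offsets = {}   # group id -> next offset to read

    def publish(self, record):
        self.log.append(record)

    def poll(self, group_id):
        offset = self.group_offsets.get(group_id, 0)
        batch = self.log[offset:]
        self.group_offsets[group_id] = len(self.log)  # "commit" progress
        return batch

orders = Topic()
orders.publish("order-1")

# Both groups receive the same message, independently.
print(orders.poll("ads-service"))         # ['order-1']
print(orders.poll("anti-fraud-service"))  # ['order-1']

# A slower group still sees everything it has not yet consumed.
orders.publish("order-2")
orders.publish("order-3")
print(orders.poll("ads-service"))         # ['order-2', 'order-3']
```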

Consumer Group summary

  • Starting with Kafka 2.2.1, each Consumer must specify a Consumer Group ID
  • Consumer Groups are isolated from each other, and each Consumer Group can consume the complete Topic
  • Within a Consumer Group, the number of Consumers for a Topic should not exceed the number of partitions of that Topic, to avoid wasting resources

Partition allocation policy between Consumer and Topic

From the above, we know that a Topic is consumed by all Consumers in a Consumer Group. A question then arises: suppose a Consumer Group has three Consumers and a Topic has seven partitions. How are the seven partitions divided among the three Consumers?

The Kafka Consumer client offers three allocation strategies to address this issue:

  • RangeAssignor (the default)
  • RoundRobinAssignor
  • StickyAssignor

RangeAssignor Assignment policy

The RangeAssignor strategy divides the total number of partitions by the total number of consumers to obtain a span, and then allocates contiguous ranges of partitions according to that span, so that partitions are distributed as evenly as possible across all consumers.

Suppose a Topic has seven partitions. With 1, 2, and 3 consumers, the distribution is as follows:

  • 1 consumer: C0 gets partitions 0-6
  • 2 consumers: C0 gets partitions 0-3, C1 gets partitions 4-6
  • 3 consumers: C0 gets partitions 0-2, C1 gets partitions 3-4, C2 gets partitions 5-6

When the number of consumers exceeds the number of partitions, some consumers will have no partition allocated to them
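The range calculation above can be sketched for a single topic. This is a simplified re-implementation of the idea, not the actual Kafka RangeAssignor class:

```python
# Simplified RangeAssignor for one topic: with n partitions and k consumers,
# each consumer gets n // k partitions as a contiguous range, and the first
# n % k consumers (in sorted order) each take one extra partition.

def range_assign(num_partitions, consumers):
    consumers = sorted(consumers)
    base, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for i, c in enumerate(consumers):
        size = base + (1 if i < extra else 0)
        assignment[c] = list(range(start, start + size))
        start += size
    return assignment

# Seven partitions split among three consumers:
print(range_assign(7, ["C0", "C1", "C2"]))
# {'C0': [0, 1, 2], 'C1': [3, 4], 'C2': [5, 6]}
```

Note that the consumer sorted first always absorbs the remainder; across many topics this bias can accumulate, which is one motivation for the other strategies.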

RoundRobinAssignor Assignment policy

The RoundRobinAssignor policy sorts all consumers in the consumer group, and all partitions of the topics they subscribe to, in lexicographic order, and then assigns the partitions to the consumers one by one in a round-robin fashion.

The Kafka client enables this strategy by setting the partition.assignment.strategy parameter to org.apache.kafka.clients.consumer.RoundRobinAssignor

If all consumers in a consumer group subscribe to the same topics, the RoundRobinAssignor policy produces a uniform partition allocation.

For example, if a consumer group has two consumers, C0 and C1, both subscribed to topics t0 and t1, each with three partitions, the allocation is as follows:

  • C0: t0-0, t0-2, t1-1
  • C1: t0-1, t1-0, t1-2

However, if the consumers in the same consumer group subscribe to different topics, the allocation is no longer a pure round robin, which may lead to an uneven partition distribution

For example, suppose there are three consumers C0, C1, and C2 in the consumer group, subscribing to a total of three topics t0, t1, and t2, which have one, two, and three partitions, respectively. Consumer C0 subscribes to topic t0, consumer C1 subscribes to topics t0 and t1, and consumer C2 subscribes to topics t0, t1, and t2. The final distribution is:

  • C0: t0-0
  • C1: t1-0
  • C2: t1-1, t2-0, t2-1, t2-2
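This round-robin-with-skipping behavior can be sketched as follows. It is an illustrative re-implementation of the idea, not Kafka's actual RoundRobinAssignor code, and it assumes every topic has at least one subscriber:

```python
from itertools import cycle

# Sketch of the RoundRobinAssignor idea: sort all partitions
# lexicographically, walk the sorted consumers in a circle, and hand each
# partition to the next consumer that is subscribed to its topic.

def round_robin_assign(topic_partitions, subscriptions):
    """topic_partitions: {topic: partition count};
    subscriptions: {consumer: set of subscribed topics}."""
    partitions = sorted((t, p) for t, n in topic_partitions.items()
                        for p in range(n))
    consumers = cycle(sorted(subscriptions))
    assignment = {c: [] for c in subscriptions}
    for topic, part in partitions:
        c = next(consumers)
        while topic not in subscriptions[c]:   # skip non-subscribers
            c = next(consumers)
        assignment[c].append(f"{topic}-{part}")
    return assignment

# The uneven-subscription example from the text:
print(round_robin_assign(
    {"t0": 1, "t1": 2, "t2": 3},
    {"C0": {"t0"}, "C1": {"t0", "t1"}, "C2": {"t0", "t1", "t2"}},
))
# {'C0': ['t0-0'], 'C1': ['t1-0'], 'C2': ['t1-1', 't2-0', 't2-1', 't2-2']}
```

C2 ends up with four partitions even though C1 could also have taken t1-1, which is exactly the unevenness the StickyAssignor later improves on.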

StickyAssignor Assignment policy

Kafka introduced the StickyAssignor in version 0.11.x. It has two main goals:

  1. Partition distribution should be as even as possible
  2. Partition allocation should stay as close as possible to the previous allocation

When the two goals conflict, the first takes precedence

In the scenario above, the StickyAssignor allocation strategy produces the following result:

  • C0: t0-0
  • C1: t1-0, t1-1
  • C2: t2-0, t2-1, t2-2

Rebalance

The partition allocation described above is a kind of balance. Sometimes, when Consumers join or leave a Group, that balance is broken and the partitions must be reallocated, which is why the process is called rebalancing

All Consumers in the Consumer Group must stop consuming during a Rebalance and cannot resume until the partition reallocation is complete. No Consumer can consume data while a Rebalance is in progress, so in actual production you want to avoid rebalancing as much as possible.

The timing of the Rebalance

There are two main categories:

  • The Topic's partitions change

    • The number of partitions of the Topic increases
  • The number of Consumers in the Consumer Group changes

    • A Consumer exits the Consumer Group, for example because it crashed
    • A new Consumer joins the Consumer Group

The best approach is to avoid Rebalances as much as possible. As users, we have no control over the server side, so we have to work on the client side, for example by fixing the number of Consumers in a Consumer Group
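On the client side, rebalance behavior is governed mostly by a few standard consumer settings. The following is a sketch with illustrative values only (defaults vary by Kafka version, and `consumer-instance-1` is a made-up id):

```properties
# How long the broker waits without heartbeats before declaring the
# consumer dead and triggering a rebalance
session.timeout.ms=45000
heartbeat.interval.ms=3000

# Maximum allowed time between poll() calls; processing that takes
# longer than this also evicts the consumer from the group
max.poll.interval.ms=300000

# Static membership (Kafka 2.3+): a restart with the same instance id
# rejoins without immediately triggering a rebalance
group.instance.id=consumer-instance-1
```

Raising the timeouts tolerates slower consumers at the cost of detecting real failures later, so these values should be tuned against actual processing times.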