
1. Producer data is not lost

Sending mode

Producers send data to Kafka either synchronously or asynchronously.

Synchronous mode:

After sending a batch of data to Kafka, wait for Kafka to return the result:

  1. The producer waits 10 seconds; if the broker does not respond with an ack, the send is considered a failure.
  2. The producer retries three times; if there is still no response, an error is reported.
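The retry policy above can be sketched as a toy model (not real client code; `send_fn` is a hypothetical stand-in for the blocking broker call, assumed to wait up to 10 seconds for an ack):

```python
def send_sync(send_fn, retries=3):
    """Try the blocking send up to `retries` times; each call to
    `send_fn` waits for the broker's ack and returns True (acked)
    or False (timed out)."""
    for attempt in range(1, retries + 1):
        if send_fn():
            return attempt          # number of the attempt that succeeded
    raise RuntimeError("no ack after %d attempts" % retries)

# A broker stub that times out twice, then acks on the third try:
responses = iter([False, False, True])
print(send_sync(lambda: next(responses)))  # 3
```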

Asynchronous mode:

Send a batch of data to Kafka, providing only a callback function:

  1. Data is first stored in a buffer on the producer side; the buffer holds up to 20,000 records.
  2. A batch is sent once either the size threshold or the record-count threshold is met.
  3. Each batch sent contains up to 500 records.

Note: If the broker is slow to ack and the buffer fills up, the developer can configure whether the buffer should be emptied directly (discarding the buffered data).
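A toy model of this buffering behavior, using the record counts from the text (a real Kafka producer batches by bytes and time via `batch.size` and `linger.ms`; this class is purely illustrative):

```python
class AsyncBuffer:
    """Toy model of the producer-side buffer: accumulate records and
    flush a batch once the batch-size threshold is reached. The 20,000
    capacity and 500 batch size are the figures from the text."""
    def __init__(self, capacity=20000, batch_size=500, drop_when_full=False):
        self.capacity = capacity
        self.batch_size = batch_size
        self.drop_when_full = drop_when_full
        self.records = []       # pending records
        self.sent_batches = []  # batches handed to the "broker"

    def append(self, record):
        if len(self.records) >= self.capacity:
            if self.drop_when_full:
                self.records.clear()   # configured to empty the buffer
            else:
                raise BufferError("buffer full; broker is acking too slowly")
        self.records.append(record)
        if len(self.records) >= self.batch_size:
            self.flush()

    def flush(self):
        batch, self.records = (self.records[:self.batch_size],
                               self.records[self.batch_size:])
        self.sent_batches.append(batch)

buf = AsyncBuffer()
for i in range(1200):
    buf.append(i)
print(len(buf.sent_batches), len(buf.records))  # 2 200
```

Two full batches of 500 go out; the remaining 200 records wait for the next threshold to be hit.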

Ack mechanism (acknowledgement mechanism)

When a producer sends data, the server returns an acknowledgement code, i.e. an ack response code. The ack has three possible values: 0, 1, and -1.

0: the producer just sends data and does not care whether it arrives; lost data must be resent by the application.

1: the leader of the partition has received the data and responds with status code 1, regardless of whether followers have synchronized it.

-1: all in-sync replicas (followers) have received the data, and the response status code is -1.

If the broker never returns an ack, the producer cannot know whether the send succeeded. The producer can set a timeout (for example 10 seconds), after which the send is treated as a failure.

2. Data is not lost in the broker

The broker prevents data loss mainly through the replication factor (redundant copies of each partition).

3. Consumer data is not lost

When consuming data, as long as each consumer records its offset, data will not be lost. That is, the consumer maintains its own offset, which can be saved in Redis, for example.
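A minimal sketch of manual offset management, using a plain dict in place of Redis (the key naming and `handle` function are hypothetical; a real setup would call a Redis client's `get`/`set` with the same logic):

```python
# redis_store is a plain dict standing in for Redis.
redis_store = {}
processed = []

def handle(msg):
    """Hypothetical message processing step."""
    processed.append(msg)

def consume(partition, messages):
    """Process messages and commit the offset only AFTER each message
    is handled, so a restart resumes from the last committed offset."""
    key = f"offset:{partition}"
    start = redis_store.get(key, 0)
    for offset in range(start, len(messages)):
        handle(messages[offset])        # process first...
        redis_store[key] = offset + 1   # ...then record the offset

msgs = ["a", "b", "c", "d"]
consume("topic-0", msgs)
consume("topic-0", msgs)   # second call is a no-op: offset already at 4
print(redis_store["offset:topic-0"], processed)  # 4 ['a', 'b', 'c', 'd']
```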

1. CAP theory in distributed systems

Distributed systems are becoming more and more important. Large websites are almost all distributed.

The biggest difficulty of distributed system is how to synchronize the state of each node.

To solve the problem of state synchronization between nodes, in 1998 Eric Brewer, a computer scientist at the University of California, Berkeley, proposed three properties of distributed systems:

  • Consistency

  • Availability

  • Partition tolerance

Eric Brewer pointed out that it is impossible to achieve all three at once: at most two of these properties can be satisfied at the same time. This conclusion is known as the CAP theorem: in a distributed system, consistency, availability, and partition tolerance cannot all hold simultaneously.

Consistency

  • The result of a write operation through one node is visible to subsequent reads through other nodes
  • If, after the data is updated, subsequent reads see the update immediately even under concurrent access, this is called strong consistency
  • If subsequent reads are allowed to miss some or all of the update, this is called weak consistency
  • If the update becomes visible only after some (usually variable) period of time, this is called eventual consistency

Availability

  • Every non-failing node must return a reasonable (non-error) result within a bounded time

Partition tolerance

  • When some nodes crash or cannot communicate with other nodes, the distributed system as a whole can still keep functioning across the partitions

In practice, partition tolerance must be assumed, so under the CAP theorem the real trade-off is between availability and consistency.

2. Partition tolerance

Let’s start with Partition tolerance.

Most distributed systems are spread over multiple subnetworks, each of which is called a partition. Partition tolerance means that communication between partitions may fail. For example, if one server is in China and another is in the United States, they belong to two partitions and may be unable to communicate with each other.

In the figure above, G1 and G2 are two servers in different regions. When G1 sends a message to G2, G2 may never receive it. Systems must be designed with this possibility in mind.

In general, network partitions are unavoidable, so you can assume that the P in CAP always holds: the possibility of a partition is ever-present.

3. Consistency

Consistency means that any read operation that follows a write operation must return the written value. For example, suppose a record holds v0 and the user issues a write to G1 changing it to v1. A subsequent read from G1 will return v1; this is consistent. The problem is that the user may instead issue a read to G2, and since G2's value has not changed, v0 is returned. The results of reading G1 and G2 then disagree, which is inconsistency.

For G2's value to become v1 as well, G1 should send a message to G2 during the write, asking G2 to update to v1.

In that case, a user reading from G2 also gets v1.
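The write-through replication just described can be sketched as a toy model (pure illustration, not Kafka code; `Node`, `peers`, and the synchronous propagation loop are hypothetical):

```python
class Node:
    """A server holding one value; writes are propagated synchronously
    to all peers, so subsequent reads from any node agree."""
    def __init__(self, value="v0"):
        self.value = value
        self.peers = []

    def write(self, value):
        self.value = value
        for peer in self.peers:   # tell G2 to update during the write
            peer.value = value

    def read(self):
        return self.value

g1, g2 = Node(), Node()
g1.peers = [g2]
g1.write("v1")
print(g2.read())  # v1 - a read against G2 now sees the write made on G1
```

The cost of this scheme is availability: while G1 waits for G2 to apply the write, requests are delayed, and if the partition between them fails, the write cannot complete.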

4. Availability

Availability means that a server must respond whenever it receives a user request. Users may issue reads to either G1 or G2; whichever server receives the request must answer with v0 or v1, otherwise availability is not satisfied.

CAP mechanism in Kafka

Kafka is a distributed message queue system, so it must obey the CAP theorem. Which two of the CAP properties does Kafka satisfy?

Kafka satisfies C and A in CAP, while using certain mechanisms to ensure partition tolerance as far as possible.

Where C stands for data consistency. A indicates data availability.

Kafka writes data to different partitions, and each partition may have several replicas. Data is first written to the leader partition, and all reads and writes go through the leader, which ensures consistency. Kafka then ensures availability through the partition replica mechanism. This leaves one more problem: how to handle divergence between the data in replica partitions and the data in the leader, i.e., partition tolerance.

Kafka addresses partition tolerance with its ISR synchronization strategy, which minimizes the impact of partitions.

Each leader maintains an In-Sync Replicas (ISR) list.

The ISR list determines which replica partitions are usable, that is, which replicas the leader's data can be synchronized to. Two conditions decide whether a replica partition stays in the ISR:

  • replica.lag.time.max.ms=10000: the maximum heartbeat delay between a replica partition and the leader partition

  • replica.lag.max.messages=4000: the maximum number of messages by which a replica partition may lag behind the leader partition
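The two conditions can be expressed as a simple predicate (a sketch using the thresholds from the text; note that later Kafka versions dropped `replica.lag.max.messages` and keep only the time-based bound):

```python
# Thresholds from the text: replica.lag.time.max.ms and
# replica.lag.max.messages.
MAX_LAG_MS = 10000
MAX_LAG_MESSAGES = 4000

def in_isr(heartbeat_lag_ms, message_lag):
    """A replica stays in the ISR only while BOTH lag bounds hold."""
    return heartbeat_lag_ms <= MAX_LAG_MS and message_lag <= MAX_LAG_MESSAGES

print(in_isr(2000, 100))    # True  - healthy follower
print(in_isr(12000, 100))   # False - heartbeat too stale
print(in_isr(2000, 5000))   # False - too many messages behind
```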

The producer uses request.required.acks to specify when a request is considered complete:

  • acks=0: the producer sends the next (batch of) messages without waiting for the broker to confirm.
  • acks=1 (default): the producer waits for the leader to successfully receive and acknowledge the data before sending the next message.
  • acks=-1: the producer sends the next message only after receiving the followers' acknowledgement as well.
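As a concrete sketch, these settings might appear in a producer configuration like the following plain dict (the key names mirror kafka-python's `KafkaProducer` parameters; the values are illustrative and no broker is contacted here):

```python
# The acks setting trades durability for latency. "all" is the
# modern spelling of -1.
producer_config = {
    "acks": "all",                 # wait for all in-sync replicas (-1)
    # "acks": 1,                   # wait for the leader only
    # "acks": 0,                   # fire-and-forget, no confirmation
    "retries": 3,                  # resend on missing or failed acks
    "request_timeout_ms": 10000,   # give up waiting for an ack after 10 s
}
print(producer_config["acks"])  # all
```

With `acks="all"`, a send succeeds only once every replica in the ISR has the record, matching the -1 case above.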