Kafka workflow and file storage mechanism

The working process

Kafka messages are organized by topic: producers write messages to topics, and consumers read messages from them.

A topic is a logical concept, while a partition is a physical one: each partition corresponds to a log file, and producers continuously append data to the end of that file. Every record carries its own offset, and each consumer tracks the offset it has consumed, so that after a failure it can resume consumption from the last consumed offset.
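As an illustration of offset tracking, here is a minimal consumer sketch (the broker address, group id, and topic name are placeholders) that commits its position after processing, so a restarted consumer resumes from the last committed offset:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetTrackingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit manually

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Committing marks these offsets as consumed; a restarted consumer
                // in the same group resumes from the last committed offset.
                consumer.commitSync();
            }
        }
    }
}
```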

Kafka file storage mechanism:

Messages produced by producers are continuously appended to the end of the log file. As the log file grows, locating data in it becomes inefficient, so Kafka uses segmentation plus indexing: each partition's log is divided into multiple segments.

Segment: each segment consists of two files, a .index file and a .log file.

Naming: segment files within a topic+partition directory are named after the offset of the first message in the segment.

The .index file records message offsets and their physical positions in the .log file. Because index entries are fixed-size, a known entry can be located with a constant-time (O(1)) seek, and the right entry for a given offset is found by binary search. The index is sparse, and its granularity can be set through configuration (e.g., log.index.interval.bytes).

The process of finding a message

**To find the message at offset 170417:**

The segment file name gives the base offset 170410, so the relative offset is 170417 - 170410 = 7. Binary-search the .index file for the largest indexed relative offset less than or equal to 7; that is 4, whose entry is (4, 476). Then seek to physical position 476 in the .log file and scan forward until the message with offset 170417 is reached.
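The lookup can be sketched as follows. This is a toy model of the search, not Kafka's actual implementation; the IndexEntry record and the entries array are assumptions standing in for the contents of a .index file:

```java
public class SegmentLookup {
    // One sparse index entry: relative offset -> physical position in the .log file.
    record IndexEntry(long relativeOffset, long position) {}

    /**
     * Find the starting position in the .log file for a target offset.
     * Mirrors the example: base offset 170410 comes from the segment file name.
     */
    static long startPosition(IndexEntry[] entries, long baseOffset, long targetOffset) {
        long relative = targetOffset - baseOffset; // 170417 - 170410 = 7
        // Binary search for the largest indexed relativeOffset <= relative.
        int lo = 0, hi = entries.length - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (entries[mid].relativeOffset() <= relative) {
                best = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return entries[best].position(); // scan forward in the .log file from here
    }

    public static void main(String[] args) {
        IndexEntry[] entries = {
            new IndexEntry(1, 0), new IndexEntry(4, 476), new IndexEntry(9, 1103)
        };
        // Prints 476: entry (4, 476) is the closest one at or below relative offset 7.
        System.out.println(startPosition(entries, 170410, 170417));
    }
}
```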

Data expiration mechanism

A segment file is closed (rolled) when it reaches a configured size or age; until then it is the active segment, in the write state. Active (open) segments are never deleted; only closed segments are. Note that retention is applied per segment: if retention is set to one day but a single segment holds five days of data, the data is removed only when the whole segment expires, i.e., after five days.
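For reference, segment rolling and retention can be set per topic. A sketch using the AdminClient (the broker address, topic name, and values are illustrative, not recommendations):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2) // placeholder topic
                    .configs(Map.of(
                            "segment.bytes", "1073741824",   // roll the segment at 1 GiB
                            "segment.ms", "604800000",       // or after 7 days, whichever first
                            "retention.ms", "86400000"));    // keep closed segments 1 day
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```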

Kafka producers

Partitioning strategies

Advantages of partitioning:

1. Multi-partition distributed storage makes it easy to scale out.
2. Reads and writes proceed concurrently across partitions, improving throughput.
3. Faster data recovery: when a machine fails, only the data on that machine needs to be recovered.
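As an example of a partitioning strategy, here is a minimal custom Partitioner sketch that routes records by hashing the key (the class name and fallback behavior are illustrative; the default Kafka producer already hashes keys this way):

```java
import java.util.Map;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Illustrative partitioner: hash the key bytes onto the partition count.
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // simplistic fallback; a real strategy would spread keyless records
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override public void configure(Map<String, ?> configs) {}
    @Override public void close() {}
}
```

It would be enabled by setting the producer's partitioner.class configuration to this class's fully qualified name.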

Ensuring data reliability
  • If a producer writes message A to a partition before message B, Kafka guarantees that B's offset is larger than A's and that consumers read A before B.
  • Data is considered committed only once it has been written to all replicas, and consumers can only read committed data. Producers can choose what counts as an acknowledgement: when the message is fully committed, when it is written to the leader, or as soon as it is sent over the network.
  • As long as at least one replica remains alive, committed messages are not lost.

Replicas:

Kafka's reliability rests on partitioning plus replication: each partition has a leader replica, and the other (follower) replicas only need to synchronize the leader's data in a timely manner.

The leader maintains a set of followers that are in sync with it, called the ISR (in-sync replica set). A follower that stays out of sync with the leader for too long is removed from the ISR. The conditions for membership are: the replica keeps an active session with ZooKeeper, and it fetches the leader's latest data within a configured time window.

If the leader fails, one of the followers in the ISR is elected as the new leader.

ACK response mechanism

Kafka offers three acknowledgement levels, so you can trade durability against latency according to your reliability requirements:

  • 0: the producer does not wait for the broker's ACK; the broker returns before the data has been written to disk, so messages can be lost.
  • 1: the producer waits for an ACK after the leader has written the record to its local log, without waiting for the followers; if the leader fails before the followers synchronize, data can be lost.
  • -1 (all): the producer waits for the broker's ACK until the leader and all followers in the ISR have persisted the record.
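A producer configured for the strongest setting might look like this sketch (the broker address and topic name are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AcksDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // "0", "1", or "all" (equivalent to -1): wait for the full ISR to persist the record.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value")); // placeholder topic
        }
    }
}
```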

Consumption consistency assurance

When a follower fails, it is removed from the ISR. After it recovers, it truncates the part of its log above the HW (high watermark) it recorded before the failure and resynchronizes from the leader starting at that HW. Once the follower's LEO (log end offset) is greater than or equal to the partition's HW, it has caught up with the leader and can rejoin the ISR. When the leader fails, a new leader is elected from the ISR; to stay consistent, the remaining followers truncate the parts of their logs above the HW and then synchronize from the new leader. TIPS: this only guarantees consistency between replicas, not that no data is lost.
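To make the truncate-and-resync step concrete, here is a toy model (plain in-memory lists, nothing like Kafka's real log implementation) of a follower rejoining after a failure:

```java
import java.util.ArrayList;
import java.util.List;

public class HwResyncDemo {
    public static void main(String[] args) {
        // Leader log and the partition HW (offsets 0..HW-1 are committed).
        List<String> leaderLog = new ArrayList<>(List.of("m0", "m1", "m2", "m3", "m4"));
        int hw = 3;

        // A recovered follower may hold uncommitted records beyond the HW it knew.
        List<String> followerLog = new ArrayList<>(List.of("m0", "m1", "m2", "x3"));

        // Step 1: truncate everything above the HW recorded before the failure.
        followerLog.subList(hw, followerLog.size()).clear();

        // Step 2: resynchronize from the leader starting at the HW.
        followerLog.addAll(leaderLog.subList(hw, leaderLog.size()));

        // The follower's LEO now matches the leader's; it may rejoin the ISR.
        System.out.println(followerLog); // [m0, m1, m2, m3, m4]
    }
}
```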

Message Sending Process

The Kafka producer sends messages asynchronously, using two threads and a shared variable. The main thread writes records into the shared RecordAccumulator, and the Sender thread pulls batches from the RecordAccumulator and pushes them to the Kafka brokers.
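From the application's point of view, the asynchrony is visible in send() returning immediately and completing via a callback; batching in the RecordAccumulator is tuned with batch.size and linger.ms. A minimal sketch (the broker address and topic name are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AsyncSendDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384); // batch size in the accumulator
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);      // wait up to 5 ms to fill a batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() only appends the record to the RecordAccumulator and returns;
            // the Sender thread transmits the batch and the callback fires on completion.
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();
                        } else {
                            System.out.printf("sent to partition %d at offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        }
    }
}
```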