Kafka profile

Apache Kafka is a distributed publish-subscribe messaging system and the dominant message queue in the big data ecosystem. Originally developed at LinkedIn in Scala, it was open-sourced through the Apache Incubator in 2011 and became a top-level Apache project in 2012. More than a decade on, it remains an integral and increasingly important component of big data platforms.

Kafka is suitable for both offline and online message consumption. Messages are kept on disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service and integrates well with Flink and Spark for real-time streaming data analysis.

Kafka features:

  1. Reliability: replicas and fault tolerance.
  2. Scalability: Kafka can scale nodes and bring them online without downtime.
  3. Persistence: Data is stored on disk and persisted.
  4. Performance: Kafka has high throughput and remains stable even with terabytes of stored messages.
  5. Speed: sequential disk writes and zero-copy technology keep Kafka's latency in the milliseconds.

Kafka underlying principles

Take a look at the architecture of a Kafka system:

Kafka supports message persistence. Consumers pull data actively, and consumption state and subscription relationships are maintained on the client side. A message is not deleted immediately after it is consumed, so when multiple subscribers exist, only one copy of the message needs to be stored.

  1. Broker: A Kafka cluster contains one or more service instances (nodes). This service instance is called a broker.
  2. Topic: Every message published to a Kafka cluster belongs to a category called topic.
  3. Partition: A partition is a physical concept. Each topic contains one or more partitions.
  4. Segment: Each partition is divided into segments, and each segment consists of two files: a .log file and an .index file. The .index file is used to look up the offset of data in the .log file.
  5. Producer: the producer of messages; publishes messages to Kafka brokers.
  6. Consumer: the consumer of messages; the client that reads messages from Kafka's brokers.
  7. Consumer Group: a consumer group; each consumer belongs to a specific consumer group (you can specify a group ID for each consumer).
  8. .log: stores the message data.
  9. .index: stores the index data for the .log file.

Kafka main components

1. Producer

Producers are the message producers in Kafka: they publish messages, which are classified by topic and stored on Kafka brokers.

2. Topic

  1. Kafka groups messages by topic;
  2. A topic is a category of message feed that Kafka handles.
  3. Topic is the name given to a category or feed of published records. Kafka topics support multiple subscribers; a topic can have zero, one, or many consumers subscribing to the data written to it.
  4. A Kafka cluster can hold any number of topics.
  5. Producers and consumers generally work at the topic level; finer granularity is available at the partition level.

3. Partition

In Kafka, a topic is a grouping of messages. A topic can have multiple partitions, each storing part of the topic's data; together, the partitions hold all of the topic's data.

A single broker can host multiple partitions, regardless of how many brokers the cluster has. In Kafka, each partition has a number, starting at 0. Data within a partition is ordered, but ordering is not guaranteed across partitions. (Ordering here means messages are consumed in the order they were produced.)
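The partition a message lands in is typically derived from its key. A rough sketch (Kafka's real default partitioner hashes the key with murmur2; CRC32 is used here only to keep the example self-contained):

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    # Simplified stand-in for Kafka's default partitioner, which
    # hashes the message key (murmur2) modulo the partition count.
    return zlib.crc32(key) % num_partitions

# Messages with the same key always land in the same partition,
# which is exactly why ordering holds per partition but not globally.
assert choose_partition(b"user-42", 4) == choose_partition(b"user-42", 4)
```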

4. Consumer

A consumer is a consumer in Kafka that consumes data in Kafka. A consumer must belong to a consumer group.

5. Consumer group

A consumer group consists of one or more consumers; within the same group, each message is consumed only once.

Each consumer belongs to a consumer group; if none is specified, the consumer belongs to the default group.

Each consumer group has an ID, the group ID. All consumers within a group coordinate to consume all partitions of a subscribed topic. Each partition can be consumed by only one consumer within a given consumer group, but the same partition can be consumed by multiple different consumer groups.

The number of partitions determines the maximum number of concurrently active consumers in each consumer group. The diagram below:

As shown on the left of the figure above, if there are only two partitions, then even with four consumers in the group, two will sit idle. As shown on the right, with four partitions there is one per consumer, for a maximum concurrency of 4.

Take a look at the following picture:

As shown in the figure above, different consumer groups consume the same topic, which has four partitions distributed across two nodes. Consumer group 1 on the left has two consumers, so each consumer must consume two partitions to cover all the messages; consumer group 2 on the right has four consumers, so each consumes one partition.

To summarize the relationship between partitions and consumer groups in Kafka:

Consumer group: a group of one or more consumers in which each message is consumed only once. The number of consumers in a group consuming a topic should be less than or equal to the number of partitions of that topic.

For example, if a topic has four partitions, the number of consumers in the consumer group should be at most 4, ideally 1, 2, or 4 so that the partitions divide evenly among them. Data in the same partition cannot be consumed simultaneously by different consumers in the same consumer group.

Conclusion: the more partitions a topic has, the more consumers can consume it concurrently, and the faster data is consumed, improving consumption performance.
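The partition-to-consumer relationship summarized above can be sketched as a simple assignment function (illustrative only; in real Kafka the assignment is negotiated by the group coordinator using strategies such as range or round-robin):

```python
def assign_partitions(partitions, consumers):
    # Round-robin: each partition goes to exactly one consumer in the
    # group; if consumers outnumber partitions, the extras stay idle.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions, 2 consumers -> two partitions each (group 1 in the figure)
print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
# 2 partitions, 4 consumers -> "c3" and "c4" sit idle
print(assign_partitions([0, 1], ["c1", "c2", "c3", "c4"]))
```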

6. Partition replicas

Partition replicas in Kafka look like this:

Replication-factor: controls how many brokers each message is stored on; it can be at most the number of brokers.

A single broker cannot hold more than one replica of the same partition, so when creating a topic the replication factor must be less than or equal to the number of available brokers.
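A minimal sketch of that constraint (the layout rule here is illustrative; Kafka's actual assignment also takes care to spread leaders evenly):

```python
def assign_replicas(num_partitions, replication_factor, brokers):
    # Each partition's replicas must sit on distinct brokers, so the
    # replication factor can never exceed the number of brokers.
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds available brokers")
    layout = {}
    for p in range(num_partitions):
        # Start each partition on a different broker, then place the
        # remaining replicas on the following brokers, wrapping around.
        layout[p] = [brokers[(p + r) % len(brokers)]
                     for r in range(replication_factor)]
    return layout

print(assign_replicas(2, 2, [0, 1, 2]))  # {0: [0, 1], 1: [1, 2]}
```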

Replication operates at the partition level: each partition has its own leader and follower replicas.

The primary replica is called the leader and the secondary replicas are called followers. (With a replication factor of N, Kafka assigns each partition one leader and N-1 followers.) Replicas that are in sync with the leader are called in-sync replicas (ISR).

Followers synchronize data from the leader by pulling. Both consumers and producers read and write data through the leader and do not interact with followers.

What the replication factor does: it makes reading and writing data in Kafka reliable.

The replication factor counts the leader itself, and replicas of the same partition cannot be placed on the same broker.

If a partition has a replication factor of 3 and one replica fails, the remaining two elect a leader among themselves; Kafka does not start a new replica on another broker. Starting one would require transferring the partition's data between machines, which would occupy network I/O for a long time, and Kafka, as a high-throughput messaging system, cannot afford that. So no replacement replica is started on another broker.

If all replicas hang, the producer will fail to write data to the specified partition.

ISR indicates the replicas that are currently available.

7. The segment file

A partition consists of several segment files. Each segment comprises two files: a .log file and an .index file. The .log file stores the data sent to the partition, while the .index file records index entries for the .log file, speeding up data lookup.

Relationship between index files and data files

Since they come in one-to-one pairs, there must be a relationship between them: the metadata in the index file points to the physical offset of a message in the corresponding data file.

For example, the index entry 3,497 means that the third message in the data file sits at physical offset 497.

In the data file, message 368772 is the 368,772nd message in the partition as a whole.

Note: the segment index file uses sparse indexing to reduce its size, and mmap (memory mapping) allows it to be operated on directly in memory. A sparse index stores a metadata pointer for only some of the data file's messages, which saves more storage space than a dense index but makes lookups take a little longer.

The relationship between.index and.log is as follows:

The left part of the image above is the index file, which stores key-value pairs. The key is the number of the message within the data file (the corresponding .log file), e.g. "1, 3, 6, 8...", denoting the first, third, sixth, and eighth messages in the log file.

So why aren't these numbers sequential in the index file? Because the index file does not index every message in the data file; it uses sparse storage, building an index entry every certain number of bytes. This prevents the index file from taking up too much space, so it can be kept in memory. The downside is that messages without index entries cannot be located in the data file directly and require a sequential scan, but the range of that scan is very small.

The value represents the message's number within the whole partition.

Take the metadata 3,497 in the index file: 3 means the third message from the top of the log data file on the right, and 497 is the message's physical offset (location); given sequential writes, it is also the 497th message in the whole partition.
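The lookup that the sparse index enables can be sketched as a binary search for the nearest indexed entry, followed by a short sequential scan (the entry numbers and offsets below mirror the figure and are illustrative):

```python
import bisect

# Sparse index entries: (message number in the segment, byte offset in .log)
index = [(1, 0), (3, 497), (6, 1407), (8, 2410)]

def scan_start(msg_no):
    # Find the greatest indexed message number <= msg_no; the sequential
    # scan for the exact message begins at that byte offset.
    keys = [k for k, _ in index]
    i = bisect.bisect_right(keys, msg_no) - 1
    return index[i][1]

print(scan_start(3))  # 497: message 3 is indexed directly
print(scan_start(4))  # 497: message 4 is not indexed, scan forward from 3
```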

Kafka creates folders under the log.dir directory we specify, each named (topic name)-(partition number). Under such a directory there are two kinds of files, as follows:

#Index file
00000000000000000000.index
#Log contents
00000000000000000000.log

The files in the directory are split according to log file size: when a .log file reaches 1 GB, a new segment is created, as follows:

-rw-r--r--. 1 root root 389K Jan 17 18:03 00000000000000000000.index
-rw-r--r--. 1 root root 1.0G Jan 17 18:03 00000000000000000000.log
-rw-r--r--. 1 root root  10M Jan 17 18:03 00000000000000077894.index
-rw-r--r--. 1 root root 127M Jan 17 18:03 00000000000000077894.log

In Kafka's design, the offset is used as part of the file name.

Each segment file is named after the offset that follows the last message of the previous segment, i.e. the base offset of its own first message. Offsets are 64-bit values, rendered as 20 digits and padded with leading zeros.
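That naming rule is easy to reproduce (a sketch, using the base offsets from the listing above):

```python
def segment_name(base_offset: int) -> str:
    # A segment is named after the offset of its first message,
    # zero-padded to 20 digits.
    return f"{base_offset:020d}"

print(segment_name(0) + ".log")      # 00000000000000000000.log
print(segment_name(77894) + ".log")  # 00000000000000077894.log
```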

The index information is used to locate messages quickly. By mapping all of the index metadata into memory, disk I/O on the segment index file can be avoided.

Sparse storage of index files can greatly reduce the space occupied by index file metadata.

Sparse indexing: index entries are created not for every record but for each interval of records. Benefit: the number of index entries is reduced. Drawback: once the right interval is found, a second pass (a sequential scan) is needed.

8. The physical structure of a message

Every message a producer sends to Kafka is wrapped by Kafka into a message structure.

The physical structure of message is shown below:

So the message sent by the producer is not stored as-is; Kafka wraps it, and each stored message has the structure shown above. Only the last field is the actual payload the producer sent.
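As a rough sketch of such wrapping, the legacy (v0) message layout placed a CRC, a magic byte, and an attributes byte before the length-prefixed key and value; treat the exact field sizes here as illustrative of the idea rather than a spec:

```python
import struct
import zlib

def wrap_message(key: bytes, value: bytes) -> bytes:
    # Header fields before the payload: magic byte and attributes byte,
    # then length-prefixed key and value; a CRC over all of it comes first.
    body = struct.pack(">bb", 0, 0)                # magic=0, attributes=0
    body += struct.pack(">i", len(key)) + key      # key length + key
    body += struct.pack(">i", len(value)) + value  # value length + value
    return struct.pack(">I", zlib.crc32(body)) + body

msg = wrap_message(b"k", b"hello")
# Only the trailing 5 bytes are the payload the producer actually sent.
print(len(msg))  # 20 bytes of wrapped message for a 5-byte value
```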

How Kafka prevents data loss

1. Producer data is not lost

Sending mode

Producers send data to Kafka either synchronously or asynchronously.

Synchronization mode:

After sending a batch of data to Kafka, the producer waits for Kafka to return the result:

  1. The producer waits 10 seconds; if the broker does not respond with an ack, the send is considered failed.
  2. The producer retries three times; if there is still no response, an error is reported.

Asynchronous mode:

Send a batch of data to Kafka, providing only a callback function:

  1. The data is first stored in a buffer on the producer side; the buffer holds up to 20,000 records.
  2. Data is sent when either the time threshold or the quantity threshold is reached.
  3. A batch of 500 records is sent at a time.

Note: If the broker is slow to ack and the buffer is full, the developer can set whether to empty the buffer directly.
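The buffering behaviour described above can be sketched as a toy sender (the thresholds come from the text; the class itself is illustrative, not the real client):

```python
import time

class BufferedSender:
    """Toy sketch of the async path: messages queue in a buffer and are
    flushed as a batch when either the batch-size threshold or the
    linger-time threshold is reached."""
    def __init__(self, batch_size=500, linger_secs=1.0, capacity=20000):
        self.batch_size = batch_size
        self.linger_secs = linger_secs
        self.capacity = capacity
        self.buffer = []
        self.last_flush = time.monotonic()
        self.sent_batches = []

    def send(self, msg):
        if len(self.buffer) >= self.capacity:
            # Broker slow to ack and buffer full: the text notes the
            # developer can choose to simply discard the buffered data.
            self.buffer.clear()
        self.buffer.append(msg)
        if (len(self.buffer) >= self.batch_size
                or time.monotonic() - self.last_flush >= self.linger_secs):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sent_batches.append(list(self.buffer))
            self.buffer.clear()
        self.last_flush = time.monotonic()

sender = BufferedSender(batch_size=3, linger_secs=60)
for i in range(7):
    sender.send(i)
# Two full batches of 3 went out; one message is still buffered.
print(sender.sent_batches)  # [[0, 1, 2], [3, 4, 5]]
```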

Ack mechanism (acknowledgement mechanism)

When a producer sends data, the server returns an acknowledgement code, the ack. The ack has three possible status values: 0, 1, and -1.

0: the producer only sends the data and does not check whether it arrived; lost data must be sent again

1: the partition leader has received the data; the response code is 1 regardless of whether the followers have synchronized it yet

-1: the leader and all in-sync replicas have received the data; the response code is -1

If the broker never returns an ack, the producer cannot know whether the send succeeded. The producer can set a timeout period, say 10 seconds, after which the send is treated as failed.
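In the real producer configuration these levels map onto the `acks` setting; a minimal illustrative fragment (values chosen to match the text):

```properties
# the "-1" level above: wait for the leader and the full ISR
acks=all
# try three times before reporting an error
retries=3
# give up waiting for an ack after 10 seconds
request.timeout.ms=10000
```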

2. Data is not lost in the broker

Within the broker, data loss is prevented mainly through the replication factor (redundancy).

3. Consumer consumption data is not lost

When consumers consume data, as long as each consumer records its offset, data will not be lost; that is, we maintain the offset ourselves, and it can be saved in Redis, for example.
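A minimal sketch of that bookkeeping, with a plain dict standing in for the external store such as Redis:

```python
class OffsetTracker:
    """Commit the next offset to read after each message is processed, so
    a restarted consumer resumes exactly where it left off."""
    def __init__(self, store):
        self.store = store  # stands in for Redis or another external store

    def committed(self, topic, partition):
        return self.store.get((topic, partition), 0)

    def commit(self, topic, partition, next_offset):
        self.store[(topic, partition)] = next_offset

tracker = OffsetTracker({})
messages = ["a", "b", "c"]
start = tracker.committed("orders", 0)
for i in range(start, len(messages)):
    # ... process messages[i] ...
    tracker.commit("orders", 0, i + 1)

print(tracker.committed("orders", 0))  # 3: resume from offset 3 on restart
```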
