What is Apache Kafka?
Apache Kafka is a publish-subscribe messaging system maintained by the Apache Software Foundation. It is a distributed, partitioned, and replicated commit-log service.
What is the traditional messaging method?
There are two traditional methods of messaging:
- Queuing: a pool of consumers reads messages from the server, and each message is delivered to exactly one of them.
- Publish-subscribe: messages are broadcast to all subscribers.
What are the advantages of Kafka over traditional technologies?
- Fast: a single Kafka broker can serve thousands of clients, handling megabytes of reads and writes per second.
- Scalable: data is partitioned and spread across a cluster of machines, so capacity can grow to support larger data sets.
- Persistent: messages are persisted to disk and replicated across the cluster to prevent data loss.
- Distributed by design: the architecture provides fault tolerance and durability.
What is the meaning of a broker in Kafka?
A broker receives data from producers, persists it, and serves it to the consumers that subscribe to topics on the Kafka cluster. There is no master-slave relationship among brokers; ZooKeeper is used for coordination. In short, a broker is responsible for receiving, storing, and serving messages.
What is a broker? What does it do?
A single Kafka server is a broker. Its main job is to receive messages from producers, assign offsets to them, and save them to disk. It also serves requests from consumers and other brokers, processing each request and responding according to its type. In a typical production environment, a single broker has exclusive use of a physical server.
What is the maximum size of a message that the Kafka server can receive?
By default, the maximum size of a message that the Kafka server can receive is 1,000,000 bytes (about 1 MB); the limit is configurable.
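On the broker this limit is controlled by `message.max.bytes` (and per topic by `max.message.bytes`); the producer has a matching `max.request.size`. A minimal sketch of raising the producer-side limit, assuming a broker at localhost:9092 and a hypothetical `demo-topic` (the broker/topic limits must be raised to match):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LargeMessageProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Raise the producer-side request limit; the broker's message.max.bytes
        // (or the topic's max.message.bytes) must be raised to match,
        // or oversized sends are rejected.
        props.put("max.request.size", "2000000");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "a large payload"));
        }
    }
}
```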
What is Kafka’s Zookeeper? Can we use Kafka without Zookeeper?
ZooKeeper is an open-source, high-performance coordination service for distributed applications such as Kafka. It coordinates the Kafka brokers and stores metadata: consumer offsets, broker information, topic information, and partition information.
No, it is not possible to bypass ZooKeeper and contact the Kafka broker directly. If ZooKeeper stops working, Kafka cannot serve client requests.
ZooKeeper is used to coordinate communication between the nodes in a cluster.
In Kafka, it is used to commit offsets, so if a node fails, the consumer can resume from the previously committed offset.
In addition, it performs other activities such as leader election, distributed synchronization, configuration management, detecting when a node joins or leaves the cluster, cluster membership, and real-time node status.
Explain how Kafka’s consumers consume messages.
Message delivery in Kafka uses the sendfile (zero-copy) API. It transfers bytes from disk to the socket entirely inside the kernel, saving the extra copies through user space and the user-kernel context switches they would require. With zero copy, the application asks the kernel to move the data, and the bytes flow directly from disk (the page cache) to the socket buffer, then to the NIC buffer, and out onto the network.
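In Java, the sendfile mechanism is exposed as FileChannel.transferTo, which is what the broker relies on for this transfer. A minimal standalone sketch of the same pattern, assuming a local log-segment file and an already-connected socket channel:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    // Transfers the file's bytes to the socket without copying them
    // through user space; on Linux this maps to the sendfile syscall.
    static void send(Path logSegment, SocketChannel socket) throws IOException {
        try (FileChannel channel = FileChannel.open(logSegment, StandardOpenOption.READ)) {
            long position = 0;
            long remaining = channel.size();
            while (remaining > 0) {
                long sent = channel.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```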
Explain how to improve the throughput of remote consumers.
If the consumer is in a different data center from the broker, you may need to tune the socket buffer size to amortize the long network latency.
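On the consumer this tuning surfaces as the `receive.buffer.bytes` setting. A hedged sketch of a cross-datacenter consumer config; the broker address, group id, and the 512 KB value are illustrative assumptions, not recommendations:

```java
import java.util.Properties;

public class RemoteConsumerConfig {
    // Builds a consumer config tuned for a high-latency WAN link.
    static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.remote-dc.example:9092"); // hypothetical remote broker
        props.put("group.id", "remote-readers"); // hypothetical group
        // A larger TCP receive buffer keeps the pipe full despite the
        // long round-trip time; -1 would defer to the OS default.
        props.put("receive.buffer.bytes", "524288"); // 512 KB, illustrative
        return props;
    }
}
```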
Explain how you can get exactly-once messaging from Kafka when producing data.
To get exactly-once semantics from Kafka, you have to handle two things: avoiding duplication when producing data and avoiding duplication when consuming data. There are two classic ways to get exactly-once semantics during data production (see the sketch after this list):
- Use a single writer per partition, and whenever you see a network error, check the last message in that partition to see whether your last write succeeded.
- Include a primary key (a UUID or similar) in each message and deduplicate on the consumer side.
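Since Kafka 0.11, the idempotent producer automates the first approach: the broker deduplicates retried writes per partition using producer IDs and sequence numbers. A minimal sketch, assuming a broker at localhost:9092 and a hypothetical `payments` topic:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // The broker discards duplicates caused by retries after network
        // errors, giving exactly-once semantics per partition on the
        // produce side.
        props.put("enable.idempotence", "true");
        props.put("acks", "all"); // required by idempotence
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "order-42", "charged"));
        }
    }
}
```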
Explain how to reduce churn in the ISR. When does a broker leave the ISR?
The ISR is the set of replicas that are fully synchronized with the leader, which means the ISR contains all committed messages. The ISR should contain all replicas until a real failure occurs. If a replica falls behind or disconnects from the leader, it is removed from the ISR.
Why does Kafka need replication?
Kafka’s message replication ensures that published messages are not lost and can survive machine failures, program errors, or, more commonly, software upgrades.
What does it mean if a replica stays out of the ISR for a long time?
If a replica remains outside the ISR for a long period of time, it indicates that the follower cannot fetch data as quickly as the leader accumulates it.
14. What happens if the preferred replica is not in the ISR?
If the preferred replica is not in the ISR, the controller will not be able to transfer leadership to the preferred replica.
15. Is it possible to get the message offset after producing?
In most queuing systems, the class acting as a producer cannot do this; its role is to fire and forget the message, and the broker does the rest of the work, such as managing metadata (IDs, offsets, and so on). As a consumer of messages, you can get offsets from the Kafka broker. If you look at the SimpleConsumer class, you’ll notice that it fetches a MultiFetchResponse object that includes offsets as a list. In addition, when you iterate over Kafka messages, you get MessageAndOffset objects that include both the offset and the message payload.
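SimpleConsumer, MultiFetchResponse, and MessageAndOffset belong to the legacy Scala client; in the modern Java client the same information is available on both sides. A minimal sketch, assuming a broker at localhost:9092 and a hypothetical `demo-topic`: the broker's acknowledgement carries the offset it assigned to the produced record.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class OffsetAfterSend {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Blocking on the send's Future yields the assigned offset.
            RecordMetadata meta =
                producer.send(new ProducerRecord<>("demo-topic", "hello")).get();
            System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
        }
    }
}
```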
16. Main features of Kafka
Kafka offers near-real-time message processing and can store and query massive volumes of messages efficiently.
- Kafka supports batched reads and writes and compresses batches of messages, improving network utilization and compression efficiency (see the sketch below).
- Kafka supports message partitioning and guarantees the order of messages within each partition.
- Kafka supports creating multiple replicas for each partition. Only the leader replica handles reads and writes; the other replicas only synchronize with the leader. This improves the disaster-recovery capability of the data.
- Kafka distributes leader replicas evenly across the servers in a cluster to maximize performance.
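Batching and compression are exposed as producer settings. A minimal sketch of a config helper; the broker address and the specific values are illustrative assumptions:

```java
import java.util.Properties;

public class BatchingProducerConfig {
    static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // Accumulate up to 64 KB or 20 ms of records per partition,
        // then compress the whole batch before sending.
        props.put("batch.size", "65536");
        props.put("linger.ms", "20");
        props.put("compression.type", "lz4");
        return props;
    }
}
```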
Examples of Kafka application scenarios
- Log collection: a company can use Kafka to collect the logs of a variety of services and expose them through a unified interface to consumers such as Hadoop, HBase, and Solr.
- Message systems: decoupling producers and consumers, caching messages, and so on.
- User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing, searching, and clicking. These events are published by various servers to Kafka topics, which subscribers consume for real-time monitoring and analysis, or load into Hadoop or a data warehouse for offline analysis and mining.
- Operational metrics: Kafka is also used to record operational monitoring data, collecting metrics from distributed applications and producing centralized feeds for operations such as alerts and reports.
- Stream processing: feeding frameworks such as Spark Streaming and Storm.
- Event sourcing.
Kafka topic partitions
Each topic in Kafka can be divided into multiple partitions, and each partition has multiple replicas. Different partitions contain different messages, which improves concurrent read and write capability. The leader replica handles read and write requests, while the follower replicas synchronize messages from the leader. If the leader replica fails, a new leader is elected from the followers to continue serving. In this way, increasing the number of partitions achieves horizontal scaling, and increasing the number of replicas improves disaster resilience (see the sketch below).
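A minimal sketch of creating such a topic with the Java AdminClient; the broker address, topic name, partition count, and replication factor are illustrative assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions for parallelism, replication factor 3 for resilience.
            NewTopic topic = new NewTopic("page-views", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```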
19. How is consumer-level scaling implemented?
Kafka scales consumption horizontally by letting multiple consumers join a consumer group. Within a consumer group, each partition can be assigned to only one consumer, so you can add new consumers to the group to increase its consumption capacity. When a consumer in the group fails or leaves, a rebalance is triggered and the partitions it handled are reassigned to the other consumers (see the sketch below).
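A minimal sketch of one group member: running several copies of this program with the same group.id splits the topic's partitions among them. The broker address, group id, and topic name are assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "page-view-processors"); // same id = same group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("page-views"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("p%d@%d: %s%n",
                        record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```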
Order of messages
Kafka guarantees that messages are ordered within a partition but does not guarantee ordering across partitions. Each topic can be divided into multiple partitions; different partitions within a topic contain different messages, and each message is assigned an offset when it is appended to a partition. The offset is the message’s unique number within that partition, and Kafka uses it to guarantee ordering inside the partition; offsets impose no order across partitions.
Kafka periodically removes stale messages to avoid disk overloading. What is the deletion strategy?
- One is based on how long messages have been retained (time-based retention).
- One is based on the size of the data stored in the topic (size-based retention); see the sketch after this list.
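Both strategies are exposed as topic-level configs (`retention.ms` and `retention.bytes`). A minimal sketch of setting them at topic creation with the Java AdminClient; the broker address, topic name, and values are illustrative assumptions:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionConfiguredTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("click-stream", 3, (short) 2)
                // Delete segments older than 7 days or once the partition
                // exceeds ~1 GB, whichever limit is hit first.
                .configs(Map.of(
                    "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000),
                    "retention.bytes", String.valueOf(1024L * 1024 * 1024)));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```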
22. What is log compaction?
In many scenarios, the mapping between a message’s key and its value changes constantly. If consumers only care about the latest value of each key, you can enable Kafka log compaction (see the sketch below).
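Compaction is enabled per topic through the `cleanup.policy` config. A minimal sketch, with an assumed broker address and a hypothetical `user-profiles` topic:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Keep only the latest value per key instead of deleting by age.
            NewTopic topic = new NewTopic("user-profiles", 3, (short) 2)
                .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```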
Do multiple replicas of the same partition contain the same messages?
Each replica contains the same messages, but at any given moment the replicas are not exactly identical, because the followers lag slightly behind the leader.
24. What is the ISR set? Who maintains it, and how?
The In-Sync Replica (ISR) set is the set of replicas that are currently alive and whose message log is close to the leader’s; it is a subset of the full replica set. A replica in the ISR set must meet the following requirements: the node hosting the replica must maintain its connection to ZooKeeper, and the offset of the last message in the replica must not lag the offset of the last message in the leader replica by more than a specified threshold. The leader replica of each partition maintains the ISR set for that partition. Write requests are first processed by the leader replica, and the follower replicas then pull the written messages from the leader. This process introduces delay, so followers store slightly fewer messages than the leader, which is tolerable as long as the lag does not exceed the threshold.
25. What is Kafka designed for?
Kafka organizes messages by topic. A program that publishes messages to a Kafka topic is called a producer, and a program that subscribes to topics and consumes their messages is called a consumer. Kafka runs as a cluster of one or more servers, each called a broker. Producers send messages over the network to the Kafka cluster, which serves them to consumers.
26. What are the three transaction levels for transmitting data?
Data-transfer transactions are usually defined at three levels:
(1) At most once: messages are not delivered repeatedly; each is transmitted at most once, but it may not be transmitted at all.
(2) At least once: messages are never lost; each is transmitted at least once, but it may be transmitted repeatedly.
(3) Exactly once: no message is missed and none is repeated; every message is transmitted once and only once, which is what is usually wanted (see the sketch below).
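On the consumption side, the difference between at-most-once and at-least-once comes down to when offsets are committed relative to processing. A minimal sketch of an at-least-once consumer loop, with an assumed broker address and a hypothetical `orders` topic:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "order-handlers"); // hypothetical group
        props.put("enable.auto.commit", "false"); // commit manually
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("orders"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    process(record); // a crash here means the record is redelivered
                }
                // Committing *after* processing gives at-least-once; committing
                // *before* processing would give at-most-once instead.
                consumer.commitSync();
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```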
What are the two conditions for Kafka to determine whether a node is alive?
The node must be able to maintain its connection with ZooKeeper; ZooKeeper checks each node's connection through its heartbeat mechanism.
If the node is a follower, it must be able to replicate the leader's writes in a timely manner and must not fall too far behind.
Does the producer send data directly to the broker's leader?
Yes. The producer sends data directly to the leader broker of the target partition, without routing it through intermediate nodes. To help the producer do this, every Kafka node can report which nodes are alive and where the leader of each partition of the target topic is, so the producer can send messages straight to the destination.
Can a Kafka consumer consume messages from a specific partition?
When a Kafka consumer consumes messages, it sends a "fetch" request to the broker for a specific partition. The consumer specifies an offset in the log and consumes messages from that position onward. This makes it possible to roll back to an earlier offset and re-consume previous messages (see the sketch below).
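A minimal sketch of rewinding with the Java consumer's seek; the broker address, topic, partition, and offset are illustrative assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RewindingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("audit-log", 0);
            consumer.assign(Collections.singleton(partition));
            consumer.seek(partition, 1000L); // re-consume starting at offset 1000
            System.out.println(consumer.poll(Duration.ofSeconds(1)).count());
        }
    }
}
```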
Do Kafka messages use Pull mode or Push mode?
The question Kafka initially considered was whether consumers should pull messages from brokers or brokers should push messages to consumers, i.e. pull versus push. In this respect, Kafka follows the traditional design shared by most messaging systems: producers push messages to the broker, and consumers pull messages from the broker. Some systems, such as Scribe and Apache Flume, instead use a push model that pushes messages downstream to consumers. Both approaches have pros and cons.
In push mode, the broker determines the rate at which messages are pushed, which makes it hard to serve consumers with different consumption rates. Messaging systems aim to let consumers consume messages as fast as possible, but unfortunately, in push mode, when the broker pushes messages much faster than the consumer can process them, the consumer can be overwhelmed and crash. And without knowing the consumption capacity and strategy of downstream consumers, the push model must guess whether to push each message immediately or buffer and push in batches; if it pushes at a lower rate to avoid crashing a consumer, it may waste effort pushing too few messages at a time.
Another advantage of pull is that consumers can decide for themselves whether to pull data from the broker in bulk. A disadvantage of pull is that if the broker has no messages available for consumption, consumers will poll in a loop until new messages arrive. To avoid this, Kafka provides parameters that let the consumer's fetch request block until a new message arrives (or until enough messages accumulate to be sent in a batch); see the sketch below.
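In the Java consumer, this blocking behavior surfaces as the `fetch.min.bytes` and `fetch.max.wait.ms` settings. A minimal sketch of a consumer config that waits for batches instead of busy-polling, assuming a broker at localhost:9092 (the values are illustrative):

```java
import java.util.Properties;

public class BatchingFetchConfig {
    static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "batch-readers"); // hypothetical group
        // Each fetch blocks until at least 64 KB is available or 500 ms
        // has passed, avoiding a tight polling loop on an idle topic.
        props.put("fetch.min.bytes", "65536");
        props.put("fetch.max.wait.ms", "500");
        return props;
    }
}
```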
31. What is the message format Kafka stores on hard disk?
A message consists of a fixed-length header and a variable-length byte array. The header contains a version number and a CRC32 checksum:
- Message length: 4 bytes (value: 1 + 4 + N)
- Version: 1 byte
- CRC32 checksum: 4 bytes
- Message body: N bytes
Kafka efficient file storage design features:
- Within a topic, Kafka splits each partition's large log file into several smaller segment files. Using these small segments makes it easy to periodically clean up or delete already-consumed files and reduces disk usage.
- Indexing information allows you to quickly locate messages and determine the maximum size of a response.
- Index metadata can be mapped to memory to avoid I/O operations on segment files.
- Sparse storage of index files can greatly reduce the space occupied by index file metadata.
33. There are three key differences between Kafka and traditional messaging systems
- Kafka is a distributed system: it runs as a cluster, is flexible and scalable, and replicates data internally for fault tolerance and high availability.
- Kafka persists messages to disk, so it can serve as a storage system rather than only a transport.
- Kafka supports real-time stream processing.
How does Kafka create a topic and place its partitions on different brokers?
The replication factor cannot be greater than the number of brokers. The first replica of the first partition (partition 0) is placed on a broker chosen at random from the broker list; the first replicas of the other partitions are shifted one broker to the right relative to the previous partition. That is, if we have five brokers and five partitions and the first partition is placed on the fourth broker, the second partition will be placed on the fifth broker, the third partition on the first broker, the fourth partition on the second broker, and so on. The positions of the remaining replicas relative to the first replica are determined by nextReplicaShift, which is also randomly generated.
Where will Kafka create a partition directory?
Before starting the Kafka cluster, you need to configure the log.dirs parameter, which lists the directories where Kafka data is stored. It can be set to multiple directories separated by commas, usually placed on different disks to improve read and write performance. (You can also configure log.dir, which means the same thing; you only need to set one of the two.) If log.dirs is configured with only one directory, all partitions assigned to the broker create their data folders in that directory. But if log.dirs is configured with multiple directories, in which one does Kafka create a new partition directory? Kafka creates the new partition directory (topic name + partition ID) in the directory that currently contains the fewest partition folders. Note that this is the directory with the fewest partition folders in total, not the directory with the least disk usage! So if you add a new disk to log.dirs, new partition directories will be created on the new disk until it no longer holds the fewest partition folders.
How is a partition's data saved to disk?
The partitions of a topic are stored as folders on the brokers. Messages within each partition are assigned offsets incrementing from 0, and each partition folder holds multiple segments (paired xxx.index and xxx.log files). The maximum segment size is set in the configuration and can be changed as needed; the default is 1 GB. When a segment grows beyond this size, a new segment is rolled, named after the offset following the last message of the previous segment.
Kafka ack mechanism
request.required.acks has three values: 0, 1, and -1.
- 0: the producer does not wait for an ack from the broker. This gives the lowest latency but the weakest durability guarantee.
- 1: the server waits until the leader replica confirms receipt of the message before sending the ack. However, if the leader dies before the followers have finished replicating, data can be lost.
- -1: building on 1, the server waits until all follower replicas have received the data before the leader sends the ack, so data is not lost.
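In the modern Java producer this setting is called `acks`. A minimal sketch of a helper that builds a producer config for a given acknowledgement level, assuming a broker at localhost:9092:

```java
import java.util.Properties;

public class AckLevels {
    // Valid values: "0", "1", or "all" (equivalent to -1).
    static Properties withAcks(String acks) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // "0": fire and forget; "1": the leader has written the record;
        // "all": every in-sync replica has written the record.
        props.put("acks", acks);
        return props;
    }
}
```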
38. How do Kafka consumers consume data?
Every time a consumer consumes data, it records the offset of the position it has consumed up to; the next time it consumes, it continues from that last recorded position.
Consumer load balancing strategy
Within a consumer group, each partition is assigned to exactly one consumer member, which guarantees that every partition is consumed by a single member. If there are more members in the group than partitions, the extra members will be idle.
40. Is data ordered?
Data is ordered within a single partition as seen by its consumer; across partitions, and hence across the consumers of a group, there is no ordering guarantee.
41. Data partitioning policy for Kafka data production
The producer decides which partition of the cluster each record is written to. Every message is a key-value pair; the producer supplies the key, and the key determines which partition the record lands in.
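A minimal sketch of key-based partitioning with the Java producer: records with the same key are hashed to the same partition by the default partitioner, preserving per-key order. The broker address and topic name are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition, so events for one user stay ordered.
            producer.send(new ProducerRecord<>("user-events", "user-42", "login"));
            // An explicit partition can also be chosen directly:
            producer.send(new ProducerRecord<>("user-events", 0, "user-42", "logout"));
        }
    }
}
```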