
Overview

Kafka, originally developed by LinkedIn, is a distributed, partitioned, multi-replica, multi-subscriber, ZooKeeper-coordinated distributed logging system (it can also be used as an MQ), commonly used for web/Nginx logs, access logs, messaging services, and so on. LinkedIn donated it to the Apache Foundation in 2010, and it became a top-level open source project.

Its typical application scenarios are log collection systems and messaging systems.

Kafka’s main design goals are as follows:

  • Message persistence with O(1) time complexity, guaranteeing constant-time access performance even over terabytes of data.
  • High throughput: even on very cheap commodity machines, a single node can support 100K messages per second.
  • Partitioning of messages across Kafka servers and distributed consumption, while preserving message order within each partition.
  • Support for both offline data processing and real-time data processing.
  • Scale out: supports online horizontal scaling.

Kafka cluster architecture

1. Broker: A Kafka cluster typically consists of multiple brokers to maintain load balance. Kafka brokers are stateless, so they rely on ZooKeeper to maintain cluster state. A single Kafka broker instance can handle hundreds of thousands of reads and writes per second, and each broker can hold terabytes of messages without a performance impact. Broker leader election is performed through ZooKeeper.
2. ZooKeeper: ZooKeeper is used to manage and coordinate the Kafka brokers. Its main role is to notify producers and consumers when a new broker joins the Kafka system or when a broker fails. On receiving such a notification, producers and consumers adjust their decisions and start coordinating their work with the remaining brokers.
3. Producer: Producers push data to brokers. When a new broker starts, all producers discover it and can automatically send messages to it. A Kafka producer does not have to wait for acknowledgement from the broker and can send messages as fast as the broker can handle them (a minimal producer sketch follows this table).
4. Consumer: Because Kafka brokers are stateless, consumers must track how much they have consumed using partition offsets. If a consumer acknowledges a particular message offset, this implies it has consumed all preceding messages. Consumers issue asynchronous pull requests to the broker, which prepares a buffer of bytes ready for consumption. A consumer can rewind or skip to any point in a partition simply by supplying an offset value. Consumer offset values are tracked with the help of ZooKeeper.
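
To make the producer behavior above concrete, here is a minimal sketch using the official Kafka Java client. The broker address and topic name are assumptions, and acks=0 reproduces the fire-and-forget behavior described in the table (the producer does not wait for broker acknowledgement):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DemoProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=0: do not wait for broker acknowledgement (fire and forget),
        // matching the producer behavior described in the table above.
        props.put("acks", "0");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Push a message to the (assumed) topic "demo-topic".
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello kafka"));
        }
    }
}
```

In production you would usually raise acks to 1 or all, trading some throughput for stronger delivery guarantees.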

Infrastructure

If this architecture diagram looks confusing, don’t worry! Let’s go through the concepts first.

Producer: the producer of messages, i.e., the entry point through which data flows into Kafka.

Kafka cluster:

Broker: a broker is a Kafka instance; each server runs one or more Kafka instances. For simplicity, assume each broker corresponds to one server. Each broker in a Kafka cluster has a unique number, such as broker-0, broker-1, and so on.

Topic: the topic of a message, which can be understood as a category of messages; Kafka’s data is stored in topics. Multiple topics can be created on each broker.

Partition: a partition of a topic. Each topic can have multiple partitions, which are used to spread load and improve Kafka’s throughput. Data for the same topic is not duplicated across partitions, and each partition is stored on disk as a folder!

Replication: each partition has multiple replicas, which serve as backups. When the primary partition (the Leader) fails, a follower replica is elected to become the new Leader. The default maximum number of replicas in Kafka is 10, and the number of replicas cannot exceed the number of brokers. Followers and the leader must be on different machines, and a machine can hold at most one replica of a given partition (the leader counts as a replica).
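
To tie the topic, partition, and replication concepts together, here is a hedged sketch that creates a topic with the Kafka AdminClient; the topic name, partition count, replication factor, and broker address are all illustrative assumptions:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Topic "demo-topic" (assumed name) with 3 partitions and a
            // replication factor of 2: each partition gets one leader and
            // one follower replica, placed on different brokers.
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```

Note that a replication factor of 2 requires at least 2 brokers, consistent with the rule above that the replica count cannot exceed the broker count.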

Message: the body of each message that is sent.

Consumer: the consumer of messages, i.e., the party that pulls messages out of Kafka.

Consumer Group: multiple consumers can be combined into a consumer group. In Kafka’s design, data in a given partition can only be consumed by one consumer within a consumer group, but consumers in the same group can consume from different partitions of the same topic, which improves Kafka’s throughput!
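
A minimal consumer-group sketch, again using the official Java client (the broker address, group id, and topic name are assumptions). Running several copies of this program with the same group.id spreads the topic’s partitions across them, which is exactly the throughput mechanism described above:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DemoConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // Consumers sharing the same group.id form one consumer group;
        // each partition is assigned to exactly one consumer in the group.
        props.put("group.id", "demo-group");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("demo-topic"));
            while (true) {
                // Pull model: the consumer asks the broker for new records.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}
```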

Zookeeper: the Kafka cluster relies on ZooKeeper to store cluster metadata and ensure system availability.

Kafka and ZooKeeper

Kafka relies heavily on a ZooKeeper cluster (so the earlier ZooKeeper article comes in handy). All brokers register with ZooKeeper at startup in order to elect a controller. The election process is straightforward, with no sophisticated algorithm involved: essentially, the first broker to register its controller node in ZooKeeper wins.

The controller listens on several directories in ZooKeeper. For example, brokers register themselves under the /brokers/ids directory using their ids, such as /brokers/ids/0.

When registering, each node exposes its hostname, port, and other information. The controller reads this data from the registered nodes (through ZooKeeper’s watch mechanism), builds the cluster metadata, and distributes it to the other servers, so that every server is aware of the other members of the cluster.

For example, when we create a topic, Kafka records the partitioning scheme in a directory of this kind. The controller listens for the change, synchronizes the metadata from the directory, and pushes it to the other brokers, so the entire cluster is informed of the partitioning scheme; each broker then creates its own directories and waits for the partition replicas to be created. This is the management mechanism of the whole cluster.
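
To see this registration mechanism for yourself, here is a small sketch (assuming a ZooKeeper ensemble at localhost:2181) that reads the broker registrations in much the same way the controller does:

```java
import org.apache.zookeeper.ZooKeeper;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        // Connect to the (assumed) ZooKeeper ensemble backing the Kafka cluster.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });
        // Each live broker registers an ephemeral node under /brokers/ids.
        for (String id : zk.getChildren("/brokers/ids", false)) {
            // The node data holds the broker's host, port, etc. as JSON.
            byte[] data = zk.getData("/brokers/ids/" + id, false, null);
            System.out.println("broker " + id + " -> " + new String(data));
        }
        zk.close();
    }
}
```

Because these nodes are ephemeral, they disappear when a broker’s ZooKeeper session ends, which is how the controller detects broker failures.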

Kafka features

(1) High throughput, low latency: Kafka can process hundreds of thousands of messages per second with latency as low as a few milliseconds; each topic can be divided into multiple partitions, and consumer groups consume the partitions in parallel;

(2) Scalability: Kafka clusters support hot scaling;

(3) Persistence and reliability: messages are persisted to the local disk and data backup is supported to prevent data loss;

(4) Fault tolerance: nodes in the cluster are allowed to fail (with a replication factor of n, up to n-1 nodes may fail);

(5) High concurrency: supports thousands of clients reading and writing simultaneously;

(6) Supports both real-time online processing and offline processing: a real-time stream processing system such as Storm can process messages as they arrive, while a batch processing system such as Hadoop can process them offline;

Usage scenarios for Kafka

(1) Log collection: a company can use Kafka to collect logs from various services and expose them through Kafka as a unified interface to a variety of consumers, such as Hadoop, HBase, Solr, and so on;

(2) Messaging system: decoupling producers and consumers, buffering messages, and so on;

(3) User activity tracking: Kafka is often used to record the various activities of web or app users, such as browsing pages, searching, and clicking. These events are published by various servers to Kafka topics, and subscribers consume those topics for real-time monitoring and analysis, or load the data into Hadoop or a data warehouse for offline analysis and mining;

(4) Operational metrics: Kafka is also often used to record operational monitoring data, including collecting metrics from various distributed applications and producing centralized feeds of operational data such as alerts and reports;

(5) Stream processing: for example, with Spark Streaming and Storm;

(6) Event sourcing.