Hi, I’m Kafka. Many of you have probably heard of me and know that I was born at LinkedIn in 2011. Since then I have grown more and more powerful. As a complete platform, I can store huge amounts of data redundantly, and my message bus offers high throughput (millions of messages per second), letting you stream data through me in real time. If you think that’s all I can do, you’re only scratching the surface.

That description is fine, but it doesn’t get to the heart of what I am. I’ll give you a few key words: distributed, horizontally scalable, fault-tolerant, commit log.

I’ll explain what these abstract words mean and tell you how I work.

Internal monologue: originally I wanted to write this whole article in the first person, but I found I could only manage the paragraphs above, so rather than embarrass myself any further I’ll switch to the third person from here on.

Distributed

A distributed system consists of several machines that work together as a cluster and appear to the end user as a single node.

Kafka is distributed in exactly this sense: it stores, receives, and sends messages on different nodes (known as brokers), which gives it high scalability and fault tolerance.

Horizontal scalability

First, let’s look at what vertical scaling is. If you have a traditional database server and it starts to get overloaded, the usual fix is to give that server more resources (CPU, memory, SSD). This is called vertical scaling. It has two big disadvantages:

  1. Hardware has limits; you cannot keep upgrading a single machine indefinitely
  2. It usually requires downtime, which many companies can’t tolerate

Horizontal scaling solves the same problem by adding more machines, with no downtime and no hard limit on the number of machines in the cluster. The catch is that not all systems can scale horizontally, because not all of them are designed to run as a cluster (where coordination is more complex).

Fault tolerance

The most lethal problem in a non-distributed system is the single point of failure: if your only server fails, the whole system goes down.

Distributed systems are designed so that failure can be tolerated through configuration. A five-node Kafka cluster can keep working even if two of the nodes fail. Note that fault tolerance trades off against performance: the more fault tolerant you make the system, the slower it tends to be.

Commit log

A commit log (also known as a write-ahead log or transaction log) is a persistent, ordered, append-only data structure: you cannot modify or delete records, and it is read from left to right, which preserves the order of entries. Surprised that Kafka’s core data structure is this simple?

Yes, in many ways this data structure is at the heart of Kafka. Its records are immutable and ordered, and that ordering makes processing deterministic. Both properties are extremely important in distributed systems.

Kafka actually stores all messages on disk and, by keeping them ordered in this log structure, takes advantage of sequential disk access. This gives it two properties:

  1. Both reads and writes are constant time, O(1) (given a known record offset), which is a huge advantage over the O(log N) operations of other on-disk structures, because every disk seek is expensive.
  2. Reads and writes do not interfere with each other: writes do not block reads, and vice versa.

These two points bring a huge advantage: performance is completely decoupled from data size. Kafka performs the same whether you have 100 KB or 100 TB of data on your server.

How it works

Producers send records to Kafka brokers, and those messages are processed by consumers. Messages are stored in topics, and consumers subscribe to topics to receive new messages. Doesn’t that feel a lot like the code you write every day, the classic producer-consumer model?
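To make the model concrete, here is a minimal producer sketch using the Java client. The broker address, topic name, key, and value are placeholder assumptions; the printed metadata also previews the partition and offset concepts discussed next.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical topic, key, and value used only for illustration.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("user-messages", "zhangsan", "hello");
            // send() is asynchronous; get() blocks until the broker acknowledges the write.
            RecordMetadata metadata = producer.send(record).get();
            System.out.printf("written to partition=%d at offset=%d%n",
                metadata.partition(), metadata.offset());
        }
    }
}
```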

As topics grow very large, they are divided into smaller partitions for better performance and scalability (for example, if a topic stores messages sent between users, you could partition it by the first letter of the user name). Kafka guarantees that all messages within a partition are ordered in the sequence they arrived. A particular message is identified by its offset, which you can think of as an ordinary array index: a sequence number that increments by one for each new message in the partition.

Kafka follows the principle of a dumb broker and smart consumers. The broker does not track which records consumers have read and then delete them; instead, it retains messages for a configured period (say, one day, controlled by the log.retention.* settings) or until a size threshold is reached. Consumers poll Kafka for new messages themselves and tell it which records they want to read. This lets them move their offset forward or backward at will, so events can be replayed and reprocessed.

It is important to note that consumers belong to a consumer group, and a consumer group contains one or more consumers. To avoid two processes reading the same message twice, each partition is consumed by only one consumer within a group.
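The consumer side, under the same placeholder assumptions as above (broker address, hypothetical group name and topic), might look like the sketch below; note that the consumer pulls records itself and can rewind its offset to replay them.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "greeting-service");         // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // Start from the earliest retained offset when the group has no committed position yet.
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-messages"));
            while (true) {
                // The consumer pulls records; the broker does not push them.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        record.partition(), record.offset(), record.key(), record.value());
                }
                // To replay everything from the start, you could instead rewind:
                // consumer.seekToBeginning(consumer.assignment());
            }
        }
    }
}
```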

Persisting to disk

As mentioned earlier, Kafka actually stores all records on disk and keeps essentially nothing in RAM itself. You might wonder how that can possibly be a sensible choice; a series of optimizations behind the scenes makes it work:

  1. Kafka’s protocol groups messages into batches, so a single network request carries many messages at once and the per-message network overhead drops. The broker in turn persists a whole batch in one go, and consumers fetch large linear chunks at a time (see the producer settings sketched below).
  2. Linear reads and writes on disk are very fast. The perception that modern disks are slow comes from workloads full of random seeks; it does not apply to large linear operations.
  3. The operating system heavily optimizes linear access through read-ahead (prefetching data in large block multiples) and write-behind (grouping small logical writes into large physical writes).
  4. The operating system caches disk files in free RAM. This is called the page cache, and both Kafka reads and writes make heavy use of it:
    1. Writes go from the Java process into the page cache and are then flushed from the page cache to disk by an asynchronous thread
    2. If a message is already in the page cache, it is sent to the socket directly; if it is not, it is loaded from disk into the page cache and then sent from there
  5. Because Kafka keeps messages in the same standardized binary format along the whole path (producer -> broker -> consumer), it can use the zero-copy optimization: the operating system copies data directly from the page cache to the socket, effectively bypassing the Kafka broker process entirely.

All of these optimizations enable Kafka to deliver messages at near-network speeds.
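To make the batching point (item 1 above) concrete, here is a sketch of the Java producer settings that control it. The values are arbitrary examples under assumed conditions, not recommendations.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerConfig {
    public static Properties batchingProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Batching: wait up to 10 ms to fill a batch of up to 32 KB before sending.
        props.put("linger.ms", "10");
        props.put("batch.size", String.valueOf(32 * 1024));
        // Compress whole batches to reduce network and disk usage further.
        props.put("compression.type", "lz4");
        return props;
    }
}
```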

Data distribution and replication

Let’s talk about how Kafka is fault-tolerant and how it distributes data between nodes.

Partition data is replicated across multiple brokers in order to preserve the data when a broker fails.

At any time, one broker owns the copy of a partition that applications read from and write to; it is called the partition leader. The leader replicates the data it receives to N other brokers, called followers, which also store the data and stand ready to be elected leader if the leader node dies.

This ensures that successfully published messages are not lost, and by adjusting the replication factor you can trade performance for greater durability, depending on how important the data is.
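As a rough illustration, a replicated topic can be created with the AdminClient; the topic name, partition count, and replication factor below are made-up example values.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 3 partitions, each replicated to 3 brokers.
            NewTopic topic = new NewTopic("user-messages", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```

On the producer side, setting acks=all makes the leader wait until its in-sync replicas have the record before acknowledging it, trading some latency for durability.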

But you might ask: to write to or read from a partition, a producer or consumer needs to know which broker is its leader, right? This information must be available somewhere, and Kafka stores such metadata in ZooKeeper.

What is ZooKeeper?

ZooKeeper is a distributed key-value store. It is highly optimized for reads, but writes are slower. It is most commonly used to store metadata and to handle cluster coordination (heartbeats, distributing updates and configuration, and so on).

It lets its clients (the Kafka brokers) subscribe to changes and be notified when they occur, which is how Kafka knows when to switch partition leaders. ZooKeeper itself runs as a cluster, so it is highly fault tolerant, and it needs to be, because Kafka depends on it heavily.

ZooKeeper stores all of this metadata, including but not limited to the following:

  • Consumer group offsets for each partition (although newer clients now store offsets in a separate Kafka topic)
  • ACLs (access control lists), used for permission control
  • Producer/consumer quotas: limits on the amount of data produced/consumed per second (see Kafka’s quota / flow-control feature)
  • Partition leaders and their health information

How do producers and consumers know who the partition leader is?

Producers and consumers used to connect directly to ZooKeeper for this information, but Kafka removed that tight coupling in versions 0.8 and 0.9. Clients now fetch the metadata directly from a Kafka broker, and the brokers get it from ZooKeeper.
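For example, the AdminClient (assuming a placeholder broker address and the hypothetical topic from earlier) can show which broker currently leads each partition, using that same metadata path.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class ShowPartitionLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin.describeTopics(
                    Collections.singletonList("user-messages"))
                .all().get().get("user-messages");
            // Print the leader, replicas, and in-sync replicas for each partition.
            description.partitions().forEach(p ->
                System.out.printf("partition %d: leader=%s, replicas=%s, isr=%s%n",
                    p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```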

Streaming

In Kafka, a stream processor takes a continuous stream of data from an input topic, performs some processing on that input, and generates a stream of data for output to other topics (or external services, databases, containers, and so on).

What is a data stream? A data stream is an abstraction over an unbounded data set. Unbounded means infinite and ever-growing: the data set keeps growing because new records keep arriving over time. Events such as credit card transactions or stock trades are naturally represented as data streams.

For simple processing you can use the producer/consumer APIs directly, but for more complex transformations, such as joining streams together, Kafka provides an integrated Streams API library.

This API runs inside your own application code, not on the brokers, and it works much like the consumer API in that it lets you scale stream processing across multiple application instances (similar to consumer groups).

Stateless processing

Stateless processing of a stream is deterministic processing that does not depend on any external state: for a given input it always produces the same output, regardless of anything else. For example, a simple transformation such as "zhangsan" -> "Hello,zhangsan".
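A minimal Kafka Streams sketch of exactly this stateless transformation, assuming hypothetical input and output topics ("names" and "greetings") and a placeholder broker address:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class GreetingStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "greeting-app");      // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> names = builder.stream("names");  // hypothetical input topic
        // Stateless transformation: "zhangsan" -> "Hello,zhangsan"
        names.mapValues(name -> "Hello," + name)
             .to("greetings");                                     // hypothetical output topic
        new KafkaStreams(builder.build(), props).start();
    }
}
```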

Stream-table duality

It is important to realize that streams and tables are essentially two sides of the same thing: a stream can be interpreted as a table, and a table can be interpreted as a stream.

Stream as a table

A stream can be interpreted as a series of updates to the data, and aggregating those updates yields the table's final state. This technique is known as event sourcing.

If you are familiar with how databases keep replicas in sync, you know the technique is called streaming replication: every change to a table is shipped to the replica servers. Examples include the AOF in Redis and the binlog in MySQL.

A Kafka stream can be interpreted the same way: accumulating the events as they occur forms a final state. Such a stream aggregation is stored locally in RocksDB (by default) and is called a KTable.
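A sketch of such an aggregation with the Streams API, assuming a hypothetical "words" input topic: counting the stream yields a KTable whose state is materialized in a local RocksDB store.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class WordCountTable {
    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        // Hypothetical topic keyed by word.
        KStream<String, String> words =
            builder.stream("words", Consumed.with(Serdes.String(), Serdes.String()));
        // Aggregating the stream yields a KTable; its state lives in a local
        // RocksDB store ("word-counts") backed by a changelog topic in Kafka.
        KTable<String, Long> counts = words
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("word-counts"));
        // The table can be turned back into a stream of updates.
        counts.toStream().to("word-counts-output", Produced.with(Serdes.String(), Serdes.Long()));
        return builder;
    }
}
```

The toStream() call at the end already hints at the table-as-a-stream direction described below.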

Table as a stream

You can think of a table as a snapshot of the latest value for each key in the stream. Just as stream records can build up a table, table updates can produce a changelog stream.

Stateful processing

Operations such as map() or filter() are stateless: they do not require you to keep any data around. In practice, though, most operations are stateful (count(), for example), because they need to store the current accumulated state.

The problem with keeping state inside a stream processor is that the stream processor can fail! Where should that state live so that it is fault-tolerant?

One simple approach is to store all the state in a remote database and reach it over the network. The problem is that all those network round trips can slow your application down. A more subtle but important issue is that the uptime of your stream-processing job becomes tightly coupled to the remote database, so the job is no longer self-contained (a change to the database by another team could break your processing).

So what is a better way? Recall the duality of tables and streams: it lets us materialize the stream into a table co-located with our processing, and it gives us a fault-tolerance mechanism, because the stream behind that table is stored in the Kafka brokers.

The stream processor keeps its state in a local table (in RocksDB, for example), which is updated from an input stream (possibly after some arbitrary transformation). When the process fails, it can recover its state by replaying that stream.
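For illustration, with recent Kafka Streams versions (2.5 and later) that local state can also be queried directly from the running application; the store name below matches the earlier aggregation sketch and is otherwise a hypothetical example.

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class QueryLocalStore {
    // Reads the current count for one key from the local RocksDB store
    // named "word-counts" (created in the earlier aggregation sketch).
    public static Long countFor(KafkaStreams streams, String word) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
            StoreQueryParameters.fromNameAndType("word-counts",
                QueryableStoreTypes.<String, Long>keyValueStore()));
        return store.get(word);
    }
}
```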

You can even use a remote database as the producer of such a stream, effectively broadcasting a changelog that is used to rebuild the table locally.

KSQL

Normally you have to write stream processing in a JVM language, since that is the only official Kafka Streams API client. In April 2018, KSQL was released, a feature that lets you write simple stream jobs in a familiar SQL-like language. You install a KSQL server and use its CLI to query and manage it interactively. It uses the same abstractions (KStream and KTable), gives the same guarantees as the Streams API (scalability, fault tolerance), and greatly simplifies stream-processing work.

This may not sound like much, but in practice it makes it much easier to try things out, and it even lets people outside of development (product owners, for example) work with stream processing. Check out Confluent's article on getting started with KSQL.

When to use Kafka

As we have seen, Kafka lets you push a huge number of messages through a centralized medium and store them without worrying about performance or data loss. That makes it ideally suited to sit at the core of a system architecture, acting as a central hub that connects different applications. Kafka can be the heart of an event-driven architecture, letting you truly decouple applications from one another and cleanly separate communication between (micro)services. With the Streams API, it is now easier than ever to write business logic that enriches Kafka topic data for other services to consume. The possibilities are vast, and I encourage you to explore how companies are using Kafka.

Conclusion

Apache Kafka is a distributed streaming platform capable of handling trillions of events a day. Kafka provides a low-latency, high-throughput, fault-tolerant publish-subscribe pipeline and the ability to process streams of events. We reviewed its basic semantics (producers, brokers, consumers, topics), looked at some of its optimizations (the page cache), saw how it achieves fault tolerance by replicating data, and introduced its increasingly powerful streaming capabilities. Kafka has been widely adopted by thousands of companies around the world, including a third of the Fortune 500. With Kafka under active development and its first major version, 1.0, recently released (November 1, 2017), it is predicted that streaming platforms will become as important a core data platform as relational databases. I hope this introduction has helped you get familiar with Apache Kafka.

References

  • tech.meituan.com/2015/01/13/…
  • shiyueqi.github.io/2017/04/27/…
  • kafka.apache.org/documentati…
  • docs.confluent.io/current/