We are pleased to announce that StreamNative and OVHcloud are open-sourcing “KoP” (Kafka on Pulsar). KoP introduces a Kafka protocol processing plug-in into the Pulsar broker, giving Apache Pulsar native support for the Apache Kafka protocol. By adding the KoP protocol processing plug-in to an existing Pulsar cluster, users can migrate existing Kafka applications and services to Pulsar without changing any code. Kafka applications can then take advantage of Pulsar’s powerful features, such as:

  • Simplifying operations with enterprise-grade multi-tenancy.
  • Avoiding data migration between brokers.
  • Persisting event streams with Apache BookKeeper and tiered storage.
  • Processing events serverlessly with Pulsar Functions.

What is Apache Pulsar?

Apache Pulsar is an event streaming platform. From the start, Pulsar was built on a cloud-native, layered, segment-based architecture that separates serving from storage, making the system container-friendly. Pulsar’s cloud-native architecture is highly scalable, consistent, and resilient, enabling companies to scale their businesses with real-time data solutions. Pulsar has been widely adopted since it was open-sourced in 2016 and became a top-level Apache project in 2018.

The motivation for KoP

Pulsar provides a unified messaging model for both queuing and streaming workloads. Pulsar implements its own protobuf-based binary protocol to ensure high performance and low latency, and protobuf makes it convenient to implement Pulsar clients: the project officially supports Java, Go, Python, and C++ clients, alongside third-party clients provided by the community. However, applications written against other messaging protocols have to be rewritten before they can adopt Pulsar’s unified messaging protocol.

To address this, the Pulsar community has developed tools to ease the migration of applications from other messaging systems to Pulsar. For example, Pulsar provides a Kafka wrapper on top of the Kafka Java API, which lets users switch a Kafka Java client application from Kafka to Pulsar without code changes. Pulsar also offers a rich connector ecosystem for connecting Pulsar to other data systems. Still, there was strong demand from users who wanted to move their other Kafka applications to Pulsar without rewriting them.

StreamNative and OVHcloud partnership

StreamNative receives a large number of inbound requests for help migrating from other messaging systems to Pulsar, and recognized the need to support other messaging protocols (such as AMQP and Kafka) natively on Pulsar. As a result, StreamNative began work on bringing a general-purpose protocol processing plug-in framework into Pulsar. The framework allows developers using other messaging protocols to use Pulsar.

OVHcloud has been running Apache Kafka for years. Despite their experience operating multiple Kafka clusters processing millions of messages per second, they faced formidable operational challenges. For example, without multi-tenancy, it was difficult for them to host thousands of topics for thousands of users in a single cluster.

So OVHcloud decided to move its topic-as-a-service product, ioStream, from Kafka to Pulsar and build the product on top of it. Compared to Kafka, Pulsar supports multi-tenancy, and its overall architecture includes Apache BookKeeper, which helps simplify operations.

After initial experiments, OVHcloud decided to build a proof of concept for KoP as a proxy that converts the Kafka protocol to Pulsar on the fly. In the process, OVHcloud noticed that StreamNative was working on bringing the Kafka protocol natively into Pulsar, so the two teamed up to develop KoP.

KoP is designed to take advantage of Pulsar and BookKeeper’s event stream storage architecture and Pulsar’s pluggable protocol processing framework to provide a streamlined yet comprehensive solution. KoP is a protocol processing plug-in named “kafka” that is loaded into and runs with the Pulsar broker.

Distributed log

When it comes to logs, Pulsar and Kafka share a very similar data model for publishing and subscribing to messages and event streams: both are built on distributed logs. The main difference between the two systems is how they implement the distributed log. Kafka uses a partition-based architecture, storing each distributed log (a Kafka partition) on a fixed set of brokers. Pulsar uses a segment-based architecture, with Apache BookKeeper as its scale-out segment storage layer for the distributed logs. Pulsar’s segment-based architecture helps avoid data migration, achieves high scalability, and stores event streams durably. For more information on the main differences between Pulsar and Kafka, see the Splunk blog and the BookKeeper project blog.

Because Pulsar and Kafka are built on a similar data model (the distributed log), and because Pulsar offers a distributed log store and a pluggable protocol processing framework (introduced in 2.5.0), it is straightforward to implement a Kafka-compatible protocol processing plug-in on Pulsar.

Implementation

By comparing Pulsar and Kafka, we see a lot of similarities between the two systems. Both systems include the following operations:

  • Topic lookup: clients connect to any broker to look up a topic’s metadata (i.e., its owner broker). After obtaining the metadata, the client establishes a persistent TCP connection with the owner broker.
  • Publish: the client talks to the owner broker of a topic partition to append messages to the distributed log.
  • Consume: the client talks to the owner broker of a topic partition to read messages from the distributed log.
  • Offset: each message published to a topic partition is assigned an offset. In Pulsar, the offset is called a MessageId. Consumers can use offsets to seek to a given position in the log and read messages from there.
  • Consumption state: both systems maintain the consumption state of the consumers (in Kafka, consumer groups) of a subscription. Kafka stores the consumption state in the __consumer_offsets topic, while Pulsar stores it in cursors.

As you can see, these are the primitive operations provided by a scale-out distributed log store such as Apache BookKeeper, and Pulsar’s core functionality is implemented on top of Apache BookKeeper. As a result, we can implement the Kafka concepts simply and directly using the existing components Pulsar has built on BookKeeper.

The following figure illustrates how we added Kafka protocol support to Pulsar: we introduce a new protocol processing plug-in that uses Pulsar’s existing components (topic discovery, the ManagedLedger distributed log, cursors, and so on) to implement the Kafka wire protocol.

Topic

Kafka stores all topics in a flat namespace, whereas Pulsar stores topics in a hierarchical, multi-tenant namespace. We added a Kafka namespace setting to the broker configuration so that administrators can map Kafka topics to Pulsar topics. To make it easier for Kafka users to use Apache Pulsar’s multi-tenancy, Kafka users can specify a Pulsar tenant and namespace as their SASL username when authenticating a Kafka client with the SASL mechanism.
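As a sketch of what that mapping looks like, the following Python function expands a flat Kafka topic name into a fully qualified Pulsar topic. The function name and the default tenant/namespace are illustrative assumptions, not KoP’s actual configuration keys:

```python
def to_pulsar_topic(kafka_topic, tenant="public", namespace="kafka"):
    """Map a flat Kafka topic name onto Pulsar's hierarchical,
    multi-tenant namespace: persistent://tenant/namespace/topic."""
    # A fully qualified Pulsar topic name passes through unchanged;
    # a bare Kafka name lands in the configured default namespace.
    if kafka_topic.startswith("persistent://"):
        return kafka_topic
    return f"persistent://{tenant}/{namespace}/{kafka_topic}"
```

With this scheme, a Kafka client producing to `orders` would transparently write to `persistent://public/kafka/orders`, while a client authenticating with a SASL username such as `acme/prod` could be routed into that tenant and namespace instead.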

Message ID and offset

Kafka assigns an offset to each message successfully published to a topic partition, while Pulsar assigns each message a MessageId, which consists of a ledger id, an entry id, and a batch index. We use the same method as the Pulsar-Kafka wrapper to convert a Pulsar MessageId to an offset and vice versa.
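To illustrate the idea, the three MessageId components can be packed into a single 64-bit integer and unpacked again. The bit widths below are illustrative assumptions for the sketch, not KoP’s actual encoding:

```python
# Illustrative split of a 64-bit offset into the three MessageId parts.
LEDGER_BITS, ENTRY_BITS, BATCH_BITS = 32, 24, 8

def message_id_to_offset(ledger_id, entry_id, batch_index):
    """Pack (ledger id, entry id, batch index) into one Kafka-style offset."""
    assert 0 <= ledger_id < (1 << LEDGER_BITS)
    assert 0 <= entry_id < (1 << ENTRY_BITS)
    assert 0 <= batch_index < (1 << BATCH_BITS)
    return (ledger_id << (ENTRY_BITS + BATCH_BITS)) | (entry_id << BATCH_BITS) | batch_index

def offset_to_message_id(offset):
    """Reverse the packing to recover the Pulsar MessageId components."""
    batch_index = offset & ((1 << BATCH_BITS) - 1)
    entry_id = (offset >> BATCH_BITS) & ((1 << ENTRY_BITS) - 1)
    ledger_id = offset >> (ENTRY_BITS + BATCH_BITS)
    return ledger_id, entry_id, batch_index
```

Because the packing is monotonic per ledger and entry, offsets produced this way preserve the ordering Kafka clients expect within a partition.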

Messages

Both Kafka and Pulsar messages contain a key, a value, a timestamp, and headers (called ‘properties’ in Pulsar). We convert these fields between Kafka messages and Pulsar messages automatically.
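A minimal sketch of that field mapping, with plain dicts standing in for the Kafka record and the Pulsar message (the field names are illustrative, not KoP’s internal API):

```python
def kafka_record_to_pulsar(record):
    """Map the common fields of a Kafka record onto a Pulsar message.

    `record` is a plain dict standing in for a Kafka record; the
    returned dict stands in for a Pulsar message.
    """
    return {
        "key": record.get("key"),
        "value": record["value"],
        "event_time": record.get("timestamp"),
        # Kafka headers (a list of key/value pairs) become
        # Pulsar message properties.
        "properties": dict(record.get("headers") or []),
    }
```

The reverse direction is symmetric, which is what lets Pulsar consumers read messages published by Kafka producers and vice versa.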

Topic lookup

We provide the same topic lookup mechanism for the Kafka request processing plug-in as for Pulsar. The request processing plug-in looks up the topic, finds the owner broker of the requested topic partitions, and returns a Kafka TopicMetadata response containing the ownership information to the Kafka client.

Message publishing

When a Kafka client publishes a message, the Kafka request processing plug-in converts the Kafka message into a Pulsar message by mapping the fields (key, value, timestamp, and headers) one by one, and uses the ManagedLedger append API to store the converted message in BookKeeper. Because Kafka messages are converted into Pulsar messages, existing Pulsar applications can receive messages published by Kafka clients.

Message consumption

When a consume request is received from a Kafka client, the Kafka request processing plug-in opens a non-durable cursor and reads entries starting from the requested offset. Because Pulsar messages are converted back into Kafka messages, existing Kafka applications can receive messages published by Pulsar clients.

Group coordinator and offset management

The biggest challenge was implementing the group coordinator and offset management. Pulsar does not have a centralized group coordinator that assigns partitions to the consumers in a consumer group or manages offsets for each consumer group. Instead, the Pulsar broker manages partition assignment on a per-partition basis, and the owner broker of a partition manages offsets by storing acknowledgements in cursors.

It is difficult to make the Pulsar model match the Kafka model exactly. Therefore, to be fully compatible with Kafka clients, we implemented the Kafka group coordinator by storing coordinator group changes and offsets in a Pulsar system topic named public/kafka/__offsets. This bridges the gap between Pulsar and Kafka and lets users manage subscriptions and monitor Kafka consumers with existing Pulsar tools and policies. We added a background thread to the group coordinator implementation that periodically syncs offset updates from the system topic to the Pulsar cursors. In effect, a Kafka consumer group is treated as a Pulsar subscription, so all existing Pulsar tools can also be used to manage Kafka consumer groups.
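The offset-sync idea can be sketched with plain dicts standing in for the system topic and the cursors. This is a toy model under those assumptions, not KoP’s implementation:

```python
class OffsetSyncer:
    """Toy sketch: mirror consumer-group offsets committed to a system
    topic into per-subscription cursor positions. Both stores are plain
    dicts here; the real system uses a Pulsar topic and Pulsar cursors."""

    def __init__(self):
        self.system_topic = {}  # (group, partition) -> committed offset
        self.cursors = {}       # (group, partition) -> cursor position

    def commit(self, group, partition, offset):
        # A Kafka-style OffsetCommit lands in the system topic first.
        self.system_topic[(group, partition)] = offset

    def sync(self):
        # Periodic background pass: advance each cursor to the latest
        # committed offset, so Pulsar tooling sees the consumer group
        # as an ordinary subscription.
        for key, offset in self.system_topic.items():
            if self.cursors.get(key, -1) < offset:
                self.cursors[key] = offset
```

Running the sync only forward (never moving a cursor backwards) mirrors how acknowledgement state normally advances in a subscription.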

Connect two popular messaging ecosystems

StreamNative and OVHcloud both value customer success. We believe that providing the native Kafka protocol on Apache Pulsar will help users who adopt Pulsar achieve business success faster. KoP integrates two popular event streaming ecosystems, unlocking new use cases. Customers can take advantage of both ecosystems by building a truly unified event streaming platform with Apache Pulsar to accelerate the development of real-time applications and services.

KoP enables log collectors to keep collecting log data from their sources and publishing messages to Apache Pulsar through their existing Kafka integrations. Downstream applications can use Pulsar Functions to process events arriving in the system, enabling serverless event streaming.

Trying out KoP

KoP is licensed under Apache License v2. The project address is github.com/streamnativ… KoP has been built into the StreamNative Platform; you can download the StreamNative Platform to try out all of KoP’s features. If you are already running a Pulsar cluster and want it to support Kafka, you can install the KoP protocol processing plug-in into your existing Pulsar cluster. For more information, refer to the instructions.

If you want to learn more about KoP, refer to its code and documentation. We look forward to your issues and PRs. You can also join the #kop channel in Pulsar Slack to discuss all things Kafka-on-Pulsar.

StreamNative and OVHcloud will host a webinar on KoP on March 31. For more details, click Register. We look forward to seeing you online.

Thank you

StreamNative initiated the KoP project, and the OVHcloud team joined it later; the two teams developed KoP together. Many thanks to Pierre Zemb and Steven Le Roux of OVHcloud for their contributions to this project!