More and more messaging platforms are adopting real-time streaming technology, which is driving the adoption and growth of Pulsar. In 2020, Pulsar's visibility and usage increased significantly. Interest from companies building messaging platforms and event-streaming applications, from Fortune 100 companies to forward-thinking startups, has fueled Pulsar's momentum, and the ecosystem around the project has grown rapidly, attracting recent media coverage.
Recent news and blog posts have provided objective introductions to Pulsar, giving readers a clear understanding of its performance and use cases. Verizon Media, Iterable, Nutanix, Overstock.com, and others have also recently published their Pulsar use cases and shared ideas on how to achieve business goals with Pulsar.
However, not all of this coverage has been accurate. In addition, members of the Pulsar community have asked us to respond to a recent Confluent technical post on Kafka, Pulsar, and RabbitMQ. Pulsar has fortunately evolved quickly into a genuinely transformative technology, and we would like to take this opportunity to explore its capabilities.
This article provides in-depth information on Pulsar's technology, community, and ecosystem, in order to present an objective and comprehensive picture of the event-streaming landscape. This article, the first in a two-part series, compares Pulsar and Kafka in terms of performance, architecture, and features. The next part focuses on Pulsar's use cases, support, and community.
Note that since Kafka’s available documentation is more comprehensive and widely known, this article will focus on the relatively basic and less documented aspects of Pulsar so far.
Overview
Components
Pulsar consists of three main components: the broker, Apache BookKeeper, and Apache ZooKeeper. The broker is a stateless service to which clients connect for core messaging. BookKeeper and ZooKeeper are stateful services. BookKeeper nodes (bookies) store messages and cursors, while ZooKeeper is used only to store metadata for brokers and bookies. In addition, BookKeeper uses RocksDB as an embedded database for its internal indexes; RocksDB is managed as part of BookKeeper rather than separately. Kafka uses a monolithic architecture that combines serving and storage, whereas Pulsar uses a multi-tier architecture in which each layer can be managed independently: brokers perform computation on one layer, while bookies manage stateful storage on another.
Pulsar's multi-tier architecture may seem more complex than Kafka's monolithic architecture, but the reality is more nuanced. Architectural design requires trade-offs, and BookKeeper makes Pulsar more scalable, less operationally cumbersome, faster, and more consistent. We discuss these points in detail later in the article.
Storage architecture
Pulsar's multi-tier architecture shapes how it stores data. Pulsar splits topic partitions into segments, which are then stored on Apache BookKeeper storage nodes to improve performance, scalability, and availability. Pulsar's infinitely distributed log is segment-centric, implemented with scaled-out log storage (via Apache BookKeeper) and built-in tiered-storage support, so segments can be evenly distributed across storage nodes. Since the data for any given topic is not tied to a specific storage node, it is easy to replace storage nodes or scale them out. In addition, the smallest or slowest node in the cluster is never a bottleneck for storage capacity or bandwidth.
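A minimal sketch (not Pulsar's actual placement logic; all names are illustrative) of why segment-centric storage makes nodes interchangeable: each new segment of a topic's log is placed on the next ensemble of bookies, so a newly added bookie can receive writes immediately and no existing segment has to move.

```python
# Illustrative sketch: round-robin placement of topic-log segments onto
# interchangeable storage nodes (bookies). Not Pulsar's real algorithm.

def assign_segments(num_segments, bookies, ensemble_size=2):
    """Place each new segment on the next `ensemble_size` bookies."""
    placement = {}
    for seg in range(num_segments):
        start = seg % len(bookies)
        placement[seg] = [bookies[(start + i) % len(bookies)]
                          for i in range(ensemble_size)]
    return placement

# A 3-bookie cluster stores 6 segments of one topic, spread evenly.
placement = assign_segments(6, ["bookie-1", "bookie-2", "bookie-3"])

# After adding bookie-4, new segments can land on it immediately; no
# existing segment is tied to a specific node, so nothing is rebalanced.
placement_after = assign_segments(
    4, ["bookie-1", "bookie-2", "bookie-3", "bookie-4"])
```

Because no segment is pinned to a node, replacing or adding a bookie only affects where future segments go.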
Pulsar’s architecture is non-partitioned and non-rebalanced, ensuring timely scalability and high availability. These two important features make Pulsar especially useful for building mission-critical services such as billing platforms for financial use cases, transaction processing systems for e-commerce and retailers, and real-time risk control systems for financial institutions.
By taking advantage of the powerful Netty framework, data transfer from producers to brokers to bookies is zero-copy end to end. This is very friendly to all streaming use cases, because data moves over the network and onto disk without any copying penalty.
Message consumption
Pulsar's consumption model uses a streaming-pull approach. Streaming pull is an improvement over long polling: it not only eliminates the wait between individual calls and requests, but also provides a two-way message flow. With the streaming-pull model, Pulsar achieves lower end-to-end latency than long-polling messaging systems such as Kafka.
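The streaming-pull idea can be sketched with a toy model (illustrative only, not Pulsar's wire protocol): the consumer grants a batch of permits up front, and the broker streams messages for as long as permits remain, instead of waiting for a request per batch as in classic long polling.

```python
# Illustrative sketch of permit-based streaming pull. The class and field
# names are made up for the example; only the idea mirrors Pulsar.

class Broker:
    def __init__(self, messages):
        self.queue = list(messages)

    def stream(self, consumer):
        # Push while the consumer has outstanding permits; no per-message poll.
        while self.queue and consumer.permits > 0:
            consumer.deliver(self.queue.pop(0))

class Consumer:
    def __init__(self, receiver_queue_size=3):
        self.permits = receiver_queue_size  # permits granted up front
        self.received = []

    def deliver(self, msg):
        self.permits -= 1
        self.received.append(msg)

    def ack(self, n):
        # Processing messages frees permits, so the broker keeps streaming.
        self.permits += n

broker = Broker(["m1", "m2", "m3", "m4", "m5"])
consumer = Consumer(receiver_queue_size=3)

broker.stream(consumer)  # broker pushes m1..m3 until permits are exhausted
consumer.ack(2)          # consumer processes two messages, grants 2 permits
broker.stream(consumer)  # broker immediately pushes m4 and m5
```

The broker never waits for a fresh poll before sending the next message, which is where the latency win over long polling comes from.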
Ease of use
Simple operations
When evaluating the ease of operation of a particular technology, consider not only initial setup, but also long-term maintenance and scalability. Here are some things to consider:
- To keep up with the growth of services, is it quick and easy to scale the cluster?
- Is the cluster available to multiple tenants (corresponding to multiple teams and users) out of the box?
- Do O&M tasks (such as replacing hardware) affect service availability and reliability?
- Is it easy to replicate data for geographic redundancy or different access patterns?
Long-term Kafka users will find that none of these questions is easy to answer when operating Kafka. Most of these tasks require additional tools beyond Kafka itself, such as Cruise Control for managing cluster rebalancing and Kafka MirrorMaker for replication.
Because Kafka is difficult to share across teams, many organizations have developed tools to support and manage multiple separate clusters. These tools are critical to using Kafka successfully at scale, but they also add to Kafka's complexity. The best tools for managing Kafka clusters are commercial software, not open source. It is no surprise, then, that Kafka's complex management and operations have led many businesses to buy Confluent's commercial services instead.
Pulsar, by contrast, is designed to be simple to operate and to scale. Based on Pulsar's capabilities, our answers to the questions above are as follows:
- To keep up with the growth of services, is it quick and easy to scale the cluster?
Pulsar's automatic load balancing puts newly added compute and storage capacity to use immediately: topics migrate between brokers to balance load, and new bookie nodes instantly receive write traffic for new data segments, with no need to manually rebalance or manage brokers.
- Is the cluster available to multiple tenants (corresponding to multiple teams and users) out of the box?
Pulsar’s hierarchical architecture allows tenants and namespaces to map logically to organizations or teams. Through this same organization, Pulsar supports simple ACLs, quotas, autonomous service control, and even resource isolation, allowing cluster users to easily manage shared clusters.
- Do O&M tasks (such as replacing hardware) affect service availability and reliability?
Replacing one of Pulsar's stateless brokers is simple, with no fear of data loss. Bookie nodes automatically re-replicate any under-replicated segments of data, and the tools for removing and replacing nodes are built in and easy to automate.
- Is it easy to replicate data for geographic redundancy or different access patterns?
Pulsar has built-in replication capabilities that can be used to seamlessly synchronize data across geographic areas or replicate data to other clusters for other functions (such as disaster recovery, analysis, and so on).
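The multi-tenancy answer above rests on Pulsar's resource hierarchy being visible directly in topic names, which follow the documented scheme `{persistent|non-persistent}://tenant/namespace/topic`. A minimal sketch (tenant and topic names below are made up for the example):

```python
# Illustrative sketch of Pulsar's topic-naming hierarchy, which maps
# tenants and namespaces onto organizations and teams.

def make_topic(tenant, namespace, topic, persistent=True):
    scheme = "persistent" if persistent else "non-persistent"
    return f"{scheme}://{tenant}/{namespace}/{topic}"

def parse_topic(name):
    scheme, rest = name.split("://", 1)
    tenant, namespace, topic = rest.split("/", 2)
    return {"scheme": scheme, "tenant": tenant,
            "namespace": namespace, "topic": topic}

# A finance team's billing topics live under their own tenant/namespace,
# which is also the unit for ACLs, quotas, and resource isolation.
billing_topic = make_topic("finance-team", "billing", "invoices")
parsed = parse_topic(billing_topic)
```

Because ACLs and quotas attach at the tenant and namespace levels, team boundaries in the naming scheme double as administrative boundaries in a shared cluster.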
Pulsar's features provide a more complete solution than Kafka's to real-world streaming-data problems. From this perspective, Pulsar has a more complete set of core features that are simple to use out of the box, letting users and developers focus on the core requirements of their business.
Documentation and Learning
Since Pulsar is newer than Kafka, its ecosystem is less developed, and documentation and training resources are still being added. However, Pulsar has made major strides here over the past year and a half. Here are some of the main results:
- Pulsar Summit Virtual Conference 2020, Pulsar’s first global Summit, featured 36 presentations from more than 25 organizations and over 600 registered attendees.
- In 2020, we created 50+ videos and training modules.
- Weekly Pulsar live streams and interactive tutorials.
- Professional training by industry leading lecturers.
- Monthly webinars with strategic business partners.
- Doodle, OVHCloud, Tencent, Yahoo! Japan and other use cases.
For more information on Pulsar documentation and training, see StreamNative’s Resources website.
Enterprise support
Both Kafka and Pulsar offer enterprise-level support. Several large vendors, including Confluent, provide enterprise support for Kafka. For Pulsar, enterprise support comes from StreamNative, which, though still a young company, provides a fully managed Pulsar cloud service and enterprise-grade Pulsar support.
The StreamNative team has deep experience in messaging and event streaming and is growing rapidly. StreamNative was founded by core members of Pulsar and BookKeeper. With the StreamNative team's help, the Pulsar ecosystem has grown dramatically in just a few years, and with the support of strategic partners, this growth will accelerate further to meet the needs of a large number of use cases (more on this in the next article).
Recent major advances in Pulsar include KoP (Kafka-on-Pulsar), launched by OVHCloud and StreamNative in March 2020. By adding the KoP protocol handler to an existing Pulsar cluster, users can migrate existing Kafka applications and services to Pulsar without code changes. In June 2020, China Mobile and StreamNative announced another major project, AoP (AMQP-on-Pulsar). AoP enables RabbitMQ applications to take advantage of Pulsar features such as scaled-out storage via Apache BookKeeper and tiered-storage support for infinite event streams. We will cover this in more detail in the next article.
Ecosystem integrations
With the rapid growth of Pulsar users, the Pulsar community has grown into a large, highly engaged, global community of users. The Pulsar ecosystem has seen a rapid growth in the number of plug-in tools around it, and the active Pulsar community has played an extremely important role in driving it. Over the past six months, the number of officially supported Connectors in the Pulsar ecosystem has grown dramatically.
To further support the Pulsar community, StreamNative recently launched StreamNative Hub, where users can find, download, and integrate connectors and plug-ins. This platform will help accelerate the growth of the Pulsar connector and plug-in ecosystem.
The Pulsar community also works closely with other communities on joint integrations. For example, the Pulsar and Flink communities have been collaborating on the Pulsar-Flink connector (part of FLIP-72). With the Pulsar-Spark connector, you can use Apache Spark to process events from Apache Pulsar. The SkyWalking Pulsar plug-in integrates Apache SkyWalking with Apache Pulsar, allowing users to trace messages via SkyWalking. Many more integration projects are in progress in the Pulsar community.
Multiple client libraries
Pulsar currently supports seven languages officially, while Kafka officially supports only one. The Confluent blog claims that Kafka supports 22 languages, but the official client supports far fewer, and some of the community clients are no longer maintained. At last count, the official Apache Kafka client supports only one language (Java), while the official Apache Pulsar clients cover seven:
- Java
- C
- C++
- Python
- Go
- .NET
- Node
Pulsar also supports many clients developed by the Pulsar community, such as:
- Rust
- Scala
- Ruby
- Erlang
Performance and availability
Throughput, latency, and capacity
Both Pulsar and Kafka are widely used across many enterprise use cases, and each has its own advantages, as both can handle large volumes of traffic with roughly the same amount of hardware. Some users mistakenly believe that because Pulsar uses more components, it needs more servers to match Kafka's performance. While this may hold for certain specific hardware configurations, in most comparable resource configurations Pulsar has the advantage, delivering more performance from the same resources.
For example, Splunk recently shared why they chose Pulsar over Kafka: thanks to its layered architecture, Pulsar gave them 1.5 to 2 times lower costs, 5 to 50 times lower latency, and 2 to 3 times lower operating costs (slide 34). The Splunk team found this was because Pulsar makes better use of disk I/O, has lower CPU utilization, and controls memory better.
Companies like Tencent chose Pulsar in large part for its performance characteristics. According to the white paper on Tencent's billing platform, the platform serves millions of users, manages about 30 billion escrow accounts, and currently uses Pulsar to process hundreds of millions of dollars in transactions per day. Tencent chose Pulsar over Kafka for its predictable low latency, stronger consistency, and durability guarantees.
Order guarantee
Apache Pulsar supports four subscription modes. Which mode an application uses is determined by its ordering and consumption-scalability requirements. Here are the four subscription modes and their associated ordering guarantees:
- The exclusive and failover subscription modes both provide strong ordering guarantees at the partition level, while enabling consumption of messages on the same topic across consumers.
- The shared subscription mode allows the number of consumers to scale beyond the number of partitions, which makes it a good fit for worker-queue use cases.
- The key_shared subscription mode combines the benefits of the other modes, allowing the number of consumers to scale beyond the number of partitions while providing strong ordering guarantees at the key level.
For more information about the Pulsar subscription pattern and related sorting guarantees, see Subscriptions.
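The key_shared guarantee in the last bullet can be sketched as follows (illustrative only: real Pulsar assigns hash ranges to consumers, and a stable CRC32 modulo stands in for that here): more consumers than partitions can share the work, yet every message with the same key lands on the same consumer, in order.

```python
import zlib

# Illustrative sketch of key_shared dispatch; names and the hashing
# scheme are made up for the example.

def dispatch_key_shared(messages, consumers):
    """messages: list of (key, value) pairs; consumers: list of names."""
    assignments = {c: [] for c in consumers}
    for key, value in messages:
        # Same key always hashes to the same consumer within a run.
        owner = consumers[zlib.crc32(key.encode()) % len(consumers)]
        assignments[owner].append((key, value))
    return assignments

msgs = [("user-a", 1), ("user-b", 1), ("user-a", 2),
        ("user-b", 2), ("user-a", 3)]
out = dispatch_key_shared(msgs, ["c1", "c2", "c3"])
```

Each key's messages end up on exactly one consumer, in their original order, which is the per-key ordering guarantee described above.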
Features
Built-in stream processing
Pulsar and Kafka take different approaches to built-in stream processing. For complex stream processing needs, Pulsar integrates with two mature stream-processing frameworks, Flink and Spark, and provides Pulsar Functions for lightweight computing. Kafka developed and uses Kafka Streams as its mature stream-processing engine.
However, Kafka Streams is more complicated to use. Users need to understand when and how a KStreams application applies, and KStreams is too complex for most lightweight computing use cases.
In contrast, Pulsar Functions makes lightweight computing use cases easy to implement and lets users create complex processing logic without deploying a separate neighboring system. Pulsar Functions supports native languages and easy-to-use APIs, so users can write event-streaming applications without having to learn complex APIs.
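To illustrate how lightweight this can be, here is a sketch in the shape of a native Pulsar Python function: the classic "exclamation" example from the Pulsar docs, expressed as a plain `process` function with no SDK dependency. Deployment details (e.g. via `pulsar-admin functions create`) are omitted.

```python
# A native-style Pulsar Python function: lightweight compute is just a
# plain function that receives each event and returns the output event.

def process(input):
    """Append an exclamation mark to each incoming string event."""
    return input + "!"
```

Because it is an ordinary function, the logic can be unit-tested locally before it is ever deployed to a cluster.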
Function Mesh was introduced in a recently submitted Pulsar Improvement Proposal (PIP). Function Mesh is a serverless event flow framework that combines multiple Pulsar Functions to facilitate building complex event flow applications.
Exactly-once processing
Currently, Pulsar supports an exactly-once producer via broker-side message deduplication. Broader exactly-once support is a major project under active development, so stay tuned!
Transactional message streams (proposed in PIP-31) are still under development. This feature will improve Pulsar's messaging semantics and processing guarantees: in a transactional stream, each message is written and processed exactly once, with no duplication or data loss even if a broker or Function instance fails. Transactional messaging not only simplifies writing applications with Pulsar or Pulsar Functions, but also broadens the range of use cases Pulsar supports. Development is well under way, and the feature is slated for Pulsar 2.7.0, expected in September 2020.
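The deduplication mechanism behind the exactly-once producer can be sketched as follows (illustrative, not Pulsar's actual implementation): the broker remembers the highest sequence ID it has persisted for each producer and silently drops anything at or below it, so a retried send after a timeout cannot create a duplicate.

```python
# Illustrative sketch of broker-side message deduplication keyed on
# per-producer sequence IDs. Class and field names are made up.

class DedupBroker:
    def __init__(self):
        self.last_seq = {}  # producer name -> highest sequence id persisted
        self.log = []

    def persist(self, producer, seq_id, payload):
        if seq_id <= self.last_seq.get(producer, -1):
            return False    # duplicate (e.g. a retried send): dropped
        self.last_seq[producer] = seq_id
        self.log.append(payload)
        return True

dedup = DedupBroker()
dedup.persist("producer-1", 0, "a")
dedup.persist("producer-1", 1, "b")
retried = dedup.persist("producer-1", 1, "b")  # retry after a timeout
```

The retry is rejected, so the log contains each payload exactly once even though the producer sent it twice.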
Topic (log) compaction
Pulsar is designed to let applications choose whether to consume the raw data or a compacted view of it. With this on-demand approach, Pulsar allows the uncompacted data to grow without limit, governed by retention policies, while still supporting periodic compaction to generate an up-to-date view of the latest value per key. The built-in tiered-storage feature lets Pulsar offload uncompacted data from BookKeeper to the cloud, reducing the cost of storing events over time.
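The core of compaction can be sketched in a few lines (illustrative only; key and value names are made up): the raw log is kept subject to retention, while compaction derives a separate latest-value-per-key view that readers can opt into.

```python
# Illustrative sketch of topic compaction: derive a latest-value-per-key
# view without discarding the raw log.

def compact(log):
    """log: list of (key, value) pairs; returns the latest value per key
    (keys appear in first-seen order)."""
    latest = {}
    for key, value in log:
        latest[key] = value  # later writes for the same key win
    return list(latest.items())

raw_log = [("k1", "v1"), ("k2", "v1"), ("k1", "v2"), ("k1", "v3")]
compacted_view = compact(raw_log)
# Consumers can still read raw_log in full; compacted_view is an
# additional, on-demand view of it.
```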
In contrast, Kafka does not let users keep consuming the raw data: once log compaction runs, Kafka removes the older records for each key, leaving only the latest value.
Use cases
Event streaming
Yahoo originally developed Pulsar as a publish/subscribe messaging platform (also known as cloud messaging). Today, however, Pulsar is more than a messaging platform: it is a unified messaging and event-streaming platform. Pulsar provides a set of tools that together form the foundation for building event-streaming applications. Pulsar supports event streaming with the following:
- Infinite event-stream storage, which supports large-scale storage of events by scaling out log storage (via Apache BookKeeper), with built-in tiered storage onto low-cost systems such as S3, HDFS, and more.
- A unified publish/subscribe messaging model that makes it easy for users to add messaging to applications and scales with traffic and user demand.
- A protocol-handler framework providing compatibility with Kafka (via Kafka-on-Pulsar, KoP) and AMQP (via AMQP-on-Pulsar), so applications can produce and consume events anywhere using their existing protocols.
- Pulsar IO, a set of connectors that integrate with a large ecosystem, letting users move data to and from external systems without writing code.
- Integration with Flink for comprehensive event processing.
- Pulsar Functions, a lightweight serverless framework for processing incoming events.
- Integration with Presto (Pulsar SQL), enabling data specialists and analysts to query data and drive business logic using ANSI-compatible SQL.
Message routing
Through Pulsar IO, Pulsar Functions, and Pulsar protocol handlers, Pulsar provides comprehensive routing capabilities, including content-based routing, message transformation, and message enrichment.
Pulsar's routing capabilities are more robust than Kafka's. Pulsar offers a more flexible deployment model for connectors and Functions: a simple deployment can run inside the broker, while larger deployments can run in dedicated node pools (similar to Kafka Streams) that scale out. Pulsar also integrates natively with Kubernetes and can be configured to schedule Functions and connector workloads as pods, taking advantage of Kubernetes' elasticity.
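A content-based routing rule of the kind one might express in a Pulsar Function can be sketched as follows (illustrative only: the topic names and the `priority`/`region` fields are made up for the example).

```python
# Illustrative sketch of content-based routing: inspect each event and
# pick an output topic. All names here are hypothetical.

def route(event):
    """Return the destination topic for an event (a dict)."""
    if event.get("priority") == "high":
        return "persistent://ops/alerts/high-priority"
    region = event.get("region", "unknown")
    return f"persistent://ops/events/{region}"

dest_high = route({"priority": "high", "region": "eu"})
dest_us = route({"region": "us"})
```

In a real deployment, the same decision logic would run inside a Function or connector, with the framework handling delivery to the chosen topic.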
Message queuing
As mentioned earlier, Pulsar was originally developed as a unified publish/subscribe messaging platform. The Pulsar team studied the strengths and weaknesses of existing open source messaging systems and designed Pulsar's unified messaging model from that experience. The Pulsar messaging API supports both queuing and streaming: worker queues, where messages are delivered to competing consumers in round-robin fashion (via shared subscriptions), and event streams, either ordered within a partition (via failover subscriptions) or ordered within a key range (via key_shared subscriptions). Users can build messaging applications and event-streaming applications on the same data, without copying it into different data systems.
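The queue side of this unified model can be sketched as follows (illustrative only; worker and task names are made up): a shared subscription round-robins messages across competing consumers, so adding a consumer adds processing capacity without repartitioning the topic.

```python
# Illustrative sketch of shared-subscription (worker queue) dispatch:
# round-robin delivery to competing consumers.

def dispatch_shared(messages, consumers):
    assignments = {c: [] for c in consumers}
    for i, msg in enumerate(messages):
        assignments[consumers[i % len(consumers)]].append(msg)
    return assignments

work = [f"task-{i}" for i in range(6)]
shared_out = dispatch_shared(work, ["worker-1", "worker-2", "worker-3"])
```

Each worker receives a share of the tasks; per-message ordering across workers is intentionally not guaranteed in this mode, which is the trade-off for unbounded consumer scaling.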
In addition, the Pulsar community is also trying to make Apache Pulsar natively support different messaging protocols (such as AoP, KoP) to extend Pulsar’s capabilities.
Conclusion
The Pulsar community keeps growing, and in a virtuous circle the Pulsar ecosystem has expanded as the technology has matured and its use cases have multiplied.
Pulsar has a number of advantages that make it stand out as a unified message and event flow platform and a choice for more people. Compared to Kafka, Pulsar is more resilient and simpler to operate and scale.
Most new technologies take time to roll out and be adopted, but Pulsar offers a complete solution that is low-maintenance and usable immediately after installation. Pulsar covers all the basics needed to build an event-streaming application and includes many built-in features and tools that work out of the box, with no separate installation required.
StreamNative has been working to enhance existing features while developing new features for Pulsar, while promoting community growth. All is well on the way to achieving the 2020 annual mission and we look forward to releasing Pulsar 2.7.0 in September.
Stay tuned for the next article in this series, covering Pulsar's use cases, support, and community.
Special thanks
Thanks to the many members of the Pulsar community who helped write and publish this article: Jerry Peng, Jesse Anderson, Joe Francis, Matteo Merli, Sanjeev Kulkarni, Addison Higham, and more.