This is the 8th day of my participation in the Gwen Challenge in November. See details: The Last Gwen Challenge in 2021.

Pulsar

Apache Pulsar is an enterprise-class publish-subscribe messaging system originally developed at Yahoo. It is a next-generation, cloud-native distributed messaging platform that integrates messaging, storage, and lightweight functional computing. Its compute-storage separation architecture supports multi-tenancy, persistent storage, and cross-region data replication across multiple data centers, and it offers strong consistency, high throughput, low latency, and high scalability.

Pulsar is very flexible: it can be used in both distributed logging scenarios like Kafka and pure messaging systems like RabbitMQ. It supports multiple types of subscriptions, multiple delivery guarantees, retention strategies, and ways to handle schema evolution, among many other features.

1. Characteristics of Pulsar

  • Built-in multi-tenancy: Different teams can share the same cluster while remaining isolated from each other, which solves many administrative challenges. Pulsar supports isolation, authentication, authorization, and quotas;

  • Multi-tier architecture: Pulsar stores all topic data in a dedicated data layer backed by Apache BookKeeper. The separation of storage and messaging solves many of the problems of scaling, rebalancing, and maintaining clusters. It also improves reliability and makes it almost impossible to lose data. In addition, readers can connect to BookKeeper directly without affecting real-time ingestion. For example, you can use Presto to run SQL queries against a topic, similar to KSQL, without affecting real-time data processing;

  • Virtual topics: Because of the N-tier architecture, there is no limit on the number of topics, and a topic is decoupled from its storage. Users can also create non-persistent topics;

  • N-tier storage: One problem with Kafka is that storage can be expensive, so it is rarely used to keep "cold" data, and messages are frequently deleted. With tiered storage, Apache Pulsar can automatically offload old data to Amazon S3 or other storage systems while still presenting a transparent view to the client. The Pulsar client can read a topic from the beginning of time as if every message still existed in the log;
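
The tiered-storage idea above can be sketched in a few lines. This is a hypothetical simulation, not Pulsar's API: a dict stands in for the cheap store (e.g. Amazon S3), old segments are offloaded once the hot tier is full, and the reader sees one continuous log regardless of tier.

```python
# Hypothetical sketch of tiered storage: older segments are offloaded to a
# cheaper store (a dict standing in for S3), while the reader still sees
# one continuous log. All names here are illustrative.

class TieredLog:
    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity  # max segments kept on fast storage
        self.hot = {}   # segment_id -> data (fast tier, e.g. BookKeeper)
        self.cold = {}  # segment_id -> data (cheap tier, e.g. S3)
        self.next_id = 0

    def append_segment(self, data):
        self.hot[self.next_id] = data
        self.next_id += 1
        # Offload the oldest hot segments once the hot tier is full.
        while len(self.hot) > self.hot_capacity:
            oldest = min(self.hot)
            self.cold[oldest] = self.hot.pop(oldest)

    def read(self, segment_id):
        # Transparent view: the reader does not care which tier serves it.
        if segment_id in self.hot:
            return self.hot[segment_id]
        return self.cold[segment_id]

log = TieredLog(hot_capacity=2)
for i in range(5):
    log.append_segment(f"segment-{i}")

print(sorted(log.hot))   # [3, 4] -- only the newest segments stay hot
print(sorted(log.cold))  # [0, 1, 2]
print(log.read(0))       # segment-0, served transparently from the cold tier
```

The point of the sketch is the last line: reading an offloaded segment looks exactly like reading a hot one.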

2. Pulsar storage architecture

Pulsar’s multi-tier architecture shapes how data is stored. Pulsar divides each topic into segments, which are then stored on Apache BookKeeper storage nodes to improve performance, scalability, and availability.
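
A minimal sketch of this segment-centric placement, under made-up names and a simple least-loaded heuristic (not BookKeeper's actual placement policy): each segment of a topic is written to a small replica set of storage nodes, so copies spread evenly across the cluster.

```python
# Minimal sketch of segment-centric placement: a topic's log is cut into
# segments, and each segment is replicated to the currently least-loaded
# storage nodes ("bookies"). Names and policy are illustrative only.

def place_segments(num_segments, bookies, replicas=2):
    load = {b: 0 for b in bookies}  # how many segment copies each node holds
    placement = {}
    for seg in range(num_segments):
        # Pick the `replicas` least-loaded bookies for this segment.
        chosen = sorted(load, key=lambda b: (load[b], b))[:replicas]
        for b in chosen:
            load[b] += 1
        placement[seg] = chosen
    return placement, load

placement, load = place_segments(6, ["bookie-1", "bookie-2", "bookie-3"])
print(load)  # 6 segments * 2 replicas / 3 nodes = 4 copies per bookie
```

Because no topic is pinned to a single node, adding or removing a bookie only changes where *future* segments land, which is why rebalancing is cheap.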

Pulsar’s infinite distributed log is segment-centric: it is built on scale-out log storage (Apache BookKeeper) with built-in tiered storage, so segments can be evenly distributed across storage nodes. Because the data for any given topic is not tied to a particular storage node, storage nodes can easily be replaced or scaled down. In addition, the smallest or slowest node in the cluster never becomes a storage or bandwidth bottleneck.

This architecture gives Pulsar seamless partition management and load balancing on one hand, and rapid scaling and high availability on the other. These two properties make Pulsar ideal for building mission-critical services such as billing platforms for financial applications, transaction processing systems for e-commerce and retail, and real-time risk-control systems for financial institutions.

Built on Netty, the movement of data from producers to brokers and on to bookies is zero-copy: no intermediate copies are made. This is very friendly to streaming scenarios, because data moves over the network or to disk without a performance penalty.

3. Pulsar message consumption

Pulsar’s consumption model uses streaming pull. Streaming pull is an improved form of long polling: it not only eliminates the wait between individual requests and responses but also provides a two-way message stream. Through streaming pull, Pulsar achieves end-to-end latency lower than that of long-polling-based messaging systems such as Kafka.
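
The flow-control side of streaming pull can be sketched as follows. This is a simplified simulation assuming Pulsar-style semantics (the `receiver_queue_size` name mirrors the client's receiver queue concept, but the class is invented): the consumer grants the broker permits up front, the broker pushes messages while permits remain, and each consumed message returns a permit, so there is no per-batch request/response round trip.

```python
# Sketch of streaming-pull flow control: the broker may push messages only
# while the consumer has outstanding permits; consuming a message returns a
# permit. This is a simulation, not the Pulsar client API.

from collections import deque

class StreamPullConsumer:
    def __init__(self, receiver_queue_size):
        self.permits = receiver_queue_size  # flow-control credits for broker
        self.queue = deque()

    def broker_push(self, msg):
        # The broker streams messages without waiting for a fetch request,
        # but only while the consumer has permits left.
        if self.permits <= 0:
            return False
        self.permits -= 1
        self.queue.append(msg)
        return True

    def receive(self):
        msg = self.queue.popleft()
        self.permits += 1  # returning a permit lets the broker keep streaming
        return msg

c = StreamPullConsumer(receiver_queue_size=2)
print(c.broker_push("m1"), c.broker_push("m2"), c.broker_push("m3"))
# True True False -- the third push waits until the consumer drains a slot
print(c.receive())          # m1, which returns one permit
print(c.broker_push("m3"))  # True
```

Contrast this with long polling, where each batch costs a full request/response cycle before any message can flow.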

Kafka

Official website: kafka.apache.org/

Kafka was originally developed at LinkedIn and is written in Scala. Kafka is a distributed, partitioned, multi-replica, multi-subscriber log system (a distributed MQ system) that can be used to collect search logs, monitoring logs, access logs, and so on.

Kafka is a distributed, partitioned, replicated commit-log service. It provides functionality similar to JMS, but with a completely different design; it is not an implementation of the JMS specification. Kafka stores messages by topic: the sender is called the producer and the receiver the consumer. A Kafka cluster consists of multiple Kafka instances, and each instance (server) is called a broker. The Kafka cluster, producers, and consumers all rely on ZooKeeper to keep the system available and to store cluster metadata.

1. Benefits of Kafka

  • Reliability: distributed, partitioned, replicated, and fault-tolerant.
  • Scalability: the Kafka messaging system scales easily without downtime.
  • Durability: Kafka uses a distributed commit log, so messages are persisted to disk as quickly as possible.
  • Performance: Kafka has high throughput for both publishing and subscribing, and it maintains stable performance even when storing many terabytes of messages.
  • Speed: Kafka is very fast and is designed for zero downtime and zero data loss.

2. Distributed publish and subscribe system

Apache Kafka is a distributed publish-subscribe messaging system with a powerful queue that can handle large volumes of data and deliver messages from one endpoint to another. Kafka is suitable for both offline and online message consumption. Kafka messages are kept on disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service, and it integrates very well with Apache Storm and Spark for real-time streaming data analysis.

3. Main application scenarios of Kafka

1. Metrics analysis

Kafka is typically used for operational monitoring data: it aggregates statistics from distributed applications to produce centralized feeds of operational data.

2. Log aggregation solution

Kafka can be used to collect logs from multiple services across an organization and make them available in a standard format to multiple consumers.

3. Streaming processing

Kafka’s strong durability is also very useful for stream processing: frameworks such as Spark, Storm, and Flink read raw data from Kafka topics, process it, and write the processed data to new topics for consumption by users and applications.
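
The read-process-write pattern can be sketched as a single micro-batch over an input topic. The topic contents and the `stream_process` helper here are made up for illustration; a real framework (Spark/Storm/Flink) would run this loop continuously against live topics.

```python
# Sketch of read-process-write stream processing: read events from an input
# topic, transform them, and emit results for a new topic. Illustrative only.

def stream_process(input_topic, transform):
    # One micro-batch: in a real stream job this runs continuously.
    return [transform(event) for event in input_topic]

raw_clicks = [{"user": "a", "ms": 1200}, {"user": "b", "ms": 300}]
# Flag slow page loads and write the enriched events to a downstream topic.
slow_topic = stream_process(raw_clicks, lambda e: {**e, "slow": e["ms"] > 500})
print(slow_topic[0]["slow"], slow_topic[1]["slow"])  # True False
```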

Contrast Kafka with Pulsar

1. Main advantages of Pulsar:

  • More features: Pulsar Function, multi-tenant, Schema Registry, N-tier storage, multiple consumption modes and persistence modes, etc.

  • More flexibility: three subscription types (exclusive, shared, and failover), and users can manage multiple topics through a single subscription;

  • Easy operation and maintenance: architecture decoupling and N-tier storage;

  • Integration with Presto’s SQL to query the store directly without affecting the broker;

  • Lower-cost storage with the N-tier automatic tiered-storage option;
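
The three subscription types listed above can be sketched as dispatch policies. This is a simulation assuming Pulsar-style semantics (the function and consumer names are invented): exclusive allows a single consumer, failover keeps one active consumer with the rest on standby, and shared round-robins messages across all consumers.

```python
# Sketch of Pulsar's three subscription types as dispatch policies.
# Illustrative simulation, not the Pulsar client API.

def dispatch(messages, consumers, mode):
    delivered = {c: [] for c in consumers}
    if mode == "exclusive":
        # Only one consumer may attach to an exclusive subscription.
        if len(consumers) > 1:
            raise ValueError("exclusive subscription allows only one consumer")
        delivered[consumers[0]] = list(messages)
    elif mode == "failover":
        # One consumer is active; the others are standbys that take over
        # if it disconnects.
        active = sorted(consumers)[0]
        delivered[active] = list(messages)
    elif mode == "shared":
        # Messages are spread round-robin across all consumers.
        for i, msg in enumerate(messages):
            delivered[consumers[i % len(consumers)]].append(msg)
    return delivered

msgs = ["m1", "m2", "m3", "m4"]
print(dispatch(msgs, ["c1"], "exclusive"))       # c1 gets everything
print(dispatch(msgs, ["c1", "c2"], "failover"))  # c1 active, c2 on standby
print(dispatch(msgs, ["c1", "c2"], "shared"))    # m1,m3 -> c1; m2,m4 -> c2
```

Shared subscriptions give queue-like semantics (RabbitMQ-style work sharing), while exclusive and failover preserve ordering for a single reader.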

2. Disadvantages of Pulsar

Pulsar is not perfect, and Pulsar also has some problems:

  • A relative lack of support, documentation and case studies;

  • The N-tier architecture requires an additional component: BookKeeper;

  • Fewer plug-ins and clients than Kafka;

  • Less support in the cloud: Confluent offers a managed cloud product for Kafka, while Pulsar has fewer managed offerings.

3. When should you consider Pulsar

  • You need both queues, like RabbitMQ, and stream processing, like Kafka;

  • Easy-to-use geographic replication is required;

  • You need multi-tenancy with guaranteed access for each team;

  • You need to hold messages for a long time and don’t want to offload them to another store;

  • You need high performance; some benchmarks show Pulsar providing lower latency and higher throughput;

In short, Pulsar is still relatively new: its community is not yet mature, relatively few enterprises use it, and valuable online discussion and troubleshooting material is scarce. Its ecosystem is far smaller than Kafka's, whose user base is very large; for now, Kafka is still the king of big-data message queues. So we’ll stick with Kafka!