The original article was written by Maximilian Michels and translated by Sijia@StreamNative.

English link: dzone.com/articles/10…

About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight function-based computing. Its architecture separates compute from storage, and it supports multi-tenancy, persistent storage, and cross-region data replication across multiple data centers. It offers strong consistency, high throughput, low latency, and high scalability. GitHub address: github.com/apache/puls…

Apache Pulsar has a number of distinctive features, such as tiered storage, stateless brokers, geo-replication, and multi-tenancy, that give it an edge over Kafka.

If you’re still on the fence between Pulsar and Kafka, hopefully the 10 advantages of Pulsar listed in this article will help you decide.

Stateless brokers (easy to scale)

With Kafka, you need to decide on the number of brokers up front. Because Kafka stores data on the brokers themselves, if that number turns out to be too small and you add brokers to scale the application, topics must be repartitioned before they can take advantage of the new brokers.

Pulsar keeps message data out of the brokers and stores it in a separate layer (Apache BookKeeper). Because the broker layer is decoupled from the storage layer, brokers can be added or removed without moving data. In other words, you can take full advantage of a new broker without repartitioning existing data.
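To make the decoupling concrete, here is a toy model in Python. It is only a sketch: the `Cluster` class, the broker names, and the assignment rule in `owner` are made up for illustration and are not how Pulsar actually assigns topics to brokers. The point is that adding a broker changes which broker serves a topic, while the stored data stays where it is.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model: brokers serve topics, a shared storage layer holds the data."""
    brokers: list
    storage: dict = field(default_factory=dict)  # topic -> messages, stand-in for BookKeeper

    def owner(self, topic):
        # Hypothetical assignment rule: pick a broker from the topic name.
        return self.brokers[sum(topic.encode()) % len(self.brokers)]

    def publish(self, topic, message):
        # Messages land in the storage layer, not on the owning broker.
        self.storage.setdefault(topic, []).append(message)

    def add_broker(self, name):
        # Scaling out only changes who serves a topic; stored data stays put.
        self.brokers.append(name)


cluster = Cluster(brokers=["broker-1", "broker-2"])
for topic in ("orders", "payments", "clicks"):
    cluster.publish(topic, f"first message on {topic}")

owners_before = {t: cluster.owner(t) for t in cluster.storage}
data_before = {t: list(msgs) for t, msgs in cluster.storage.items()}

cluster.add_broker("broker-3")
owners_after = {t: cluster.owner(t) for t in cluster.storage}

moved = [t for t in owners_before if owners_before[t] != owners_after[t]]
print("topics now served by a different broker:", moved)
print("stored data unchanged:", cluster.storage == data_before)
```

In a design where each broker holds its own copy of the data, the same scale-out step would also have to copy messages to the new broker.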

Tiered storage (persist messages and reduce storage costs)

Kafka retains data for seven days by default, meaning it deletes data after a week. By default, Pulsar keeps all unacknowledged messages and deletes messages as soon as they have been acknowledged.

Both Kafka and Pulsar let you customize retention policies to change how long data is kept. However, the amount of data that primary storage can hold is limited, and storing more data also raises storage costs. Tiered storage lets you choose cost-effective, appropriate storage for different types of data. For example, historical data is typically only needed to bootstrap (backfill) applications, so you can keep it on a different type of storage.
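As a rough illustration of what a retention policy does, the sketch below applies a time limit and a size limit to a list of stored messages. `StoredMessage` and `apply_retention` are hypothetical names, and neither Kafka nor Pulsar is implemented this way; the code only mirrors the idea of bounding how long and how much data is kept.

```python
from dataclasses import dataclass


@dataclass
class StoredMessage:
    published_at: float  # seconds since some reference point
    size_bytes: int


def apply_retention(messages, now, max_age_s, max_bytes):
    """Return the messages kept under a time limit and a size limit."""
    # Time limit: drop everything older than max_age_s.
    kept = [m for m in messages if now - m.published_at <= max_age_s]
    kept.sort(key=lambda m: m.published_at)
    # Size limit: drop the oldest remaining messages until the total fits.
    total = sum(m.size_bytes for m in kept)
    while kept and total > max_bytes:
        total -= kept.pop(0).size_bytes
    return kept


backlog = [StoredMessage(published_at=t, size_bytes=10) for t in (0, 100, 200, 300)]
kept = apply_retention(backlog, now=300, max_age_s=250, max_bytes=20)
print([m.published_at for m in kept])
```

Raising either limit keeps more history on primary storage, which is exactly the cost pressure that tiered storage is meant to relieve.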

Pulsar's storage layer uses a segment-based architecture, with segments distributed across the storage nodes. Segments can either be written to primary storage or offloaded to other types of storage. Therefore, Pulsar supports tiered storage, but Kafka does not currently support tiered storage. Tiered storage provides multiple storage tiers, such as primary storage (based on SSDs) and historical storage (S3), and lets you easily see what is stored in each tier.
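The offloading idea can be sketched in a few lines. This is an assumption-laden model, not Pulsar's offloader: the `Segment` class, the tier names, and the size-budget rule in `offload` are invented here to show how closed segments might move from primary storage to a cheaper tier.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: int
    size_bytes: int
    closed: bool            # only closed (no longer written) segments may move
    tier: str = "primary"   # "primary" (e.g. SSD) or "cold" (e.g. S3)


def offload(segments, primary_budget_bytes):
    """Move the oldest closed segments to the cold tier until primary fits the budget.

    Returns the number of bytes left on primary storage.
    """
    used = sum(s.size_bytes for s in segments if s.tier == "primary")
    for seg in sorted(segments, key=lambda s: s.segment_id):
        if used <= primary_budget_bytes:
            break
        if seg.closed and seg.tier == "primary":
            seg.tier = "cold"
            used -= seg.size_bytes
    return used


topic_segments = [
    Segment(0, 100, closed=True),
    Segment(1, 100, closed=True),
    Segment(2, 100, closed=True),
    Segment(3, 100, closed=False),  # still being written
]
remaining = offload(topic_segments, primary_budget_bytes=250)
print(remaining, [s.tier for s in topic_segments])
```

Because the unit of movement is a whole segment, recent data stays on fast storage while older segments can be moved without touching the segment that is still being written.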

Quorum-based replication (more consistent latency)

Pulsar uses a quorum-based algorithm for replication, while Kafka uses a leader-follower algorithm. Although Pulsar and Kafka offer the same guarantees, the quorum-based approach produces more consistent latency. For many applications, consistent latency is important, for example to meet SLAs (such as the response time of a query).
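One way to see why a quorum helps with consistent latency is a small calculation. The sketch below is a simplified model, not a benchmark of either system: it assumes a write is sent to all replicas in parallel, and the function names and latency figures are made up. With a quorum, the write is confirmed once enough replicas respond, so one slow replica does not hold it up.

```python
def quorum_ack_latency(replica_latencies_ms, ack_quorum):
    """Time until `ack_quorum` replicas have confirmed a write sent to all in parallel."""
    if not 1 <= ack_quorum <= len(replica_latencies_ms):
        raise ValueError("ack_quorum must be between 1 and the number of replicas")
    # The write completes when the ack_quorum-th fastest replica responds.
    return sorted(replica_latencies_ms)[ack_quorum - 1]


def wait_for_all_latency(replica_latencies_ms):
    """Time until every replica has confirmed the write."""
    return max(replica_latencies_ms)


# Hypothetical per-replica latencies; the third replica is having a slow moment.
latencies = [4, 5, 120]
print("quorum of 2:", quorum_ack_latency(latencies, ack_quorum=2), "ms")
print("wait for all:", wait_for_all_latency(latencies), "ms")
```

The quorum write is bounded by the second-fastest replica, which is why occasional stragglers show up less in tail latency.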

Geo-replication (high availability)

Pulsar natively supports geo-replication, so it can replicate data across data centers in different geographic locations. Keeping copies of messages in multiple data centers improves availability, which is especially important during data center outages or network partitions.

Multi-tenancy (simplified architecture and administration)

Pulsar supports multi-tenancy: multiple user groups can share the same cluster, separated by access control or by entirely different namespaces. Kafka does not currently support multi-tenancy, so sharing a cluster requires either an abstraction layer on top of the messaging system or a separate cluster for each user group.
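Multi-tenancy is visible in Pulsar's topic names, which follow the pattern `persistent://tenant/namespace/topic` (or `non-persistent://...`). The sketch below parses such a name and runs a namespace-level access check. The parser reflects the naming pattern; the `grants` table, the role names, and `can_access` are hypothetical and only illustrate the idea, they are not Pulsar's authorization API.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str
    tenant: str
    namespace: str
    topic: str


def parse_topic(name):
    """Split 'persistent://tenant/namespace/topic' into its parts."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unexpected topic domain in {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


def can_access(grants, role, topic_name):
    """Hypothetical check: is `role` granted on the topic's tenant/namespace?"""
    t = parse_topic(topic_name)
    return role in grants.get(f"{t.tenant}/{t.namespace}", set())


# Made-up grants: two teams share one cluster but not each other's namespaces.
grants = {
    "billing/prod": {"billing-service"},
    "analytics/prod": {"analytics-service"},
}
topic = "persistent://billing/prod/invoices"
print(parse_topic(topic))
print(can_access(grants, "billing-service", topic))
print(can_access(grants, "analytics-service", topic))
```

Because the tenant and namespace are part of every topic name, isolation can be applied at those levels without running a separate cluster per team.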

Message encryption (improved security)

Pulsar provides full end-to-end encryption from the client to the storage node. Full encryption is generally required for data security. Kafka does not currently support end-to-end encryption.

Multi-protocol support (easy integration with existing applications)

Pulsar not only works with multiple messaging protocols and systems (such as RabbitMQ, AMQP, and Kafka), but also supports reading historical stream events in parallel using Presto.

Pulsar Functions (one-stop stream processing)

Pulsar Functions is a lightweight stream processing framework built into Pulsar, similar in concept to Kafka Streams. Pulsar Functions are deployed directly on broker nodes (or as containers in a Kubernetes cluster), whereas Kafka Streams runs as a separate application. With Pulsar Functions, Pulsar can handle many stream processing tasks directly and simplify operations.
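To give a feel for how small such a function can be, here is a minimal sketch in the style of a plain Python function that takes one message and returns one result. It assumes string payloads; how the function is packaged, deployed, and connected to its input and output topics is not shown here.

```python
def process(input):
    """Normalize each incoming message: trim whitespace and lowercase it."""
    return input.strip().lower()


# Local check only: in a real deployment the runtime would call process() per message.
for message in ["  Hello Pulsar ", "ORDER-42"]:
    print(process(message))
```

The function holds no state and does no I/O of its own, which is what makes this style of processing lightweight compared with running a separate stream processing application.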

Apache Flink Integration (batch and stream processing)

The Pulsar community has had a number of public discussions about the limitations of Pulsar Functions, such as state management and DAG-style processing. If Pulsar Functions do not fit your use case, consider another popular open source tool, Apache Flink, through the Pulsar Flink Connector.

Pulsar is proven (in large-scale production environments)

Pulsar has a number of design advantages. It was originally developed at Yahoo for internal use, and in 2016 Yahoo donated it to the Apache Software Foundation. Since then, Pulsar has been adopted for mission-critical applications at companies such as Tencent and Splunk.

Pulsar is not perfect

Pulsar requires two systems: Apache BookKeeper and Apache ZooKeeper, while Kafka requires “only” ZooKeeper. Multiple systems increase operational complexity, but they also make Pulsar more flexible. Since both Kafka and Pulsar use other systems, both require setup and maintenance.

The choice between Pulsar and Kafka is not an easy one, and the decision has consequences. I’ve summarized the main differences between Pulsar and Kafka in this article, hoping that this information will help you and your team make a choice. For more information about Apache Pulsar, visit pulsar.apache.org or subscribe to the mailing list.