This article is reprinted from The Java Advanced Architecture. The original Chinese version was translated from Lewis Fairweather’s article Pulsar Advantages Over Kafka. Apache Pulsar welcomes contributions from everyone; join the community and get in touch with the authors! Layout: Tango @ StreamNative

About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. Its architecture separates compute from storage, and it supports multi-tenancy, persistent storage, and cross-region data replication across multiple data centers. It offers strong consistency, high throughput, low latency, and high scalability. GitHub address: github.com/apache/puls…

Photo by Mikael Kristenson on Unsplash

Introduction

Recently, I’ve been studying Pulsar and how it compares to Kafka. A quick search reveals the ongoing battle between the two most famous open source messaging systems.

As a Kafka user, I have been frustrated by some of Kafka’s issues, and Pulsar has been an eye-opener and very exciting for me. So I finally set aside some time to study the background and do a lot of research. In this article, I’ll focus on Pulsar’s advantages and explain why I think Pulsar is better than Kafka. Let’s get started!

Kafka basics

Kafka is the king of messaging systems. It was created by LinkedIn in 2011 and spread widely with Confluent’s support. Confluent has released many new features and add-ons to the open source community, such as the Schema Registry for schema evolution, Kafka Connect for easy streaming from other data sources (such as databases) into Kafka, Kafka Streams for distributed stream processing, and, more recently, KSQL for running SQL-like queries on Kafka topics.

Kafka is fast, easy to install, very popular, and can be used for a wide range of use cases. From a developer’s perspective Apache Kafka has always been friendly, but operationally it is a mess. So let’s review some of Kafka’s pain points.

Kafka demo

Kafka’s many pain points

  • Scaling Kafka is tricky because brokers are coupled to the data they store. Removing a broker means its topic partitions and replicas must be replicated elsewhere, which is time-consuming;

  • No native multi-tenancy with complete isolation between tenants;

  • Storage becomes very expensive, and although data can be stored for a long time, it is rarely used because of the cost;

  • In case the replica is out of sync, messages may be lost;

  • The number of brokers, topics, partitions, and replicas must be planned and calculated in advance (allowing for future growth) to avoid scaling problems, which can be very difficult;

  • If you only need a messaging system, using offsets can be complicated;

  • Cluster rebalancing affects the performance of connected producers and consumers;

  • The MirrorMaker geo-replication mechanism has problems. Companies like Uber have created their own solutions to overcome them.

As you can see, most of the problems are related to operations. Despite being relatively easy to install, Kafka is difficult to manage and tune. Moreover, it lacks the flexibility and elasticity that it should have.

Basics of Pulsar

Pulsar was created by Yahoo! in 2013 and donated to the Apache Software Foundation in 2016. Pulsar is now a top-level project of the Apache Software Foundation. Yahoo!, Verizon, Twitter, and others have used it in production to process thousands of messages. It is flexible and has a low operating cost. Pulsar aims to solve most of Kafka’s problems and make scaling easier.

Pulsar is very flexible: it can be used in both distributed logging scenarios like Kafka and pure messaging systems like RabbitMQ. It supports multiple types of subscriptions, multiple delivery guarantees, retention strategies, and ways to handle schema evolution, among many other features.

Pulsar architecture diagram

Pulsar features

  • Built-in multi-tenancy: different teams can share the same cluster while staying isolated from each other, which solves many administrative challenges. It supports isolation, authentication, authorization, and quotas;

  • Multi-tier architecture: Pulsar stores all topic data in a dedicated data layer backed by Apache BookKeeper. Separating storage from messaging solves many of the problems of scaling, rebalancing, and maintaining clusters. It also improves reliability and makes it almost impossible to lose data. In addition, readers can connect directly to BookKeeper without affecting real-time ingestion. For example, you can use Presto to run SQL queries on topics, similar to KSQL, but without affecting real-time data processing;

  • Virtual Topic: Because of the N-tier architecture, there is no limit to the number of topics, and the topic and its storage are separated. Users can also create non-persistent topics;

  • N-tier storage: One problem with Kafka is that storage costs can be high. As a result, it is rarely used to store “cold” data, and messages are frequently deleted. With tiered storage, Apache Pulsar can automatically offload old data to Amazon S3 or other data storage systems and still present a transparent view to the client. A Pulsar client can read from the beginning of time as if all messages were still in the log;

  • Pulsar Functions: easy to deploy, lightweight compute with a developer-friendly API; no need to run your own stream processing engine (such as Kafka);

  • Security: It has built-in proxy, multi-tenant security, pluggable authentication, and more;

  • Fast rebalancing: Partitions are divided into segments that can be easily rebalanced;

  • Server-side deduplication and dead-letter handling: You do not need to do this on the client. Duplicate data can also be removed during compaction;

  • Built-in Schema Registry: Supports multiple policies and is easy to operate;

  • Geo-replication and built-in discovery: easy to replicate clusters across multiple regions;

  • Integrated load balancer and Prometheus metrics;

  • Multiple integrations: Kafka, RabbitMQ, etc.;

  • Support for multiple programming languages, such as Go, Java, Scala, Node, Python, and more;

  • Sharding and data partitioning are performed transparently on the server, and the client does not need to know about the sharding and data partitioning.
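
To make the server-side deduplication item above more concrete, here is a minimal sketch in Python. It is not Pulsar’s actual implementation. It assumes each producer attaches an increasing sequence ID to its messages and the broker remembers the highest ID it has stored per producer; the names `DedupBroker` and `publish` are hypothetical.

```python
class DedupBroker:
    """Toy broker that drops messages it has already persisted."""

    def __init__(self):
        self.log = []        # persisted messages, in arrival order
        self.last_seq = {}   # producer name -> highest sequence ID stored

    def publish(self, producer, seq_id, payload):
        """Persist the message unless it is a retry of one already stored."""
        if seq_id <= self.last_seq.get(producer, -1):
            return False     # duplicate: not stored a second time
        self.log.append((producer, seq_id, payload))
        self.last_seq[producer] = seq_id
        return True


broker = DedupBroker()
broker.publish("producer-a", 0, "order-created")
broker.publish("producer-a", 1, "order-paid")
# The producer never saw the ack for sequence 1, so it retries the send.
broker.publish("producer-a", 1, "order-paid")
print([payload for _, _, payload in broker.log])
# prints ['order-created', 'order-paid']
```

The point of the design is that the retry logic stays trivial on the client: it can resend freely, and the server decides what is new.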

Pulsar features List:

Introduction to Pulsar

Getting started with Pulsar is easy. The prerequisite is to install the JDK.

1. Download Pulsar and unpack it (note: the latest version of Apache Pulsar is 2.7.0):

$ wget https://archive.apache.org/dist/pulsar/pulsar-2.6.1/apache-pulsar-2.6.1-bin.tar.gz

2. Download a connector (optional), replacing {connector} with the connector name:

$ wget https://archive.apache.org/dist/pulsar/pulsar-2.6.1/connectors/{connector}-2.6.1.nar

3. After downloading the NAR file, copy it to the connectors directory inside the Pulsar directory.

4. Start Pulsar:

$ bin/pulsar standalone

Pulsar provides a CLI tool called pulsar-client that you can use to interact with the cluster.

$ bin/pulsar-client produce my-topic --messages "hello-pulsar"

$ bin/pulsar-client consume my-topic -s "first-subscription"
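
The short name `my-topic` in these commands is shorthand. Pulsar topic names have the form `persistent://tenant/namespace/topic` (or `non-persistent://...`), and a bare name resolves to a persistent topic in the `public` tenant and `default` namespace. The helper below is a hypothetical illustration of that rule, not part of any Pulsar client, and it ignores legacy name formats.

```python
def full_topic_name(name):
    """Expand a short topic name to its fully qualified form."""
    if "://" in name:
        return name                                   # already fully qualified
    parts = name.split("/")
    if len(parts) == 1:
        return "persistent://public/default/" + name  # bare topic name
    if len(parts) == 3:
        return "persistent://" + name                 # tenant/namespace/topic
    raise ValueError("unsupported topic name: " + name)


print(full_topic_name("my-topic"))
# prints persistent://public/default/my-topic
```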

Akka Streams example

As an example of a client, we use Pulsar4s with Akka Streams.

First, we need to create a Source to consume the data stream. All that is needed is a function that creates the consumer on demand, plus the message ID to seek to:

val topic = Topic("persistent://standalone/mytopic")

val consumerFn = () => client.consumer(ConsumerConfig(topic, subscription)) 

Then, we pass the consumerFn function to create the source:

import com.sksamuel.pulsar4s.akka.streams._

val pulsarSource = source(consumerFn, Some(MessageId.earliest)) 

The materialized value of the Akka source is an instance of Control, which provides a “close” method that can be used to stop consuming messages. Now we can use Akka Streams to process the data as usual.

To create a sink:

val topic = Topic("persistent://standalone/mytopic")

val producerFn = () => client.producer(ProducerConfig(topic))

import com.sksamuel.pulsar4s.akka.streams._

val pulsarSink = sink(producerFn)

The complete example is from Pulsar4s.

object Example {

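  // Akka and pulsar4s imports needed by the names used below
  import akka.actor.ActorSystem
  import akka.stream.ActorMaterializer
  import com.sksamuel.pulsar4s.{ConsumerConfig, MessageId, ProducerConfig, ProducerMessage, PulsarClient, Subscription, Topic}
  import com.sksamuel.pulsar4s.akka.streams._
  import org.apache.pulsar.client.api.Schema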

  implicit val system: ActorSystem = ActorSystem()
  implicit val materializer: ActorMaterializer = ActorMaterializer()
  implicit val schema: Schema[Array[Byte]] = Schema.BYTES

  val client = PulsarClient("pulsar://localhost:6650")

  val intopic = Topic("persistent://sample/standalone/ns1/in")
  val outtopic = Topic("persistent://sample/standalone/ns1/out")

  val consumerFn = () => client.consumer(ConsumerConfig(topics = Seq(intopic), subscriptionName = Subscription("mysub")))
  val producerFn = () => client.producer(ProducerConfig(outtopic))

  val control = source(consumerFn, Some(MessageId.earliest))
    .map { consumerMessage => ProducerMessage(consumerMessage.data) }
    .to(sink(producerFn)).run()

  Thread.sleep(10000)
  control.stop()

}

Pulsar Function example

The Pulsar Function processes messages from one or more topics, transforms them, and outputs the results to another topic:

Pulsar Function

You can choose between two interfaces to write functions:

  • Language-native interface: no Pulsar-specific library or special dependencies required, but no access to the context; supported only for Java and Python;

  • Pulsar Functions SDK: available for Java, Python, and Go; provides more features, such as access to the context object.

Here is a simple function that transforms messages using the language-native interface:

def process(input):
    return "{}!".format(input)

This simple function, written in Python, adds an exclamation mark to every incoming string and publishes the resulting string to the output topic.
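
Because a language-native function is plain Python with no Pulsar imports, it can be tried out locally before it is deployed. A minimal sketch, using made-up sample inputs:

```python
def process(input):
    """Append an exclamation mark to an incoming message."""
    return "{}!".format(input)


# Call the function directly, once per sample message.
for message in ["hello", "pulsar"]:
    print(process(message))
# prints hello! and then pulsar!
```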

Using the SDK requires importing dependencies. For example, in Go, we could write:

package main

import (
	"context"
	"fmt"

	"github.com/apache/pulsar/pulsar-function-go/pf"
)

func HandleRequest(ctx context.Context, in []byte) error {
	fmt.Println(string(in) + "!")
	return nil
}

func main() {
	pf.Start(HandleRequest)
}

To publish a serverless function and deploy it to a cluster, use the pulsar-admin CLI. For a Python function, we could write:

$ bin/pulsar-admin functions create \
  --py ~/router.py \
  --classname router.RoutingFunction \
  --tenant public \
  --namespace default \
  --name route-fruit-veg \
  --inputs persistent://public/default/basket-items

An important feature of Pulsar Functions is that the user can set the delivery guarantee when publishing the function:

$ bin/pulsar-admin functions create \
  --name my-effectively-once-function \
  --processing-guarantees EFFECTIVELY_ONCE

The options are as follows: ATMOST_ONCE, ATLEAST_ONCE, and EFFECTIVELY_ONCE.
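
The practical difference between at-most-once, at-least-once, and effectively-once processing shows up when a function crashes partway through a message. The toy simulation below is a sketch under simplifying assumptions, not how Pulsar implements these guarantees: the function `run` is hypothetical, a single crash is injected for one message, and message content stands in for a message ID.

```python
def run(messages, guarantee, crash_on):
    """Process messages, crashing once while handling `crash_on`.

    Returns the outputs written to the (simulated) output topic.
    """
    outputs = []
    seen = set()              # consulted only for EFFECTIVELY_ONCE
    crashed = False
    queue = list(messages)
    while queue:
        msg = queue.pop(0)
        if guarantee == "ATMOST_ONCE":
            # Acknowledged on receipt: a crash during processing loses it.
            if msg == crash_on and not crashed:
                crashed = True
                continue
            outputs.append(msg + "!")
            continue
        # Acknowledged only after processing has finished.
        if guarantee == "EFFECTIVELY_ONCE" and msg in seen:
            continue          # redelivered message, output already written
        outputs.append(msg + "!")
        seen.add(msg)
        if msg == crash_on and not crashed:
            crashed = True
            queue.insert(0, msg)   # crash before the ack, so it is redelivered
    return outputs


for guarantee in ("ATMOST_ONCE", "ATLEAST_ONCE", "EFFECTIVELY_ONCE"):
    print(guarantee, run(["a", "b", "c"], guarantee, crash_on="b"))
# ATMOST_ONCE ['a!', 'c!']
# ATLEAST_ONCE ['a!', 'b!', 'b!', 'c!']
# EFFECTIVELY_ONCE ['a!', 'b!', 'c!']
```

At-most-once can lose a message, at-least-once can duplicate one, and effectively-once pays for extra bookkeeping to avoid both.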

Advantages of Pulsar

Let’s review Pulsar’s main advantages over Kafka:

  • More features: Pulsar Functions, multi-tenancy, Schema Registry, N-tier storage, multiple consumption modes and persistence modes, etc.;

  • More flexibility: three subscription types (exclusive, shared, and failover); a single subscription can also cover multiple topics;

  • Persistence options: non-persistent (fast), persistent, and compacted (only the latest message per key); users can choose the delivery guarantee. Pulsar also has server-side deduplication, dead-letter handling, retention policies, and TTL;

  • No need to plan scaling requirements in advance;

  • Pulsar can replace RabbitMQ or Kafka by supporting both queue and stream message consumption models.

  • Storage is separated from the broker, so it scales better, rebalances faster, and is more reliable;

  • Easy operation and maintenance: architecture decoupling and N-tier storage;

  • Integration with Presto’s SQL to query the store directly without affecting the broker;

  • Lower cost storage with the n-tier automatic storage option;

  • Faster: Benchmarks show better performance in a variety of scenarios, with lower latency and better scalability;

  • Pulsar Functions support serverless-style computing with no separate deployment to manage;

  • Schema Registry integration;

  • Integrated load balancer and Prometheus metrics;

  • Geo-replication works better and is easier to set up, and Pulsar has built-in discovery;

  • There is no limit to the number of topics you can create.

  • Compatible with Kafka and easy to integrate.
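
The three subscription types named in the list above differ in how messages are spread across the consumers attached to one subscription. The sketch below only illustrates those dispatch rules and is not Pulsar’s dispatcher. The function `dispatch` is hypothetical, and it assumes round-robin delivery for shared subscriptions and that the first consumer listed is the active one for failover.

```python
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Return {consumer: [messages received]} for one subscription."""
    received = {consumer: [] for consumer in consumers}
    if mode == "exclusive":
        if len(consumers) != 1:
            raise ValueError("an exclusive subscription allows one consumer")
        received[consumers[0]] = list(messages)
    elif mode == "failover":
        # Only the active consumer receives messages; the rest stand by.
        received[consumers[0]] = list(messages)
    elif mode == "shared":
        # Messages are spread across all consumers (queue semantics).
        for msg, consumer in zip(messages, cycle(consumers)):
            received[consumer].append(msg)
    else:
        raise ValueError("unknown subscription type: " + mode)
    return received


msgs = ["m1", "m2", "m3", "m4"]
print(dispatch(msgs, ["c1", "c2"], "shared"))
# prints {'c1': ['m1', 'm3'], 'c2': ['m2', 'm4']}
print(dispatch(msgs, ["c1", "c2"], "failover"))
# prints {'c1': ['m1', 'm2', 'm3', 'm4'], 'c2': []}
```

Exclusive and failover preserve ordering for a single reader, which suits streaming; shared trades ordering for parallelism, which suits work queues.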

Disadvantages of Pulsar

Pulsar is not perfect and has some problems of its own:

  • A relative lack of support, documentation and case studies;

  • The N-tier architecture requires more components, namely BookKeeper;

  • There are fewer plug-ins and clients than for Kafka;

  • Less support in the cloud, whereas Confluent offers a managed cloud product for Kafka.

However, the situation is improving rapidly, and Pulsar is now being used by more and more companies and organizations. StreamNative, the company offering commercial support for Apache Pulsar, has launched StreamNative Cloud. Apache Pulsar is growing rapidly, and the changes are exciting to watch. Confluent has posted a blog comparing Pulsar and Kafka, but please note that such comparisons can be biased.

Pulsar use cases

Pulsar can be used for a wide range of scenarios:

  • Publish/subscribe queue messaging;

  • Distributed log;

  • Event sourcing, for persistent event storage;

  • Microservices;

  • SQL analytics;

  • Serverless functions.

When to consider Pulsar

  • You need both queues like RabbitMQ and stream processing like Kafka;

  • You need easy-to-use geo-replication;

  • You need multi-tenancy, with controlled access for each team;

  • You need to keep messages for a long time and don’t want to offload them to another store yourself;

  • You need high performance; benchmarks show Pulsar provides lower latency and higher throughput.

If you are in the cloud, consider a cloud-based solution. Cloud providers offer different services that cover certain scenarios. For queue messaging, for example, there are services such as Google Pub/Sub; for distributed logs, there is Confluent Cloud or AWS Kinesis; StreamNative also offers a cloud service based on Pulsar. Cloud providers also provide very good security. Pulsar’s advantage is that it offers a lot of functionality on one platform: some teams may use it as a messaging system for microservices, while others use it as a distributed log for data processing.

Conclusion

I’m a big fan of Kafka, and here’s why I’m so interested in Pulsar: Competition drives innovation.

Kafka is a mature, resilient, and proven product that has been so successful around the world that it’s hard to imagine most companies doing without it. But I do see Kafka becoming a victim of its own success: the effort needed to support so many large companies slows down feature development, and important changes such as removing the ZooKeeper dependency take too long, which creates space for tools like Pulsar to flourish.

Pulsar, despite its youth, is gaining momentum. It still needs to be analyzed, benchmarked, studied, and validated with a proof of concept before it is adopted in an organization. Start small: do a proof of concept before migrating from Kafka to Pulsar, and assess the impact before deciding on a full migration.

Further reading

Apache Pulsar Performance Tuning Architecture

· Performance tuning of Pulsar read and write process

· Why is Pulsar a cloud-native messaging platform?