After LinkedIn created Apache Kafka in 2011, it quickly became the default choice for large-scale messaging systems. Why? Because these systems must deliver enormous volumes of messages every day (in 2018, Twitter alone averaged roughly 500 million tweets per day from about 100 million daily users). At the time, no message-oriented middleware (MOM) system could stream data to that many subscribers at that scale. As a result, many big-name companies such as LinkedIn, Yahoo, Twitter, Netflix, and Uber opted for Kafka.
Now, in 2019, the world has changed dramatically: billions of new messages flow every day, and supporting platforms must expand to keep up with growing demand. A messaging system therefore needs to scale seamlessly without impacting its users. Kafka has many scaling problems, and the system is difficult to manage. Some Kafka fans may object, but this is not personal bias; I am a Kafka fan myself. Objectively speaking, as the field develops and innovates, new tools become more convenient than old ones, and the old tools naturally start to feel buggy and hard to use.
A new product came into being — Apache Pulsar!
Yahoo created Pulsar in 2013 and donated it to the Apache Software Foundation in 2016. Pulsar has since become an Apache top-level project and gained worldwide recognition. Both Yahoo and Twitter use Pulsar, with Yahoo sending 100 billion messages a day across more than two million topics. That amount of traffic sounds incredible, but it is true!
Let’s take a look at Kafka pain points and Pulsar’s corresponding solutions.
- Kafka is difficult to scale because Kafka persists messages in its brokers. Migrating a topic partition requires copying the partition's data to other brokers, which is a time-consuming operation.
- When you need to add partitions to obtain more capacity, the change conflicts with how messages are indexed by key and scrambles message order. So Kafka can be tricky if users need to guarantee the order of messages.
- Leader election can go wrong if a partition's replicas are not in the ISR (in-sync replica) set. Normally a replica from the ISR should be elected when the original partition leader fails, but this is not guaranteed: if the configuration does not restrict leader election to ISR replicas, electing an out-of-sync replica can be worse than having no broker serve that partition at all.
- With Kafka, you need to plan the number of brokers, topics, partitions, and replicas around your current situation and expected future growth, to avoid the problems caused by scaling later. This is the ideal; in practice it is difficult to plan for, and scaling requirements are inevitable.
- Partition rebalancing in a Kafka cluster affects the performance of the producers and consumers involved.
- In the event of a failure, a Kafka topic cannot guarantee message integrity (especially in the case of point 3 above: messages are likely to be lost if the cluster also has to be scaled at the same time).
- Using Kafka involves dealing with offsets yourself, which can be a pain because the broker does not maintain the consumer's consumption state.
- If usage is high, you must delete old messages promptly, or you will run out of disk space.
- Kafka's native cross-region replication mechanism (MirrorMaker) is known to be problematic, even between just two data centers. As a result, even Uber had to create its own solution, which it calls uReplicator (eng.uber.com/ureplicator…).
- For real-time data analysis, you have to use third-party tools such as Apache Storm, Apache Heron, or Apache Spark, and you need to ensure those tools can handle the incoming traffic.
- Kafka has no native multi-tenancy feature to fully isolate tenants; instead it relies on security features such as topic authorization.
Of course, in a production environment, architects and engineers have ways to address these issues, but they remain a headache at the platform and site-reliability level, and solving them is not as simple as fixing the logic in code and deploying the packaged binary to production.
Now, let’s talk about Pulsar, the leader in the competitive field.
What is Apache Pulsar?
Apache Pulsar is an open source distributed publish-subscribe messaging system originally created at Yahoo. If you know Kafka, Pulsar will feel similar in nature.
Pulsar performance
Performance is where Pulsar shines. According to GigaOm (gigaom.com), a Texas-based technology research and analysis firm, Pulsar is much faster than Kafka: in GigaOm's benchmark, Pulsar delivered up to 2.5 times the throughput of Kafka with 40 percent lower latency (report: IO/PDF /Gigaom-…).
Note that this comparison used a single topic with a single partition and 100-byte messages. Under that setup, Pulsar sustained more than 220,000 messages per second.
Pulsar does this really well! For this reason, switching from Kafka to Pulsar is definitely worth it. Next, I’ll take a closer look at Pulsar’s strengths and features.
Advantages and features of Apache Pulsar
Pulsar supports both use as a message queue in pub-sub mode and sequential access (similar to Kafka's offset-based reading), giving users great flexibility.
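To make the two styles concrete, here is a minimal Java sketch (not from the original article) showing both side by side. It assumes a broker at `pulsar://localhost:6650` and a topic named `my-topic`, both placeholder values, and requires a running Pulsar cluster plus the `pulsar-client` library:

```java
import org.apache.pulsar.client.api.*;

public class ConsumeAndRead {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker address
                .build();

        // Pub-sub style: the broker tracks this subscription's position.
        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("my-topic")                    // placeholder topic
                .subscriptionName("my-subscription")
                .subscribe();
        Message<String> msg = consumer.receive();
        consumer.acknowledge(msg); // no manual offset bookkeeping needed

        // Reader style: the application controls the start position,
        // similar to seeking to an offset in Kafka.
        Reader<String> reader = client.newReader(Schema.STRING)
                .topic("my-topic")
                .startMessageId(MessageId.earliest)
                .create();
        while (reader.hasMessageAvailable()) {
            System.out.println(reader.readNext().getValue());
        }

        client.close();
    }
}
```

With the Consumer API, the broker remembers where each subscription left off; with the Reader API, positioning is entirely up to the application.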
Pulsar's architecture for data persistence is different from Kafka's. Kafka stores messages in log files on the local broker, while Pulsar stores all topic data in a dedicated storage layer, Apache BookKeeper. In short, BookKeeper is a highly scalable, fault-tolerant, low-latency storage service optimized for real-time persistent workloads, and it ensures the availability of the data. Kafka's log files, by contrast, live on individual brokers, so a catastrophic server failure can compromise them; Kafka cannot fully guarantee data availability. This guaranteed persistence layer gives Pulsar another advantage: its brokers are stateless, which is fundamentally different from Kafka. Because no real data needs to move when the cluster grows, Pulsar brokers can scale horizontally, seamlessly, to meet increasing demand.
What if a Pulsar broker goes down? The topics it served are immediately reassigned to another broker. Since brokers keep no topic data on disk, service discovery transparently redirects producers and consumers to the new broker.
Kafka needs to clean up old data to reclaim disk space. Pulsar, by contrast, stores topic data in a tiered structure that can be extended with additional disks or Amazon S3, which lets you scale out topic storage, and offload old data, almost indefinitely. Even cooler, Pulsar presents the data to consumers seamlessly, as if it all lived on the same drive. Since you don't need to clean up old data, these organized Pulsar topics can even serve as a kind of data lake, which is a genuinely valuable scenario. Of course, you can also configure Pulsar to purge old data if needed.
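As a sketch, offloading closed topic segments to S3 is configured in the broker configuration roughly like this (the bucket and region values are placeholders, and exact property names may vary by Pulsar version):

```properties
# Driver for long-term (tiered) storage of ledger segments
managedLedgerOffloadDriver=aws-s3
# Destination bucket and region (placeholder values)
s3ManagedLedgerOffloadBucket=my-pulsar-offload-bucket
s3ManagedLedgerOffloadRegion=us-west-2
```

Once configured, offloaded segments remain readable through the same topic, so consumers do not need to know where the bytes physically live.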
Pulsar natively supports multi-tenancy, with data isolation at the topic-namespace level; Kafka cannot achieve this kind of isolation. In addition, Pulsar supports fine-grained access control, which makes Pulsar applications more secure and reliable.
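A sketch of how tenants and isolated namespaces are set up with the `pulsar-admin` CLI (the tenant, role, and cluster names are invented, and a running cluster is required):

```shell
# Create a tenant restricted to specific admin roles and clusters (example names)
bin/pulsar-admin tenants create my-team \
  --admin-roles my-team-admin \
  --allowed-clusters us-west

# Create a namespace under the tenant; its topics are isolated per tenant
bin/pulsar-admin namespaces create my-team/app-events

# Grant a role fine-grained permission to produce/consume in the namespace
bin/pulsar-admin namespaces grant-permission my-team/app-events \
  --role my-team-writer \
  --actions produce,consume
```

Topics then live under the tenant and namespace, e.g. `persistent://my-team/app-events/orders`.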
Pulsar provides client libraries for Java, Go, Python, C++, and WebSocket.
Pulsar's native support for Functions as a Service (FaaS) is a very cool feature: like Amazon Lambda, it can analyze, aggregate, or summarize data streams in real time. With Kafka, you would additionally need a stream-processing system such as Apache Storm, which adds cost and maintenance burden; Pulsar beats Kafka in this respect. As of now, Pulsar Functions supports Java, Python, and Go, with other languages to be supported in future releases.
Use cases for Pulsar Functions include content-based routing, aggregation, message formatting, message cleansing, and more.
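As an illustration of content-based routing, the routing decision itself can be expressed as plain Java. In a real Pulsar Function, this logic would sit inside `process()` and pass the returned name to `context.newOutputMessage()` to forward the message; the topic names below are invented:

```java
public class ContentRouter {
    // Pick a destination topic based on the message payload.
    // In a Pulsar Function, the returned name would be used with
    // context.newOutputMessage(topic, schema) to forward the message.
    public static String route(String payload) {
        if (payload.contains("ERROR")) {
            return "persistent://public/default/errors";      // invented topic
        }
        if (payload.startsWith("{")) {
            return "persistent://public/default/json-events"; // invented topic
        }
        return "persistent://public/default/raw-events";      // invented topic
    }

    public static void main(String[] args) {
        System.out.println(route("ERROR: disk full"));  // → .../errors
        System.out.println(route("{\"user\":42}"));     // → .../json-events
        System.out.println(route("plain text line"));   // → .../raw-events
    }
}
```

Keeping the decision in a pure function like this also makes the routing rule easy to unit-test outside the cluster.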
The following is an example of the word count.
```java
package org.example.functions;

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

import java.util.Arrays;

public class WordCountFunction implements Function<String, Void> {

    // This is invoked every time a message is published to the input topic
    @Override
    public Void process(String input, Context context) throws Exception {
        Arrays.asList(input.split(" ")).forEach(word -> {
            String counterKey = word.toLowerCase();
            // Increment a distributed counter keyed by the word
            context.incrCounter(counterKey, 1);
        });
        return null;
    }
}
```
Pulsar supports multiple data sinks for routing processed messages to major systems, such as Pulsar topics themselves, Cassandra, Kafka, AWS Kinesis, Elasticsearch, Redis, MongoDB, InfluxDB, and more.
In addition, processed message flows can be persisted to disk files.
Pulsar provides Pulsar SQL, built on the Presto engine, to efficiently query historical messages stored in BookKeeper. Presto is a high-performance distributed SQL query engine for big data that can query multiple data sources in a single query. The following is an example of a Pulsar SQL query.
```sql
show tables in pulsar."public/default"
```
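And a sketch of querying messages from a specific topic (the topic name `my-topic` is an assumption; `__publish_time__` is one of the metadata columns Pulsar SQL exposes alongside the payload):

```sql
select __publish_time__, *
from pulsar."public/default"."my-topic"
limit 10
```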
Pulsar has a powerful geo-replication mechanism that synchronizes messages between clusters in different regions almost instantly, preserving message integrity. When messages are produced to a Pulsar topic, they are first persisted in the local cluster and then asynchronously forwarded to the remote clusters. In Pulsar, geo-replication is enabled per tenant: it can be enabled between two clusters only when a tenant has been created with access to both.
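A sketch of enabling this with `pulsar-admin` (the tenant, namespace, and cluster names are placeholders): first give the tenant access to both clusters, then set the namespace's replication clusters:

```shell
# Allow the tenant to use both regional clusters (example names)
bin/pulsar-admin tenants update my-tenant \
  --allowed-clusters us-west,us-east

# Replicate every topic in this namespace between the two clusters
bin/pulsar-admin namespaces set-clusters my-tenant/replicated-ns \
  --clusters us-west,us-east
```

After this, messages produced in either cluster are asynchronously forwarded to the other, with no application changes.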
For messaging channel security, Pulsar natively supports TLS and JWT token-based authorization, so you can specify who may publish or consume messages on which topics. In addition, to improve security, Pulsar encryption allows applications to encrypt all messages on the producer side and decrypt them only when Pulsar delivers them to the consumer. Pulsar performs the encryption using a public/private key pair configured by the application, and only consumers with a valid key can decrypt the messages. This carries a performance cost, however, because every message must be encrypted and decrypted before it can be processed.
If you're currently using Kafka and want to migrate to Pulsar, take comfort: Pulsar natively supports working with Kafka data directly through connectors, or you can import an existing Kafka application's data into Pulsar, which is fairly easy.
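For Java applications, Pulsar also ships a Kafka-compatible client wrapper, so existing Kafka producer/consumer code can often be pointed at Pulsar by swapping one dependency. A Maven sketch (the version number is a placeholder, pick the release matching your cluster):

```xml
<!-- Replace the kafka-clients dependency with Pulsar's Kafka-compatible wrapper -->
<dependency>
  <groupId>org.apache.pulsar</groupId>
  <artifactId>pulsar-client-kafka</artifactId>
  <version>2.4.0</version> <!-- placeholder version -->
</dependency>
```

The application then keeps its Kafka API calls but connects to a Pulsar service URL instead of a Kafka bootstrap server.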
Conclusion
This article is not claiming that large-scale message processing platforms must use Pulsar rather than Kafka. My point is that Kafka's pain points already have a good solution in Pulsar, which is a good thing for any engineer or architect. Architecturally, Pulsar is also much faster in large-scale messaging solutions, and with Yahoo and Twitter (and many others) running Pulsar in production, it is stable enough to support any production environment. There is a bit of a learning curve in moving from Kafka to Pulsar, but the ROI is still significant!
For Pulsar’s enterprise services and support, please contact StreamNative ([email protected]).
By Anuradha Prasanna
Translation: Zhanying
Review: Jennifer + Sijie + Yjshen
Editor: Irene
Apache Pulsar is a next-generation, cloud-native distributed streaming data platform that originated at Yahoo. It was open-sourced in December 2016 and officially became an Apache top-level project in September 2018, gradually evolving from a single messaging system into a streaming data platform integrating messaging, storage, and lightweight functional computing. During Apache Pulsar's rapid growth, community members have also been evangelizing beyond Silicon Valley, starting an extraordinary journey in the Chinese community.