About Apache Pulsar
Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. Pulsar adopts an architecture that separates compute from storage and supports multi-tenancy, persistent storage, and cross-region geo-replication, offering strong consistency, high throughput, low latency, and high scalability. Apache Pulsar has been adopted by many large Internet and traditional-industry companies around the world, with use cases spanning artificial intelligence, finance, telecom operators, live and short video, the Internet of Things, retail and e-commerce, online education, and other industries; adopters include Comcast, Yahoo!, Tencent, China Telecom, China Mobile, BIGO, and VIPKID.
Confluent recently conducted a benchmark comparing the throughput and latency of Kafka, Pulsar, and RabbitMQ. According to the Confluent blog, Kafka achieves “optimal throughput” with “low latency”, while RabbitMQ achieves “low latency” with “low throughput”. Overall, the benchmark results show that Kafka is clearly superior in “speed.”
Kafka is a mature technology, but many companies today, from multinationals to innovative startups, have chosen Pulsar instead. At Splunk’s .conf20 conference, Sendur Sellakumar, Splunk’s Chief Product Officer, announced the company’s decision to replace Kafka with Pulsar:
“… We’ve decided to use Apache Pulsar as the underlying stream. We are betting the future of the company on a long-term architecture for enterprise-class multi-tenant streams.”
— Sendur Sellakumar, Chief Product Officer, Splunk
Splunk is just one of many companies using Pulsar. These companies chose Pulsar because, in modern elastic cloud environments such as Kubernetes, Pulsar can scale horizontally to handle large amounts of data in a cost-effective manner with no single point of failure. Pulsar’s built-in features, such as automatic data rebalancing, multi-tenancy, geo-replication, and persistent tiered storage, simplify operations and make it easier for teams to focus on business goals.
Developers chose Pulsar because these unique features and capabilities make Pulsar a cornerstone of streaming data.
With that in mind, we took a closer look at Confluent’s benchmark setup and conclusions and found two issues that are highly problematic. First, Confluent has a limited understanding of Pulsar, which is the biggest source of its inaccurate conclusions: if you don’t know Pulsar, you can’t test Pulsar’s performance with the right metrics.
Second, Confluent’s performance tests are based on a narrow set of test parameters. This limits the applicability of the results and does not give readers an accurate picture across different workloads and real-world application scenarios.
In order to provide the community with more accurate test results, we decided to address these issues and repeat the tests. Key adjustments include:
- We adjusted the benchmark settings to include the persistence levels supported by both Pulsar and Kafka, comparing throughput and latency at the same persistence level.
- We fixed the OpenMessaging Benchmark (OMB) framework to eliminate the variance introduced by using different instances, and corrected configuration errors in the OMB Pulsar driver.
- Finally, we measured additional performance factors and conditions, such as different numbers of partitions and mixed workloads that include writes, tailing reads, and catch-up reads, to get a fuller picture of performance.
With that done, we repeated the tests. The results showed that Pulsar performed significantly better than Kafka for scenarios that were closer to real-world workloads, while Pulsar performed as well as Kafka for the basic scenarios Confluent used in the tests.
The following sections highlight the most important conclusions from the tests. In the StreamNative benchmark results section, we cover the test setup and the test report in detail.
Summary of StreamNative benchmark results
With the same persistence guarantees as Kafka, Pulsar achieves a publish and end-to-end throughput of 605 MB/s (the same as Kafka) and a catch-up read throughput of 3.5 GB/s (3.5 times that of Kafka). Pulsar’s throughput is not affected by increasing the number of partitions or changing the persistence level, whereas Kafka’s throughput is severely affected by changes in either.
Pulsar’s latency is significantly lower than Kafka’s across the different test cases, including different numbers of subscriptions, different numbers of topics, and different persistence guarantees. Pulsar’s P99 latency is between 5 and 15 milliseconds, while Kafka’s P99 latency can reach several seconds and is greatly affected by the number of topics, the number of subscriptions, and the persistence guarantee.
Pulsar has significantly better I/O isolation than Kafka. Pulsar’s P99 publish latency remains around 5 ms even while consumers catch up on historical data. In contrast, Kafka’s latency is severely affected by catch-up reads: its P99 publish latency may increase from milliseconds to seconds.
All of our benchmark code is open source (on GitHub), so interested readers can generate their own results or dig deeper into the test results and metrics provided in the repository.
While our benchmark is more accurate and comprehensive than Confluent’s, it does not cover all scenarios. At the end of the day, no benchmark can substitute for testing on your own hardware with your real workloads. Readers are also encouraged to evaluate other variables and scenarios and to test with their own setups and environments.
Delve deeper into the Confluent benchmark
Confluent used the OpenMessaging Benchmark (OMB) framework as the basis for its benchmark, with some modifications. In this section, we describe the problems we found in the Confluent benchmark and explain how they affect the accuracy of its results.
Problems with Confluent’s setup
The conclusions of the Confluent benchmark are incorrect because Pulsar was configured improperly; we cover these issues in more detail in the StreamNative benchmark section. In addition to the Pulsar tuning issues, Confluent set different persistence guarantees for Pulsar and Kafka. The persistence level affects performance, and the persistence settings of the two systems must be the same for a comparison to be meaningful.
Confluent’s engineers used Pulsar’s default persistence guarantee, which is higher than Kafka’s. Raising the persistence level can severely affect latency and throughput, so the Confluent tests placed higher demands on Pulsar than on Kafka. The version of Pulsar used by Confluent did not yet support lowering persistence to the same level as Kafka; an upcoming version of Pulsar does, and that is the version used in this test. If Confluent’s engineers had used the same persistence settings on both systems, the results would have shown an accurate comparison. We certainly can’t blame Confluent’s engineers for not using features that hadn’t been released yet. However, their write-up does not provide this context and presents the results as if they came from equivalent persistence settings. We provide the additional scenarios in this article.
OMB framework issues
The Confluent benchmark follows the OMB framework guidelines, which recommend using the same instance type across the event-streaming systems under test. However, in our tests we found significant variance between different instances of the same type, especially in disk I/O performance. To minimize this variance, we ran Pulsar and Kafka on the same set of instances each run, which greatly improved the accuracy of our results. Small differences in disk I/O performance can make a big difference in overall system performance. We propose updating the OMB framework guidelines accordingly and hope this recommendation is adopted in the future.
Problems with Confluent’s methodology
The Confluent benchmark tests only a limited number of scenarios. For example, real workloads include writes, tailing reads, and catch-up reads. Confluent tested only tailing reads, which occur when a consumer reads the latest messages near the “tail” of the log. In contrast, a catch-up read occurs when a consumer has accumulated a large backlog of historical messages and must consume them to “catch up” to the tail of the log, a common and critical task in real systems. Catch-up reads cannot be ignored, because they can seriously affect write and tailing-read latencies. Because the Confluent benchmark focuses only on throughput and end-to-end latency in the tailing case, it does not provide a comprehensive picture of expected behavior under various workloads. To bring the results closer to real-world scenarios, we also thought it was important to benchmark different numbers of subscriptions and partitions. Few businesses deal with only a handful of topics with a few partitions and consumers; real deployments need to accommodate a large number of different consumers across many topics and partitions that map to business use cases.
The specific problems with Confluent’s methodology are summarized in the following table.
Many of Confluent’s benchmarking problems stem from a limited understanding of Pulsar. To help you avoid these problems in your own benchmarks, we share some insights into Pulsar below.
Accurate benchmarking requires an understanding of Pulsar’s durability guarantees. We’ll use this as a starting point, give a general overview of persistence in distributed systems, and then show how Pulsar and Kafka differ in their persistence guarantees.
Overview of persistence in distributed systems
Persistence refers to a system’s ability to maintain consistency and availability in the face of external problems such as hardware or operating system failures. Single-node storage systems, such as an RDBMS, rely on fsyncing writes to disk to ensure maximum persistence. Operating systems typically cache writes, which can be lost in the event of a failure, but fsync ensures the data is written to physical storage. In distributed systems, persistence typically comes from data replication, where multiple copies of the data are distributed to different nodes that can fail independently. However, local persistence (fsyncing data) should not be confused with replication persistence; the two serve different purposes. Below we explain the importance of these features and the key differences between them.
Replication persistence and local persistence
Distributed systems typically have both replication persistence and local persistence. Each type of persistence is controlled by a separate mechanism. These mechanisms can be used in a flexible combination, with different persistence levels set as required.
Replication persistence is achieved by an algorithm that creates multiple copies of the data, so the same data is stored in multiple locations, improving both availability and accessibility. The number of copies, N, determines the system’s fault tolerance, and many systems require a quorum of N/2 + 1 nodes to acknowledge a write (for example, with N = 3 copies, 2 nodes must acknowledge). Some systems can continue to serve existing data as long as any single copy remains available. This replication mechanism is critical for coping with the complete loss of data on an instance, and the ability of a new instance to replicate data from existing replicas is also critical for availability and consensus (an issue not explored in this section).
In contrast, local persistence concerns what an acknowledgment means at the level of each individual node. Local persistence requires that data be fsynced to persistent storage so that no data is lost in the event of a power outage or hardware failure. Fsyncing the data ensures that, after a machine recovers from a failure, the node still has all the data it previously acknowledged.
Persistence modes: synchronous and asynchronous
Different types of systems provide different levels of persistence assurance. In general, the overall persistence of a system is affected by:
- Whether to fsync data to the local disk
- Whether the system copies data to multiple locations
- When the system confirms replication to the peer
- When the system acknowledges writing to the client
These choices vary widely from system to system, and not all systems support user control of these values. Systems that lack some of these mechanisms, such as replication in a non-distributed system, have lower persistence.
We can define two persistence modes, “synchronous” and “asynchronous,” which control when the system acknowledges a write, both to its replication peers and to the client. The two modes operate as follows.
- Synchronous persistence: The system returns a write response to the peer system/client only after the data has been successfully fsynced to the local disk (local persistence) or replicated to multiple locations (replication persistence).
- Asynchronous persistence: The system returns a write response to the peer system/client before the data has been successfully fsynced to the local disk (local persistence) or replicated to multiple locations (replication persistence).
Persistence levels: measuring the persistence guarantee
Persistence guarantees exist in many forms, depending on the following variables:
- Whether the data is stored locally, replicated in multiple locations, or both
- When to acknowledge write (synchronous/asynchronous)
As with the persistence modes, we defined four persistence levels to distinguish between different distributed systems. Table 6 lists the levels of persistence from highest to lowest.
Most distributed relational database management systems, such as NewSQL databases, guarantee the highest level of persistence, so they are classified as level 1.
Like these databases, Pulsar is a Level 1 system that provides the highest level of persistence by default. In addition, Pulsar lets each application individually customize the persistence level it requires. In contrast, most Kafka production deployments are configured at Level 2 or Level 4. Kafka can also reach Level 1 by setting flush.messages=1 and flush.ms=0, but these two settings have a significant impact on throughput and latency, which we discuss in more detail in our benchmark.
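For reference, a minimal sketch of the Level 1 flush settings mentioned above: flush.messages and flush.ms are per-topic overrides, and log.flush.interval.messages and log.flush.interval.ms are the broker-wide equivalents in server.properties.
# per-topic overrides: force an fsync after every message
flush.messages=1
flush.ms=0
# broker-wide equivalents (server.properties)
log.flush.interval.messages=1
log.flush.interval.ms=0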
Let’s start with Pulsar and explore the durability of each system in detail.
Pulsar’s persistence
Pulsar provides persistence guarantees at every level, both by replicating data to multiple locations and by fsyncing data to local disks. Pulsar supports the two persistence modes described above (synchronous and asynchronous), and users can customize the settings for their application scenario, using a single mode or a combination of both.
Pulsar uses a Raft-equivalent, quorum-based replication protocol to control replication persistence. The replication persistence mode can be adjusted via the ack-quorum-size and write-quorum-size parameters. Table 7 lists the settings for these parameters, and Table 8 lists the persistence levels supported by Pulsar. (Pulsar’s replication protocol and consensus algorithm are outside the scope of this article and will be explored further in a future blog post.)
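As an illustration (not the exact commands used in the benchmark), these quorum settings can be applied per namespace with the pulsar-admin CLI; the namespace public/default and the values shown are placeholders:
# store each entry on 3 bookies and wait for 2 acknowledgments before confirming the write
bin/pulsar-admin namespaces set-persistence public/default --bookkeeper-ensemble 3 --bookkeeper-write-quorum 3 --bookkeeper-ack-quorum 2 --ml-mark-delete-max-rate 0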
Pulsar controls local persistence by writing and/or fsyncing data to a journal disk, and it provides configuration parameters to adjust the local persistence mode, listed in Table 9.
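As a rough sketch of these knobs on the bookie side (bookkeeper.conf), shown here with the values we understand to be the defaults:
# write entry payloads to the journal (set to false to bypass the journal)
journalWriteData=true
# fsync the journal before acknowledging a write (set to false for asynchronous local persistence)
journalSyncData=true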
Kafka’s persistence
Kafka provides three persistence levels: Level 1, Level 2 (the default), and Level 4. Kafka provides replication persistence at Level 2; at Level 4 it provides no persistence guarantee, because it does not fsync data to disk before acknowledging writes. Kafka can reach Level 1 by setting flush.messages=1 and flush.ms=0, but this configuration is rarely deployed in production.
Kafka’s ISR replication protocol controls its replication persistence. You can adjust Kafka’s replication persistence mode via the acks and min.insync.replicas parameters associated with this protocol. Table 10 lists the settings for these parameters, and Table 11 lists the persistence levels supported by Kafka. (A detailed explanation of the Kafka replication protocol is beyond the scope of this article; we will delve into the differences between Kafka and Pulsar replication in a future blog post.)
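For illustration, a minimal sketch (not the benchmark’s full configuration) of how synchronous replication persistence is typically expressed in Kafka:
# producer configuration: wait for all in-sync replicas to acknowledge each write
acks=all
# topic or broker configuration: require at least 2 in-sync replicas for a write to succeed
min.insync.replicas=2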
Unlike Pulsar, Kafka does not write data to a separate journal disk; it acknowledges a write before fsyncing the data to disk. This minimizes I/O contention between writes and reads and prevents performance degradation.
By setting flush.messages=1 and flush.ms=0, Kafka can fsync every message and greatly reduce the likelihood of message loss, but doing so severely impacts throughput and latency. As a result, this setup is almost never used in production deployments.
Because Kafka does not journal data, it runs the risk of losing data in the event of a machine failure or power outage. This shortcoming is significant and has real impact, and it is one of the main reasons Tencent’s billing system chose Pulsar.
Differences in durability between Pulsar and Kafka
Pulsar’s persistence settings are flexible and can be tuned to meet the requirements of each application, use case, or hardware configuration.
Kafka is less flexible, so depending on the scenario it may not be possible to use identical persistence settings on both systems, which makes benchmarking harder. To address this, the OMB framework recommends using the closest available settings.
With that background in mind, let’s look at the problems with the Confluent benchmark. Confluent attempted to mimic Pulsar’s fsync behavior, but in its benchmark it ended up configuring Kafka with asynchronous persistence and Pulsar with synchronous persistence. This asymmetry leads to incorrect test results and biased performance judgments. Our benchmark shows that, at equivalent persistence settings, Pulsar matches or exceeds Kafka’s performance while also offering stronger durability.
StreamNative benchmark testing
To get a more accurate picture of Pulsar’s performance, we took the Confluent benchmark and addressed the issues described above. We focused on tuning Pulsar’s configuration, ensuring the same persistence settings for both systems, and incorporating other performance factors and conditions, such as different numbers of partitions and mixed workloads, to measure performance across different application scenarios. The following sections detail the configuration adjustments made in our tests.
StreamNative test setup
Our benchmark setup includes all the persistence levels supported by Pulsar and Kafka, allowing us to compare throughput and latency at the same persistence level. The persistence settings we used are as follows.
Replication persistence settings
Our replication persistence settings are the same as Confluent’s and remain unchanged. For completeness, Table 12 lists the settings we used.
Local persistence settings
A new feature of Pulsar gives applications the option to skip journaling, relaxing the local persistence guarantee to avoid write amplification and improve write throughput. (This feature is available in the next release of Apache BookKeeper.) It is not enabled by default and is not recommended for most scenarios, because bypassing the journal makes it possible to lose messages.
To ensure an accurate performance comparison between the two systems, we used this feature in our benchmark. With the journal bypassed, Pulsar provides the same local persistence guarantee as Kafka’s default fsync settings.
This feature adds a new local persistence mode (Async - Bypass journal). We used this mode to configure Pulsar to match Kafka’s default local persistence level. Table 13 lists the specific settings used in the benchmark.
StreamNative framework
We found some problems in Confluent’s fork of the OMB framework and fixed several configuration errors in the OMB Pulsar driver. The new benchmark code we developed, including the fixes described below, is available in an open-source repository.
Fixing OMB framework issues
Confluent followed the OMB framework recommendation and used two sets of instances, one for Kafka and one for Pulsar. In our benchmark, we allocated a single set of three instances to improve the reliability of the tests: we first ran Pulsar on the three instances, then ran the same tests for Kafka on the same set of instances.
In other words, we benchmarked the two systems on the same machines, clearing the file-system page cache before each run to ensure that the current test was not affected by previous tests.
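For reference, a typical way to clear the Linux page cache between runs looks like the following (a sketch; the benchmark’s exact scripts may differ):
# flush dirty pages, then drop the page cache, dentries, and inodes
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches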
Fixing OMB Pulsar driver configuration issues
We fixed a number of errors in Confluent’s OMB Pulsar driver configuration. The following sections describe the specific adjustments we made to the broker, bookie, producer, and consumer configurations, as well as to the Pulsar image.
Adjusting the broker configuration
The Pulsar broker uses the managedLedgerNewEntriesCheckDelayInMillis parameter to determine how long (in milliseconds) a catch-up subscription must wait before dispatching messages to its consumers. In the OMB framework this value was set to 10, which is the main reason the Confluent benchmark reached the inaccurate conclusion that Pulsar’s latency is higher than Kafka’s. We changed this value to 0 so that Pulsar’s dispatch behavior matches Kafka’s. After this change, Pulsar’s latency was significantly lower than Kafka’s in all test scenarios.
To optimize performance, we also increased the bookkeeperNumberOfChannelsPerBookie parameter from 16 to 64, to prevent any single Netty channel between a broker and a bookie from becoming a bottleneck. Such a bottleneck causes high latency when a large number of messages accumulate in the Netty I/O queue.
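In broker.conf, these two adjustments look roughly like the following sketch:
# dispatch messages to catch-up subscriptions without an added check delay
managedLedgerNewEntriesCheckDelayInMillis=0
# use more Netty channels per bookie so no single channel becomes a bottleneck
bookkeeperNumberOfChannelsPerBookie=64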
We will provide clearer guidance in the Pulsar documentation to help users optimize end-to-end latency.
Adjusting the bookie configuration
We added a new bookie configuration to test Pulsar’s performance with the journal bypassed, a mode in which Pulsar and Kafka provide equivalent durability guarantees.
To test the performance of this feature, we built a custom image based on the official Pulsar 2.6.1 release that includes this change. (See the Pulsar image section for details.)
We manually configured the following settings to bypass journaling in Pulsar:
journalWriteData = false
journalSyncData = false
In addition, we increased the journalPageCacheFlushIntervalMSec parameter from 1 to 1000 to benchmark asynchronous local persistence in Pulsar (journalSyncData=false). Increasing this value lets Pulsar simulate Kafka’s flush behavior, as described below.
Kafka ensures local persistence by flushing the file-system page cache to disk. The data is flushed by a set of background kernel threads (historically known as pdflush). The pdflush interval is configurable and is typically around 5 seconds between flushes. Setting Pulsar’s journalPageCacheFlushIntervalMSec parameter to 1000 makes its background flushing comparable to Kafka’s reliance on the pdflush interval. With this change, we could benchmark asynchronous local persistence more accurately and compare Pulsar and Kafka more fairly.
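Putting the bookie-side settings together, the asynchronous local persistence configuration used for this comparison looks roughly like the following bookkeeper.conf sketch:
# acknowledge writes without waiting for an fsync of the journal
journalSyncData=false
# flush the journal page cache in the background once per second
journalPageCacheFlushIntervalMSec=1000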
Adjusting the producer configuration
Our batching configuration is the same as Confluent’s, with one exception: we increased the partition-switch interval so that it is longer than the batching interval. Specifically, we increased the batchingPartitionSwitchFrequencyByPublishDelay parameter from 1 to 2. This change ensures that a Pulsar producer focuses on only one partition during each batching period.
Setting the switch interval equal to the batching interval causes Pulsar to switch partitions too frequently, producing many small batches and potentially hurting throughput. Setting the switch interval larger than the batching interval minimizes this risk.
Adjusting the consumer configuration
When an application cannot process incoming messages quickly enough, the Pulsar client applies backpressure using its receiver queue. The size of the consumer’s receiver queue affects end-to-end latency: larger queues can prefetch and cache more messages than smaller queues.
Two parameters determine the receiver queue size: receiverQueueSize and maxTotalReceiverQueueSizeAcrossPartitions. Pulsar calculates the effective receiver queue size as follows:
Math.min(receiverQueueSize, maxTotalReceiverQueueSizeAcrossPartitions / number of partitions)
For example, if maxTotalReceiverQueueSizeAcrossPartitions is set to 50000 and there are 100 partitions, the Pulsar client sets the receiver queue size of each partition on the consumer to 500.
In our benchmark, we increased maxTotalReceiverQueueSizeAcrossPartitions from 50000 to 5000000. This tuning ensures that consumers do not apply backpressure.
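As a quick sanity check of the effect (assuming the Pulsar client’s default receiverQueueSize of 1000; the benchmark’s driver configuration may differ), with 100 partitions the effective per-partition queue becomes:
Math.min(1000, 5000000 / 100) = Math.min(1000, 50000) = 1000
so the per-partition receiver queue is limited only by receiverQueueSize, and the cross-partition cap no longer throttles prefetching.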
Pulsar image
We built a custom version of Pulsar (v2.6.1-sn-16) that includes the Pulsar and BookKeeper fixes described above. Version 2.6.1-sn-16 is based on the official Pulsar 2.6.1 release and can be downloaded from https://github.com/streamnative/pulsar/releases/download/v2.6.1-sn-16/apache-pulsar-2.6.1-sn-16-bin.tar.gz.
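For example, the release can be fetched and unpacked as follows (the extracted directory name is assumed to match the tarball name):
wget https://github.com/streamnative/pulsar/releases/download/v2.6.1-sn-16/apache-pulsar-2.6.1-sn-16-bin.tar.gz
tar xzf apache-pulsar-2.6.1-sn-16-bin.tar.gz
cd apache-pulsar-2.6.1-sn-16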
StreamNative test method
We adjusted Confluent’s benchmark methodology to get a fuller picture of performance under realistic workloads. Specifically, we made the following adjustments:
- Added catch-up reads to evaluate:
- The maximum throughput each system can achieve while handling catch-up reads
- How catch-up reads affect publish and end-to-end latency
- Varied the number of partitions to see how each change affects throughput and latency
- Varied the number of subscriptions to see how each change affects throughput and latency
Our benchmark scenario tested the following types of workloads:
- Maximum throughput: What is the maximum throughput each system can achieve?
- Latency: What is the minimum publish and end-to-end latency each system can achieve at a given throughput?
- Catch-up read: What is the maximum throughput each system can achieve when reading messages from a large backlog?
- Mixed workload: What is the minimum publish and end-to-end latency each system can achieve while consumers perform catch-up reads? How do catch-up reads affect publish latency and end-to-end latency?
Testbed
The OMB framework recommends specific testbed definitions for instance types and JVM configurations, and workload driver configurations for producers, consumers, and the server side. Our benchmark uses the same testbed definitions as Confluent; you can find them in the StreamNative branch of the Confluent OMB repository.
Below we highlight the disk throughput and disk fsync latency we observed. These hardware metrics must be taken into account when interpreting the benchmark results.
Disk throughput
Our benchmark uses the same instance type as Confluent: i3en.2xlarge (8 vCPUs, 64 GB RAM, 2 × 2,500 GB NVMe SSDs). We confirmed that i3en.2xlarge instances can support up to about 655 MB/s of write throughput across the two disks; see the dd results below.
Disk 1:
dd if=/dev/zero of=/mnt/data-1/test bs=1M count=65536 oflag=direct
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 210.08 s, 327 MB/s
Disk 2:
dd if=/dev/zero of=/mnt/data-2/test bs=1M count=65536 oflag=direct
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 209.635 s, 328 MB/s
Disk fsync latency
When running latency-related tests, it is important to capture the fsync latency of the NVMe SSDs. We observed P99 fsync latencies between 1 ms and 6 ms across the three instances, as shown in the figure below. As mentioned earlier, disks vary considerably from instance to instance, mainly in this latency, so we selected a group of instances with consistent latency.
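One common way to probe fsync latency on a disk (not necessarily the tool used to produce the figures here) is fio’s fdatasync mode, which reports sync latency percentiles:
# write 4 KB blocks and issue fdatasync after every write, reporting sync latency percentiles
fio --name=fsync-test --directory=/mnt/data-1 --rw=write --bs=4k --size=1G --fdatasync=1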
StreamNative benchmark results
Our benchmark results are summarized below. The full benchmark report can be downloaded from StreamNative or viewed in the OpenMessaging-Benchmark repository.
Maximum throughput test
The maximum throughput test is designed to determine the maximum throughput each system can achieve when handling workloads with different durability guarantees, including publishing and tailing reads. We varied the number of topic partitions to see how each change affects maximum throughput.
We found that:
- With the persistence guarantee configured at Level 1 (synchronous replication persistence, synchronous local persistence), Pulsar’s maximum throughput is about 300 MB/s, which is the physical bandwidth limit of its journal disk, while Kafka reaches around 420 MB/s with 100 partitions. It is worth noting that at Level 1 persistence, Pulsar dedicates one disk as a journal disk for writes and the other as a ledger disk for reads, whereas Kafka uses both disks for reads and writes simultaneously. Although Pulsar’s setup provides better I/O isolation, its throughput is limited by the maximum bandwidth of a single disk (~300 MB/s). Alternative disk configurations can make Pulsar more cost-effective; this topic will be discussed in a future blog post.
- With persistence configured at Level 2 (synchronous replication persistence, asynchronous local persistence), both Pulsar and Kafka achieve a maximum throughput of about 600 MB/s. Both systems reach the physical limit of disk bandwidth.
- On a single partition, Kafka’s maximum throughput is only half of Pulsar’s.
- Pulsar’s throughput is not affected by changing the number of partitions, but Kafka’s throughput is.
- When the number of partitions increased from 100 to 2000, Pulsar maintained maximum throughput (about 300 MB/s with level 1 persistence guarantee and about 600 MB/s with level 2 persistence guarantee).
- As the number of partitions increases from 100 to 2000, Kafka’s throughput drops by half.
Publish and end-to-end latency test
The publish and end-to-end latency test is designed to determine the minimum latency each system can achieve when handling workloads with different persistence guarantees, including publishing and tailing reads. We varied the number of subscriptions and the number of partitions to see how each change affects publish and end-to-end latency.
We found that:
- Pulsar’s publish and end-to-end latency were significantly (hundreds of times) lower than Kafka in all test cases, which evaluated various persistence guarantees and varying numbers of partitions and subscriptions. Even when the number of partitions increases from 100 to 10,000 or the number of subscriptions increases from 1 to 10, both the Pulsar P99 publish latency and end-to-end latency are within 10 milliseconds.
- Changes in the number of subscriptions and partitions can have a huge impact on Kafka’s publishing and end-to-end latency.
- As the number of subscriptions increased from 1 to 10, Kafka’s publish and end-to-end latency increased from about 5 milliseconds to about 13 seconds.
- As the number of topic partitions increased from 100 to 10,000, Kafka’s publish and end-to-end latency increased from about 5 milliseconds to about 200 seconds.
Catch-up read test
The catch-up read test is designed to determine the maximum throughput each system can achieve when handling a workload consisting only of catch-up reads. At the start of the test, the producer sends messages at a fixed rate of 200K messages per second. After the producer has sent 512 GB of data, consumers begin reading the accumulated messages. The consumers work through the backlog and then keep pace with the producer, which continues to send new messages at the same rate.
When handling catch-up reads, Pulsar’s maximum throughput is 3.5 times Kafka’s: Pulsar reaches 3.5 GB/s (3.5 million messages per second), while Kafka reaches only 1 GB/s (1 million messages per second).
Mixed workload testing
The mixed workload test is designed to determine the impact of catch-up reads on publishing and tailing reads in a mixed workload. At the start of the test, the producer sends messages at a constant rate of 200K messages per second while consumers consume them in tailing mode. After the producer has produced 512 GB of messages, a new group of catch-up consumers starts reading all messages from the beginning. Meanwhile, the producer and the existing tailing-read consumers continue to publish and consume messages at the same rate.
We tested Kafka and Pulsar with different persistence settings and found that catch-up reads significantly affect Kafka’s publish latency but have little impact on Pulsar’s. Kafka’s P99 publish latency increased from 5 milliseconds to 1-3 seconds, while Pulsar’s P99 publish latency remained between a few milliseconds and tens of milliseconds.
Conclusion
Benchmarks typically present only a narrow combination of business logic and configuration options, which may or may not reflect real-world application scenarios or best practices; that is what makes benchmarking tricky. A benchmark can be biased by problems with its framework, setup, and methodology, and we found such problems in Confluent’s recent benchmark.
At the request of the community, the StreamNative team embarked on this benchmark test to provide insight into the true performance of Pulsar. To make the benchmark more accurate, we fixed problems in the Confluent benchmark and added new test parameters to help us delve deeper into how the technologies compare in real use cases.
According to our benchmark results, with the same persistence guarantees Pulsar outperforms Kafka on workloads closer to real-world scenarios, and achieves the same end-to-end throughput as Kafka in the limited test cases Confluent used. In addition, Pulsar delivered better latency and I/O isolation than Kafka in every test case, across different numbers of subscriptions, numbers of topics, and persistence guarantees.
As mentioned earlier, no benchmark is a substitute for testing real-world workloads on your own hardware. Readers are encouraged to test Pulsar and Kafka with their own setups and workloads to understand how each system performs in their particular production environment. If you have any questions about Pulsar best practices, please contact us directly or feel free to join the Pulsar Slack.
In the coming months, we will publish a series of blog posts to help the community better understand and leverage Pulsar for their business needs. We will cover Pulsar’s performance across different workloads and setups, how to select and size hardware across different cloud providers and on-premises environments, and how to use Pulsar to build the most cost-effective streaming data platform.
About the authors:
Sijie Guo is the co-founder and CEO of StreamNative. He has worked in the messaging and streaming data industry for more than 10 years and is a recognized expert in the field. Prior to founding StreamNative, he co-founded Streamlio, a company focused on real-time solutions. He was the technical lead of Twitter’s messaging infrastructure team and co-created DistributedLog and Twitter EventBus. While at Yahoo, he led the team that developed BookKeeper and Pulsar. He is the Vice President of Apache BookKeeper and a member of the Apache Pulsar PMC.
Penghui Li is a software engineer at StreamNative and an Apache Pulsar Committer/PMC member. He previously worked at Zhaopin, where he was the driving force behind adopting Apache Pulsar. His career has revolved around messaging systems and microservices, and he is now fully engaged in the world of Pulsar.
Reference links:
- StreamNative benchmark test report
- StreamNative benchmark GitHub repository
- Pulsar image download
- OpenMessaging Benchmark (OMB) framework
- Why Tencent’s billing system chose Pulsar
Further reading
- Pulsar vs. Kafka: Features, Performance, And Use Cases
- Pulsar vs. Kafka (Part 2) : Cases, Features, communities
- Apache Pulsar and Apache Kafka performance comparison analysis in financial scenarios