This article is original content from AI Front.
Why Apache Pulsar: IO Isolation


By Matteo Merli & Karthik Ramasamy


Translator | Xue Mingdeng


Editor | Emily

This is the second article in a series on Apache Pulsar’s key features. Pulsar is a next-generation publish-subscribe messaging system developed and open sourced by Yahoo. The first article, “Why Apache Pulsar”, introduced Pulsar’s flexible messaging model, multi-tenancy, multi-region replication, and persistence. This article covers Pulsar’s I/O isolation mechanism, scalability, security model, multi-language API, and ease of operation.


Read/write I/O isolation

In most messaging systems, slow consumers degrade performance. When one consumer on a topic falls behind, it affects the faster consumers on the same topic. This happens because a lagging consumer forces the messaging system to fetch data from storage, which causes I/O jitter and reduces throughput. Any consumer whose data must first be loaded into memory is affected. The root cause is that read and write operations share the same execution path.

Pulsar solves this problem by using BookKeeper as its storage system. With BookKeeper, Pulsar achieves I/O isolation by giving read and write operations separate execution paths. Regular (tailing) reads, from consumers that are keeping up, are served directly by the Pulsar broker. Writes go to BookKeeper’s write-ahead log (WAL). Catch-up reads, from consumers that have fallen behind, are served from BookKeeper’s back-end storage device.
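The three paths described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Pulsar or BookKeeper code: the class names Broker and Bookie, the cache_size parameter, and the counters are all hypothetical, and the real system flushes the write-ahead log to back-end storage in the background rather than writing both at once.

```python
from collections import deque


class Bookie:
    """Toy storage node with a write path and a separate read path."""

    def __init__(self):
        self.journal = []          # write-ahead log: sequential appends only
        self.ledger_storage = {}   # back-end storage: serves catch-up reads
        self.ledger_reads = 0      # counts reads that hit back-end storage

    def add_entry(self, entry_id, payload):
        # Writes only append to the journal; here the entry is also made
        # visible in back-end storage immediately (a simplification).
        self.journal.append((entry_id, payload))
        self.ledger_storage[entry_id] = payload

    def read_entry(self, entry_id):
        self.ledger_reads += 1
        return self.ledger_storage[entry_id]


class Broker:
    """Toy broker that keeps the most recent entries in memory."""

    def __init__(self, bookie, cache_size=2):
        self.bookie = bookie
        self.cache_size = cache_size
        self.cache = {}
        self.cache_order = deque()
        self.next_id = 0

    def publish(self, payload):
        entry_id = self.next_id
        self.next_id += 1
        self.bookie.add_entry(entry_id, payload)       # write path
        self.cache[entry_id] = payload
        self.cache_order.append(entry_id)
        if len(self.cache_order) > self.cache_size:    # evict oldest entry
            del self.cache[self.cache_order.popleft()]
        return entry_id

    def read(self, entry_id):
        if entry_id in self.cache:                     # tailing read
            return self.cache[entry_id], "broker"
        # Catch-up read: goes to back-end storage, never to the journal.
        return self.bookie.read_entry(entry_id), "bookie storage"
```

In this model a consumer that is keeping up never touches the bookie, and a consumer that has fallen behind reads from back-end storage without competing with journal appends.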

A messaging system must offer controlled, predictable latency for publishing messages under all conditions. I/O isolation keeps publish latency low and predictable even when the disks are under heavy read load.

Scalability

Scalability is equally important for a messaging system. The scalability of a publish-subscribe messaging system can be measured along the following dimensions:

High throughput: Pulsar achieves high throughput by making full use of the available disk bandwidth (IOPS). Throughput depends on message size. For example, Pulsar can reach 120 MB per second with 1 KB messages, but with small messages of, say, 10 bytes, throughput might be only 1.8 megabits per second. Both results are for a single publisher writing to a single topic partition, with 99% of writes completing in under 5 milliseconds.

Number of topics: Topic scalability is the ability of a publish-subscribe messaging system to support a large number of topics. Pulsar can scale from hundreds of topics to millions while maintaining good performance. Topic scalability depends on how the data is organized and stored. If each topic’s data is kept in its own file or directory, scalability suffers because disk I/O becomes scattered as files are periodically flushed from the page cache to disk. Pulsar instead stores data in bookies (BookKeeper servers), where messages from different topics are aggregated, sorted, stored in large files, and indexed. This is what allows Pulsar to support a very large number of topics.
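The storage layout described for topic scalability can be sketched as one shared append-only log plus an index. This is a simplified illustration, not BookKeeper’s actual on-disk format: the EntryLog class and its methods are hypothetical names, an in-memory buffer stands in for the large file, and the sorting step mentioned above is left out.

```python
import io


class EntryLog:
    """Toy shared log: all topics append to one file, an index finds entries."""

    def __init__(self):
        self.log = io.BytesIO()   # stands in for one large on-disk file
        self.index = {}           # (topic, entry_id) -> (offset, length)
        self.next_entry = {}      # topic -> next entry id

    def append(self, topic, payload):
        entry_id = self.next_entry.get(topic, 0)
        self.next_entry[topic] = entry_id + 1
        offset = self.log.seek(0, io.SEEK_END)   # always write at the end
        self.log.write(payload)
        self.index[(topic, entry_id)] = (offset, len(payload))
        return entry_id

    def read(self, topic, entry_id):
        offset, length = self.index[(topic, entry_id)]
        self.log.seek(offset)
        return self.log.read(length)


log = EntryLog()
log.append("orders", b"order-1")
log.append("clicks", b"click-1")
log.append("orders", b"order-2")
second_order = log.read("orders", 1)   # looked up through the index
```

Adding a topic in this model adds index entries but no new files, so writes stay sequential no matter how many topics exist.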

Security

Pulsar supports pluggable authentication and can be configured to use multiple authentication mechanisms. The purpose of authentication is to establish the client’s identity and assign the client a role token. This token is then used to determine what the client is allowed to do. Pulsar ships with two authentication implementations: TLS client authentication and Athenz, a role-based authentication system developed by Yahoo.
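The idea of pluggable authentication followed by a role-based permission check can be sketched as follows. Everything here is hypothetical and greatly simplified: the Gatekeeper class, both provider functions, the credential fields, and the token value are illustrative assumptions and do not correspond to Pulsar’s actual provider interfaces.

```python
from typing import Callable, Dict, Optional, Set, Tuple

# A provider maps client credentials to a role name, or None if rejected.
AuthProvider = Callable[[dict], Optional[str]]


def tls_provider(credentials: dict) -> Optional[str]:
    # Assumption: the role is taken from the client certificate's common name.
    return credentials.get("cert_common_name")


def static_token_provider(credentials: dict) -> Optional[str]:
    # Hypothetical second mechanism backed by a fixed token table.
    known_tokens = {"secret-123": "billing-app"}
    return known_tokens.get(credentials.get("token"))


class Gatekeeper:
    """Authenticates with any registered provider, then authorizes by role."""

    def __init__(self):
        self.providers: Dict[str, AuthProvider] = {}
        self.grants: Dict[Tuple[str, str], Set[str]] = {}

    def register(self, method: str, provider: AuthProvider) -> None:
        self.providers[method] = provider

    def grant(self, role: str, topic: str, actions: Set[str]) -> None:
        self.grants.setdefault((role, topic), set()).update(actions)

    def authenticate(self, method: str, credentials: dict) -> Optional[str]:
        provider = self.providers.get(method)
        return provider(credentials) if provider else None

    def is_allowed(self, role: Optional[str], topic: str, action: str) -> bool:
        return role is not None and action in self.grants.get((role, topic), set())
```

Because the permission check sees only the role, a new authentication mechanism can be registered without changing any of the grants.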

Multi-language API

Applications can interact with Pulsar from a variety of programming languages. Pulsar provides official clients for three major languages: C++, Java, and Python, so you can pick whichever fits your application. The client APIs are intuitive and consistent across languages. In addition, the official clients provide both synchronous and asynchronous read and write operations to suit different styles of application. The two styles have the same semantics and differ only in how completion is reported: a synchronous method blocks until the operation completes, while an asynchronous method returns a Future object that tracks whether the operation has completed.
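The relationship between the synchronous and asynchronous styles can be sketched with Python’s standard concurrent.futures module. ToyProducer is a hypothetical stand-in written for this illustration and is not the Pulsar client API; it only shows a blocking send built on top of a Future-returning send.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class ToyProducer:
    """Hypothetical producer offering blocking and Future-based sends."""

    def __init__(self):
        # A single worker keeps messages in submission order.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self.sent = []

    def _do_send(self, payload):
        message_id = len(self.sent)
        self.sent.append(payload)
        return message_id

    def send_async(self, payload) -> Future:
        # Returns immediately; the Future tracks completion.
        return self._executor.submit(self._do_send, payload)

    def send(self, payload):
        # Same operation, but the caller blocks until it completes.
        return self.send_async(payload).result()

    def close(self):
        self._executor.shutdown(wait=True)


producer = ToyProducer()
first_id = producer.send(b"blocking send")
pending = producer.send_async(b"non-blocking send")
second_id = pending.result()   # wait only when the result is needed
producer.close()
```

Defining the blocking call in terms of the asynchronous one is a single way to keep the semantics of the two identical.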

Operational maturity

Pulsar is easy to manage, and you can expand capacity by adding broker nodes and storage nodes while the system is running. If a storage node goes down, background processes automatically copy the data it held from the surviving replicas on other nodes to another available storage node. Pulsar offers several ways to manage clusters: command-line tools, Java libraries, and REST APIs. The REST APIs offer the most flexibility, since you can build your own management tools on top of them or use them alongside existing tools.
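The automatic recovery described above can be illustrated with a small placement function. This is a minimal sketch under assumptions, not BookKeeper’s recovery algorithm: the rereplicate function, the placement dictionary, and the rule of picking the first eligible node in sorted order are all hypothetical.

```python
def rereplicate(placement, failed_node, live_nodes):
    """Return a new placement with copies from failed_node restored elsewhere.

    placement maps an entry name to the set of nodes holding a copy.
    """
    new_placement = {}
    for entry, nodes in placement.items():
        survivors = nodes - {failed_node}
        if failed_node in nodes and survivors:
            # Copy from a surviving replica to a live node without a copy.
            candidates = sorted(
                n for n in live_nodes
                if n != failed_node and n not in survivors
            )
            if candidates:
                survivors = survivors | {candidates[0]}
        new_placement[entry] = survivors
    return new_placement


before = {"entry-1": {"node-a", "node-b"}, "entry-2": {"node-b", "node-c"}}
after = rereplicate(before, "node-b", ["node-a", "node-c", "node-d"])
```

After the call, every entry that lost a copy on the failed node has two copies again, and the original placement is left unchanged.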

Enterprise-grade

Yahoo designed Pulsar to address problems and scenarios that existing open source messaging systems did not handle well. It meets requirements for high throughput (processing hundreds of billions of messages), strong persistence guarantees, and low latency. Pulsar has been running at Yahoo for three years, supports 1.4 million topics with 99% of latencies under 5 milliseconds, is deployed in more than 10 data centers (with full-mesh replication enabled), and has handled over 100 trillion messages.

Conclusion

Pulsar is a next-generation publish-subscribe messaging system that complements other open source messaging systems. In these two articles, we introduced Pulsar’s key features and explained how Pulsar achieves I/O isolation through its use of brokers and bookies, as well as how it supports enterprise-level features such as persistence, multi-tenancy, multi-region replication, high throughput, and topic scalability.

Streaml.io /blog/ why-AP…
