About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computing. It adopts an architecture that separates compute from storage, and supports multi-tenancy, persistent storage, and cross-region replication across multiple data centers, with strong consistency, high throughput, low latency, and high scalability. GitHub address: github.com/apache/puls…

This article was translated from a StreamNative blog post by Matteo Merli and Sijie Guo. The translator is Jipei@StreamNative. Original address: streamnative.io/en/blog/rel…

Since its creation and its graduation as an Apache Software Foundation top-level project, Apache Pulsar has matured into a complete project with a global community of tens of thousands of members. This year, with the conclusion of the Pulsar Summit North America and the release of Pulsar 2.8.0, Pulsar has reached a new milestone for both the project and the community. We take this opportunity to review Pulsar's technological and ecosystem development.

Apache Pulsar was born

Apache Pulsar has been widely adopted by hundreds of companies worldwide, including Splunk, Tencent, Verizon, and Yahoo! JAPAN. Apache Pulsar began as a cloud-native distributed messaging system and has since evolved into a complete messaging and streaming platform for publishing and subscribing to, storing, and processing large-scale data streams in real time.

Pulsar was launched in 2012 with the goal of consolidating the messaging systems within Yahoo! into a unified, logically centralized messaging platform that supports large clusters and spans regions. Other messaging systems at the time, including Kafka, could not meet Yahoo!'s requirements, such as multi-tenancy in large clusters, reliable I/O quality of service, millions of topics, and geo-replication.

At the time, there were generally two types of systems for handling data in motion: message queues, which handle mission-critical business events in real time, and streaming systems, which handle scalable data pipelines at large scale. Many companies had to limit themselves to one or the other, or adopt several different technologies. Choosing more than one technology often led to data isolation and data silos: one silo for the message queues used to build application services, and another for the streaming systems used to build data services. Corporate infrastructure thus became extremely complex; the diagram below illustrates this architecture.

However, as the variety of data companies needed to process grew (including operational data such as log data and click events), and as more downstream systems needed access to both business data and operational data, systems had to support both message-queue and streaming scenarios.

Beyond that, companies need an infrastructure platform that allows them to build all their applications on top of it, and then let those applications process dynamic data (message and stream data) by default. In this way, the real-time data infrastructure can be significantly simplified, as shown below.

With this vision, the Yahoo! team began building a unified messaging and streaming platform for data in motion. Below are the key milestones in Pulsar's development since its inception.

Milestone 1: Scalable storage for data streams

Pulsar was built on Apache BookKeeper from the start. Apache BookKeeper implements a log-like abstraction for continuous streams and provides the ability to run at Internet scale through a simple append/read log API. The log is a good abstraction for building distributed systems, such as distributed databases and publish-subscribe messaging. The write API takes the form of appends to the log; the read API reads sequentially from a start offset chosen by the reader. BookKeeper's implementation laid the foundation for a scalable, log-based messaging and streaming system.
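The append/read log abstraction can be sketched in a few lines of Python. This is a toy in-memory model of the semantics, not the BookKeeper API itself: appends return monotonically increasing offsets, and readers consume sequentially from any starting offset.

```python
class Log:
    """Toy append-only log modeling BookKeeper's write/read semantics."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        """Append an entry; return the offset it was assigned."""
        self._entries.append(entry)
        return len(self._entries) - 1

    def read(self, start_offset=0):
        """Read sequentially from start_offset to the current tail."""
        return self._entries[start_offset:]


log = Log()
log.append("event-a")              # assigned offset 0
offset_b = log.append("event-b")   # assigned offset 1
assert log.read(offset_b) == ["event-b"]
```

Because every reader is just an offset into the same immutable sequence, many independent readers can replay the log without coordinating with the writer.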

Milestone 2: Multi-tier architecture with separate storage and computing

A stateless service layer was introduced on top of the scalable log storage: stateless brokers publish and consume messages. This multi-tier architecture separates serving/computing from storage, allowing Pulsar to manage each layer independently.

This architecture provides instant scalability and higher availability, making Pulsar ideal for building mission-critical services such as billing platforms in financial scenarios, transaction processing systems for e-commerce and retail, and real-time risk-control systems for financial institutions.

Milestone 3: Unified messaging model and API

In modern data architectures, real-time scenarios generally fall into two categories: queuing and streaming. Queues are typically used to build core business application services, while streams are typically used to build real-time data services, such as data pipelines.

To provide a platform that can serve both applications and data services, a unified messaging model integrating queue and stream semantics is needed. The Pulsar topic became the single source of truth for consumption: a message is stored only once on a topic, but can be consumed in different ways through different subscriptions. This unification greatly reduces the complexity of managing and developing messaging and streaming applications.
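The unified model can be illustrated with a toy Python sketch (not the Pulsar client API): messages are stored once on the topic, while each subscription reads them independently. An exclusive, stream-style read replays everything to one consumer from its own cursor; a shared, queue-style read distributes each message to exactly one of several consumers, here modeled as simple round-robin.

```python
from itertools import cycle


class Topic:
    """Toy model: messages stored once, consumed via independent subscriptions."""

    def __init__(self):
        self.messages = []

    def publish(self, msg):
        self.messages.append(msg)

    def exclusive_read(self, cursor=0):
        """Stream-style: one consumer replays all messages from its cursor."""
        return self.messages[cursor:]

    def shared_read(self, consumers):
        """Queue-style: each message goes to exactly one consumer, round-robin."""
        out = {c: [] for c in consumers}
        for msg, consumer in zip(self.messages, cycle(consumers)):
            out[consumer].append(msg)
        return out


topic = Topic()
for i in range(4):
    topic.publish(f"m{i}")

assert topic.exclusive_read() == ["m0", "m1", "m2", "m3"]
assert topic.shared_read(["c1", "c2"]) == {"c1": ["m0", "m2"], "c2": ["m1", "m3"]}
```

The key point the sketch captures is that both consumption styles operate over the same single copy of the data.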

Milestone 4: The Schema API

Next, the Pulsar Schema Registry and new type-safe Producer and Consumer APIs were added to Pulsar. The built-in Schema Registry enables message producers and consumers on Pulsar topics to coordinate the structure of topic data through the Pulsar broker itself, without any external coordination mechanism. With schema information, every piece of data transmitted through Pulsar is fully discoverable, enabling users to build systems that easily adapt to changes in their data.

In addition, the Schema Registry tracks data compatibility between schema versions. As new schemas are uploaded, the registry verifies that consumers using older schemas can still read data written with the new schema version, ensuring that producers cannot break consumers.
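A much-simplified version of such a backward-compatibility check might look like the following (a hypothetical sketch, not the actual Schema Registry code): a new schema version is accepted only if every field the old schema defines is still present with the same type, so old consumers keep working while new optional fields may be added.

```python
def is_backward_compatible(old_schema, new_schema):
    """Toy check: a new version may add fields, but every field the old
    schema defines must still exist with the same type, so that consumers
    reading with the old schema can decode data written with the new one."""
    for name, ftype in old_schema.items():
        if new_schema.get(name) != ftype:
            return False
    return True


v1 = {"user_id": "string", "amount": "double"}
v2 = {"user_id": "string", "amount": "double", "currency": "string"}  # adds a field
v3 = {"user_id": "string"}  # drops a field old consumers rely on

assert is_backward_compatible(v1, v2)
assert not is_backward_compatible(v1, v3)
```

The real registry supports several configurable compatibility strategies; this sketch corresponds roughly to the "new schema can add but not remove or retype fields" case.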

Milestone 5: Functions and IO APIs

The next step was to build APIs that make it easy to get data into, out of, and processed within Pulsar. The goal is to let users easily build event-driven applications and real-time data pipelines on Apache Pulsar that process events as they arrive, no matter where the events come from.

The Pulsar IO API allows users to build real-time streaming data pipelines by plugging in source connectors, which ingest data from external systems into Pulsar, and sink connectors, which output data from Pulsar to external systems. Pulsar currently offers a number of built-in connectors that users can use right out of the box.
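The source/sink pattern can be sketched as follows. This is a toy Python model, not the Pulsar IO framework (real connectors run inside the Pulsar cluster): a source pulls records from an external system into a topic, and a sink drains the topic into a downstream external system.

```python
def run_pipeline(source_records, topic, sink_store):
    """Toy model of a Pulsar IO pipeline: source -> topic -> sink."""
    # Source connector: move records from the external system into the topic.
    for record in source_records:
        topic.append(record)
    # Sink connector: drain the topic into the downstream external system.
    while topic:
        sink_store.append(topic.pop(0))


external_db_rows = ["row-1", "row-2"]
topic, warehouse = [], []
run_pipeline(external_db_rows, topic, warehouse)
assert warehouse == ["row-1", "row-2"]
```

In a real deployment the source and sink run continuously and independently; the topic decouples them, so either side can be scaled or restarted without the other noticing.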

In addition, StreamNative maintains StreamNative Hub, a registry of Pulsar connectors that offers dozens of connectors integrating with popular data systems. While the IO API is used to build streaming data pipelines, the Functions API is used to build event-driven applications and real-time stream processors.

The concept of serverless functions was applied to stream processing, and the Functions API was built as a lightweight serverless framework in which users can write event processing logic in any supported language. Engineering teams can write stream processing logic without having to run and maintain another cluster.
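A Pulsar Function is essentially a callback applied to every message on an input topic, with results published to an output topic. The shape of the API can be sketched in pure Python; this is modeled loosely on the real Python Functions interface (which implements a `process(input, context)` method), and the runner below is a stand-in for the Functions runtime, not part of Pulsar.

```python
class ExclamationFunction:
    """Event-processing logic: append '!' to each incoming message."""

    def process(self, input, context=None):
        return input + "!"


def run_function(fn, input_topic):
    """Stand-in for the Functions runtime: apply fn to every event on the
    input topic, collecting the results as the output topic."""
    return [fn.process(msg) for msg in input_topic]


out = run_function(ExclamationFunction(), ["hello", "world"])
assert out == ["hello!", "world!"]
```

The appeal of the model is that the user writes only the `process` body; deployment, scaling, and delivery are handled by the Functions runtime inside the cluster.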

Milestone 6: Unlimited storage for Pulsar through tiered storage

As Apache Pulsar grew in popularity and the amount of data stored in it increased, users eventually hit a "retention cliff", at which point storing, managing, and retrieving data in Apache BookKeeper becomes more expensive. To work around this, operators and application developers typically used external storage such as AWS S3 as a sink for long-term storage; however, this forfeits most of the benefits of Pulsar's immutable stream and ordering semantics, and users end up managing two systems with different access patterns.

The introduction of tiered storage enables Pulsar to offload the majority of its data to remote cloud-native storage. This cheaper form of storage readily scales with data volume. More importantly, tiered storage gives Pulsar the batch-storage capability needed for integration with unified batch-and-stream processors such as Flink. Batch-stream integration with Pulsar lets companies quickly and easily query historical real-time streams, increasing their competitive advantage.
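As a rough illustration, offloading to S3 is configured in `broker.conf` along these lines (property names follow Pulsar's tiered-storage documentation; the bucket and region values are placeholders):

```properties
# broker.conf: offload closed ledger segments to S3-compatible storage
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
s3ManagedLedgerOffloadRegion=us-west-2
```

Offloading can then be triggered per topic, e.g. with `pulsar-admin topics offload --size-threshold 10G persistent://public/default/my-topic`, keeping only the most recent data in BookKeeper while older segments live in cheap object storage.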

Milestone 7: Pluggable protocols

With the introduction of tiered storage, Pulsar evolved from a pub/sub messaging system into a scalable streaming data system that can ingest, store, and process data streams. However, existing applications written against other messaging protocols (such as Kafka, AMQP, and MQTT) had to be rewritten to adopt Pulsar's messaging protocol.

The pluggable protocol API further reduces the overhead of adopting Pulsar for messaging and streaming: developers can extend Pulsar into other messaging domains while retaining all the advantages of the Pulsar architecture. StreamNative has partnered with other industry leaders to develop protocol handlers for popular protocols, including:

  • Kafka-on-Pulsar (KoP), open-sourced by OVHcloud and StreamNative in March 2020;
  • AMQP-on-Pulsar (AoP), open-sourced by China Mobile and StreamNative in June 2020;
  • MQTT-on-Pulsar (MoP), open-sourced by StreamNative in August 2020;
  • RocketMQ-on-Pulsar (RoP), open-sourced by Tencent Cloud and StreamNative in May 2021.
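Protocol handlers plug into the broker through configuration rather than application changes. For example, enabling KoP involves settings like the following in `broker.conf` (property names follow the KoP documentation; the directory path and port are placeholders):

```properties
# broker.conf: load the Kafka protocol handler (KoP)
messagingProtocols=kafka
protocolHandlerDirectory=./protocols
kafkaListeners=PLAINTEXT://0.0.0.0:9092
```

Existing Kafka clients can then point their bootstrap servers at the Pulsar broker's Kafka listener port and keep working without code changes.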

Milestone 8: Transaction API for exactly-once stream processing

Recently, transactions were added to Apache Pulsar to enable exactly-once semantics for stream processing. This foundational capability provides strong guarantees for streaming data transformations, making it easy to build scalable, fault-tolerant, stateful messaging and streaming applications.

In addition, the transaction API is not limited to the existing client languages: Pulsar's support for transactional message streams is a protocol-level capability that can be implemented in any language. Such protocol-level functionality is usable by a wide variety of applications.
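The exactly-once guarantee can be sketched with a toy model of a transactional consume-transform-produce loop (pure Python, not the Pulsar transaction API): acknowledgements on the input topic and publishes to the output topic become visible atomically on commit, and are discarded together on abort, so a failed transaction leaves the input intact for reprocessing.

```python
class Transaction:
    """Toy model: buffer produces and acks; apply both on commit, neither on abort."""

    def __init__(self, input_topic, output_topic):
        self.input_topic, self.output_topic = input_topic, output_topic
        self.pending_out, self.pending_acks = [], 0

    def send(self, msg):
        self.pending_out.append(msg)        # buffered, not yet visible

    def ack(self):
        self.pending_acks += 1              # buffered, not yet applied

    def commit(self):
        # Make the produces visible and consume the acked input atomically.
        self.output_topic.extend(self.pending_out)
        del self.input_topic[:self.pending_acks]

    def abort(self):
        # Drop both the buffered produces and the buffered acks.
        self.pending_out, self.pending_acks = [], 0


src, dst = ["a", "b"], []
txn = Transaction(src, dst)
for msg in list(src):
    txn.send(msg.upper())
    txn.ack()
txn.commit()  # consume + produce applied together
assert (src, dst) == ([], ["A", "B"])
```

In real Pulsar the broker's transaction coordinator plays the role of this buffering, but the observable contract is the same: downstream consumers never see the output of a transaction that did not commit.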

Building a unified messaging and streaming ecosystem

In addition to continually advancing Pulsar's technology, the community is committed to building a strong surrounding ecosystem. Pulsar supports rich pub/sub client libraries, connectors, functions, and pluggable protocols, along with integrations with popular engines, enabling Pulsar users to simplify their workflows and apply Pulsar to new scenarios.

Further reading

  • New features in Pulsar 2.8.0: exclusive producer, transactions, and more
  • Community events | Following the North America summit, the first Pulsar Summit Europe 2021 opens its Call for Speakers!

