Apache Pulsar officially became an Apache top-level project in September 2018. It is an enterprise-grade publish-subscribe (pub-sub) messaging system originally developed at Yahoo and open-sourced in late 2016. Pulsar has been running in Yahoo’s production environment for more than three years, powering key Yahoo applications such as Yahoo Mail, Yahoo Finance, Yahoo Sports, Flickr, the Gemini advertising platform, and Sherpa, Yahoo’s distributed key-value store. Since entering incubation, Pulsar has attracted a lot of attention in the open source community, where developers have worked together to contribute a number of enterprise-class features. These contributions have evolved Pulsar from a messaging system into a streaming data platform that integrates messaging, storage, and lightweight function-based computing.
Apache Pulsar is fundamentally different from traditional messaging middleware. The differences can be summed up as follows:
At the level of the message model and API, Pulsar unifies the two classic application scenarios of message-oriented middleware, queuing and streaming, on top of a log storage abstraction. Through its subscription abstraction, users can serve different application scenarios with a single system, which breaks down data silos between applications and services.
Figure 1: Apache Pulsar’s abstraction of the subscription model
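To make the idea of one log serving both queue and streaming consumers more concrete, here is a minimal sketch in plain Python. It is a toy model, not the Pulsar client API: the `Topic` class, the subscription names, and the two dispatch modes are assumptions made for illustration. Each subscription keeps its own cursor on the same log, so one published message can be read in order by a single consumer (streaming style) and also be spread across several workers (queue style).

```python
from collections import defaultdict
from itertools import cycle


class Topic:
    """A toy append-only log with named subscriptions (hypothetical, not Pulsar's API)."""

    def __init__(self):
        self.log = []       # the shared, ordered message log
        self.cursors = {}   # subscription name -> next offset to read

    def publish(self, message):
        self.log.append(message)

    def subscribe(self, name):
        # Each subscription gets its own cursor, starting at the beginning of the log.
        self.cursors.setdefault(name, 0)

    def dispatch(self, name, consumers, mode):
        """Deliver all unread messages of one subscription to its consumers.

        mode="exclusive": the first consumer receives everything in order (streaming).
        mode="shared": messages are spread round-robin over the consumers (queue).
        """
        pending = self.log[self.cursors[name]:]
        delivered = defaultdict(list)
        if mode == "exclusive":
            delivered[consumers[0]].extend(pending)
        elif mode == "shared":
            for consumer, message in zip(cycle(consumers), pending):
                delivered[consumer].append(message)
        else:
            raise ValueError(f"unknown mode: {mode}")
        # Advance only this subscription's cursor; other subscriptions are unaffected.
        self.cursors[name] = len(self.log)
        return dict(delivered)


topic = Topic()
topic.subscribe("analytics")   # streaming-style subscription
topic.subscribe("workers")     # queue-style subscription
for i in range(4):
    topic.publish(f"event-{i}")

stream_view = topic.dispatch("analytics", ["a1"], mode="exclusive")
queue_view = topic.dispatch("workers", ["w1", "w2"], mode="shared")
```

The producer writes each event once; the two subscriptions consume the same data independently, which is the property that removes the need to copy events between a queueing system and a streaming system.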
At the architectural level, Pulsar separates computing from storage: the storage responsibility is split out of the traditional broker, which turns the broker into a stateless service layer. Because the broker is stateless, brokers and storage can scale independently of each other, and recovery from a broker failure takes very little time, which greatly improves service availability. This layered architecture also makes Pulsar easy to deploy in container-orchestrated environments such as Kubernetes, so it can make the most of cloud-native infrastructure.
Figure 2: Apache Pulsar layered architecture
At the storage level, Pulsar uses Apache BookKeeper as its log storage system, which brings the storage granularity down from the traditional partition to the segment. Once storage is subdivided this way, a partition is no longer bound to particular physical nodes and becomes more of a logical concept: it is divided into fine-grained segments that are spread evenly across the whole cluster. This maximizes the flexibility of data placement and reduces the complexity of cluster expansion and fault recovery. In addition, the Tiered Storage feature can move historical data to cheaper storage (Ali Cloud OSS, AWS S3, etc.), significantly reducing the cost of storing historical data for an enterprise while preserving the performance of hot data.
Figure 3: Segment-based sharding for Apache Pulsar
Figure 4: Historical data relocation at Apache Pulsar Tiered Storage
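The effect of segment-level placement can be sketched with a small simulation. This is a toy round-robin policy written for illustration only; BookKeeper’s real placement policies are more involved, and the bookie names, the `place_segment` helper, and the replica count are all assumptions. The point is that segments of one partition spread evenly over storage nodes, and that a newly added node starts receiving new segments without moving any existing data.

```python
from collections import Counter


def place_segment(segment_id, bookies, replicas=2):
    """Pick `replicas` distinct storage nodes for one segment, round-robin (toy policy)."""
    if replicas > len(bookies):
        raise ValueError("need at least as many bookies as replicas")
    return [bookies[(segment_id + r) % len(bookies)] for r in range(replicas)]


# One partition, split into six segments over three storage nodes.
bookies = ["bookie-1", "bookie-2", "bookie-3"]
placement = {s: place_segment(s, bookies) for s in range(6)}
load = Counter(node for nodes in placement.values() for node in nodes)

# Cluster expansion: existing segments stay where they are,
# only segments created afterwards can land on the new node.
expanded = bookies + ["bookie-4"]
for s in range(6, 10):
    placement[s] = place_segment(s, expanded)
```

With partition-level storage, adding a node would typically mean copying whole partitions to rebalance; in this model the new node simply takes part in the placement of later segments.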
At the beginning of 2018, Zhaopin.com planned to build its own platform-level event center to achieve unified event management and storage. At the time, we used RabbitMQ for online message delivery and Kafka mainly for streaming, batch, and log processing scenarios. Running the two side by side, we encountered several pain points:
- The high maintenance cost of the two products
- Data consistency between the two systems
- Data storage that is fragmented and opaque
The need for a platform-level event center was urgent. After extensive technical research, we were drawn to Apache Pulsar’s layered abstraction, its storage design, and its multi-tenant, multi-subscription model. After studying the system in depth and talking with core members of the Apache Pulsar team, we concluded the technology selection, and Apache Pulsar became our first choice for building the platform-level event center.
The multi-tenant feature gives the platform-level event center a better event management solution, because it provides both resource isolation and permission control between users. We can maintain a single set of platform-level services for all of Zhaopin’s business lines, which greatly reduces operation and maintenance costs. Teams that want access apply for their own namespace on the event platform; the platform itself is transparent to them, and they do not have to worry about maintaining it.
Figure 5: Apache Pulsar multi-tenant Topic management
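Pulsar topic names have the form `persistent://tenant/namespace/topic` (or `non-persistent://...`), which is what makes per-namespace isolation straightforward. The sketch below parses such a name and checks it against the namespaces a team has been granted. The `parse_topic` and `can_access` helpers, as well as the tenant and namespace names, are hypothetical and only illustrate the idea; Pulsar’s actual authorization is enforced by the broker, not by client-side code like this.

```python
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str
    tenant: str
    namespace: str
    topic: str


def parse_topic(name: str) -> TopicName:
    """Split a fully qualified topic name into its four components."""
    domain, sep, rest = name.partition("://")
    if not sep or domain not in ("persistent", "non-persistent"):
        raise ValueError(f"unsupported topic name: {name!r}")
    parts = rest.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected tenant/namespace/topic in {name!r}")
    return TopicName(domain, *parts)


def can_access(granted_namespaces, topic_name):
    """Toy permission check: a team may only use topics in namespaces it was granted."""
    t = parse_topic(topic_name)
    return f"{t.tenant}/{t.namespace}" in granted_namespaces


# Hypothetical tenant and namespaces, for illustration only.
team_grants = {"example-tenant/orders"}
parsed = parse_topic("persistent://example-tenant/orders/order-created")
allowed = can_access(team_grants, "persistent://example-tenant/orders/order-created")
denied = can_access(team_grants, "persistent://example-tenant/billing/invoice-sent")
```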
The unification of Queue mode and Streaming mode supports the requirements of online business work queues, stream processing, and batch processing equally well. An event sender only needs to produce a piece of data once, and it can then be consumed by multiple business parties in different working modes. There is no need to worry about data consistency between systems, which significantly reduces both system overhead and data verification work.
Figure 6: Unification of Apache Pulsar Queue mode and Streaming mode
The Retention mechanism is a good match for the requirements of event replay. We can choose a retention policy for each event according to its importance or time value, and in practice it is convenient to bound a policy by both time and size.
Figure 7: Apache Pulsar’s handling of message retention and message expiration
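How a retention policy bounded by both time and size behaves can be sketched in a few lines. This is a simplified model written for illustration, not Pulsar’s implementation: the `apply_retention` helper and its parameters are assumptions, and it only considers messages that have already been acknowledged. A message is dropped once it is older than the time limit, and the oldest remaining messages are dropped until the total size fits the size limit.

```python
def apply_retention(messages, now, max_age_s, max_bytes):
    """Return the acknowledged messages that a time-and-size policy would keep.

    messages: list of (publish_time_s, size_bytes) tuples, oldest first.
    """
    # Time limit: discard anything older than max_age_s.
    kept = [m for m in messages if now - m[0] <= max_age_s]
    # Size limit: discard the oldest survivors until the total fits.
    total = sum(size for _, size in kept)
    while kept and total > max_bytes:
        total -= kept[0][1]
        kept.pop(0)
    return kept


acked = [(0, 100), (50, 100), (90, 100), (100, 100)]
# A high-value event type might get a long window; here a short one is used
# so that both limits take effect.
retained = apply_retention(acked, now=100, max_age_s=60, max_bytes=250)
```

In this example the message published at time 0 exceeds the time limit, and the one published at time 50 is dropped to satisfy the size limit, leaving the two most recent messages available for replay.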
Apache Pulsar’s enterprise-class geo-replication feature (replication across data centers) also gives the event center disaster recovery capability: we can store important events in multiple data centers so that the data survives the loss of one of them.
The Tiered Storage feature provides good support for cold storage of events. We can offload data that needs to be kept for a long time to secondary storage, such as Ali Cloud OSS or AWS S3, which greatly reduces the storage cost of cold data without affecting the performance of hot data. For hot data, we use two SSDs, one for the BookKeeper journal and one for the ledgers, which ensures good write and read performance for events.
On top of these features, Zhaopin.com makes events transparent to users across the whole platform by strictly controlling event definitions in combination with Pulsar’s Schema feature. In the event center, users can look up the events that already exist on the platform and how they are defined, and the platform imposes specific requirements on changes to events. This is also valuable for data products.
The platform-level event center provides solid foundational support for online business, stream computing, batch processing, and even artificial intelligence, and it is one of Zhaopin.com’s important platform-level projects of 2018. The project officially launched in August 2018, and business parties have been joining gradually since then. To date, the event center delivers 500 million events per day, and we expect this still-modest volume to grow to 2 billion events per day in November.
Zhaopin also continues to contribute new features to Apache Pulsar. Dead Letter Topic, Client Interceptors, and other features will be available to Pulsar users with the release of version 2.2.0, and we are planning to bring more features to the community, such as Delay Messages. We thank Streamlio for their support along the way.