About Apache Pulsar

Apache Pulsar is a top-level project of the Apache Software Foundation. It is a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight function computing. Its architecture separates compute from storage, and it supports multi-tenancy, persistent storage, and cross-region data replication across multiple data centers. It offers strong consistency, high throughput, low latency, and high scalability. GitHub address: github.com/apache/puls…

About ApacheCon Asia

ApacheCon Asia is the first ApacheCon online conference organized by the ApacheCon Organizing Committee for the Asia-Pacific region. Its main goal is to better serve the rapidly growing number of Apache users and contributors in Asia. ApacheCon Asia 2021 will be held online from August 6 to 8, 2021.

The ApacheCon Asia 2021 team has officially announced the conference agenda. The Apache Pulsar community is taking an active part in this annual open source event: Pulsar community members are speaking in the messaging system, big data, and stream processing tracks, among others. Their talks are listed below for your reference.

Big data

Topic introduction: Learn how to use HashiCorp Vault to build an authentication and authorization system for Apache Pulsar. Vault provides a secure way to generate tokens and store sensitive data, while Pulsar has a pluggable architecture for authentication, authorization, and key management. This talk will introduce how to build an authentication and authorization system for Pulsar based on Vault, including the following points:

  • How to build flexible Vault-based authentication so that Pulsar clusters can easily integrate with systems such as LDAP
  • How to implement Vault-based AppRole (application role) service accounts

Speaker: Guangning O is an Apache Pulsar committer who works on Apache Pulsar IO and Apache Pulsar Manager. He is currently a senior software engineer at StreamNative and specializes in cloud platforms, cloud computing, and big data.

Stream processing

Topic introduction (structured data streams): Type safety is extremely important in any application built around streams or queues. Type definitions and their evolution can be built into the application or supported by the data layer, which lets the application focus only on business logic and not on how the data is stored and evolved. It is this feature that has allowed traditional relational databases to survive the challenge from modern NoSQL databases. Asynchronous communication (via streams or queues) is essential in modern software architectures. Even as data storage and query designs change to accommodate asynchronous communication, type safety remains important.

In this talk, we will discuss ways to define a schema on streaming data, using Apache Pulsar as an example. Apache Pulsar provides server-side and client-side support for structured stream processing. We have been using Pulsar in production for asynchronous communication between microservices for over 1.5 years.

This talk covers what a schema is, how to represent one, what the Apache Pulsar server and client provide, how we built our use cases on Pulsar’s schema support, and the lessons we learned, along with the technical details.
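To make the idea of schema validation and evolution concrete, here is a minimal sketch in plain Python. It is not Pulsar's schema API: the `Field` and `Schema` classes and the `is_backward_compatible` helper are hypothetical names invented for illustration, and the compatibility rule shown (a new schema may only add optional fields and must keep existing types) is one simple assumption about what "safe evolution" can mean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One typed field of a record (hypothetical helper, not a Pulsar class)."""
    name: str
    py_type: type
    required: bool = True
    default: object = None


@dataclass(frozen=True)
class Schema:
    """A versioned list of fields (hypothetical helper, not a Pulsar class)."""
    version: int
    fields: tuple

    def validate(self, record: dict) -> dict:
        """Return a copy of the record with defaults filled in.

        Raises KeyError for a missing required field and TypeError for a
        value of the wrong type.
        """
        out = {}
        for f in self.fields:
            if f.name in record:
                value = record[f.name]
                if not isinstance(value, f.py_type):
                    raise TypeError(f"{f.name}: expected {f.py_type.__name__}")
                out[f.name] = value
            elif f.required:
                raise KeyError(f.name)
            else:
                out[f.name] = f.default
        return out


def is_backward_compatible(old: Schema, new: Schema) -> bool:
    """Check that a reader using `new` can read data written with `old`.

    Assumed rule: existing fields keep their type and do not become
    required, and any added field must be optional.
    """
    old_fields = {f.name: f for f in old.fields}
    for f in new.fields:
        prev = old_fields.get(f.name)
        if prev is None:
            if f.required:
                return False
        elif prev.py_type is not f.py_type or (f.required and not prev.required):
            return False
    return True


# Example: version 2 adds an optional field, so old records stay readable.
order_v1 = Schema(1, (Field("order_id", str), Field("amount", float)))
order_v2 = Schema(
    2, order_v1.fields + (Field("currency", str, required=False, default="USD"),)
)
print(is_backward_compatible(order_v1, order_v2))  # True
print(order_v2.validate({"order_id": "a1", "amount": 9.5}))
```

When the data layer enforces checks like these, producers and consumers can evolve independently, which is the benefit the abstract attributes to schema support.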

Speaker: Shiv is a senior software developer at Nutanix, working on the Beam team to help Nutanix customers minimize the cloud costs and security risks associated with hybrid cloud use. Shiv likes to spend time on data storage (databases, data streams, analytics, etc.) and has contributed to the MySQL and Pulsar code bases. Shiv is an avid reader (technology, fiction, economics, etc.) and is always looking for ways to simplify software architecture.

Topic introduction: In this talk, I introduce a technique for deploying machine learning models to provide real-time predictions using Apache Pulsar Functions. To provide real-time predictions, the model typically receives a data point from the caller and is expected to return an accurate prediction within milliseconds. Throughout the talk, I’ll show you the steps required to deploy a fully trained ML model that can predict delivery times based on real-time traffic information, the customer’s location, and the restaurant where the order will be filled.

Speaker: David Kjerrumgaard, author of Pulsar in Action, is the lead software engineer on Splunk’s messaging team, responsible for Splunk’s internal Pulsar-as-a-Service platform. Prior to Splunk, he was director of solution architecture at Streamlio, where he was responsible for developing best practices and solutions based on Apache Pulsar.

Messaging system

Topic introduction: To get the best performance out of a streaming backend, it is important to understand the ins and outs of how streaming servers store data. With that understanding, you can design scenarios that make the best use of the resources at hand and achieve the best consistency, availability, latency, and throughput.

In this talk, we will explore Apache Pulsar’s storage layer (Apache BookKeeper): the basics of BookKeeper’s storage semantics and how it can be used in different scenarios, even outside of Pulsar. We will also cover Pulsar’s storage object model, the data structures and algorithms Pulsar uses in it, and how they map to the storage semantics that Pulsar provides by default. With some extra code, you can also change the storage backend. The talk gives you the background to process data properly with Pulsar, and because it focuses on the storage backend, its principles also apply to other data storage and streaming systems.

Speaker: Shivji Kumar Jha is a senior software developer at Nutanix, working on the Beam team to help Nutanix customers minimize the cloud costs and security risks of hybrid cloud use. Shiv is responsible for all Pulsar work at Nutanix, including managing four Pulsar clusters (30 nodes) and the use cases built around them. Shiv likes to spend time on data storage (databases, data streams, analytics, etc.) and has contributed to the MySQL and Pulsar code bases. Shiv is an avid reader (technology, fiction, economics, etc.) and is always looking for ways to simplify software architecture.

BIGO’s Best Practices for Apache Pulsar

Topic introduction: Powered by AI technology, BIGO’s video products and services, such as BIGO Live and Likee, have become hugely popular around the world. BIGO Live is available in more than 150 countries and regions, and Likee has more than 100 million users and is popular with Gen Z. Over the past few years, we have deployed a number of Kafka clusters to support real-time ETL and short video recommendations. Apache Pulsar’s layered architecture and features such as low latency, horizontal scaling, and multi-tenancy help us solve many of the problems we face in production. We have used Apache Pulsar to build message processing systems, particularly for real-time ETL, short video recommendations, and real-time data reporting.

In this talk, I will share our experience with KoP (Kafka-on-Pulsar) and explore how to migrate seamlessly from Kafka to Pulsar, especially with regard to improving performance and stability. I’ll also share other major application scenarios for Apache Pulsar at BIGO, such as handling millions of topics, real-time machine learning, and integration with Flink and Flink SQL.

Speaker: Hang Chen is an Apache Pulsar committer and the leader of the BIGO messaging platform team, which is responsible for building a centralized pub-sub messaging platform that serves a large volume of service and application traffic. He brought Apache Pulsar to the BIGO messaging platform and integrated it with upstream and downstream systems such as Flink, ClickHouse, and other internal systems for real-time recommendation and analysis. He focuses on Pulsar performance tuning, new feature development, and integration with the Pulsar ecosystem.

From Apache Kafka to Apache Pulsar: A Survival Guide for Migrating Systems

Topic introduction: In this talk, after a brief, high-level architectural comparison of Kafka and Pulsar, we’ll focus on comparing the messaging and usage models of Kafka and Pulsar, their similarities and differences, and what these imply for application design and implementation. Finally, we’ll look at the migration options, patterns, and tools available to achieve a seamless application migration path from Kafka to Pulsar.

Speaker: Meng Yabin is a lead architect at DataStax. In recent years, he has focused primarily on solution design and consulting for large distributed database and stream processing systems. Prior to joining DataStax, he spent most of his career on system design, implementation, and consulting in the areas of relational databases, data warehousing, business intelligence, and NoSQL databases.

Topic introduction: Federated learning (FL) is a machine learning technique that enables multiple decentralized organizations to train a model together without exposing their local data samples. During federated training, participants exchange a large amount of encrypted information, which is aggregated to form a global model. Because messaging is so central to this process, and because messages must be delivered in real time and in order, transport poses some challenges. In this session, we will explore how to address these challenges with Apache Pulsar and detail how the popular federated learning project FATE (github.com/FederatedAI…) uses Pulsar for federated training.

Speaker: Chen Jiahao is an engineer at VMware.

Topic introduction: ELK plus Apache Kafka is a common architecture for logging scenarios. Today, however, things are changing as cloud native becomes popular and microservice architectures are adopted everywhere, which leads to more services, more logs, and more log categories. Apache Kafka cannot meet all the requirements of cloud-native logging, such as simple operation, management of millions of topics, and tenant resource isolation. With its cloud-native architecture and better performance, Apache Pulsar is a better fit. This presentation introduces Apache Pulsar as a new log messaging solution and covers the requirements for log messaging systems, a comparison of the Kafka and Pulsar solutions, Pulsar best practices, and an introduction to Pulsar Functions and connectors.

Speaker: Bin Wei is a solutions engineer at StreamNative with extensive experience in ELK, Apache Kafka, Apache Pulsar, Prometheus, and other big data technologies.

Topic introduction: Apache Pulsar is widely used at Tencent Cloud. Message queuing in a cloud-native environment faces many challenges, and Pulsar addresses them well. In this presentation, we will share some practical experience of running Pulsar in a cloud-native environment, such as how to expand capacity rapidly and dynamically, how to improve cluster resource utilization, and how to lay out clusters.

Speaker: Lin Lin is a senior engineer at Tencent Cloud and an Apache Pulsar committer. He focuses on middleware and has rich experience in message queues, microservices, and related areas. He joined Tencent in 2019 and is now responsible for building Tencent Cloud TDMQ, committed to creating stable, efficient, and scalable underlying components and services.

Topic introduction: Apache Pulsar, a next-generation cloud-native distributed messaging and streaming platform, integrates messaging, storage, and function computing, and adopts an architecture that separates storage from compute. Apache Pulsar has successfully supported business scenarios with large data volumes and heavy traffic at Tencent Cloud. This talk shares Tencent Cloud’s best practices and operations experience with Apache Pulsar at the scale of millions of topics.

Speaker: Ran Xiaolong joined Tencent in 2020 and is now responsible for building Tencent Cloud TDMQ, committed to creating stable, efficient, and scalable underlying components and services.

Topic introduction: RBAC (role-based access control) is a method of controlling system access based on the roles of individual users. RBAC uses the mapping between users and roles, together with the permissions of each role, to determine whether a user can perform an operation on a given resource. Apache Pulsar can use Casbin to implement RBAC authorization. With RBAC authorization enabled, you can manage which roles a user belongs to and what permissions each role has on a resource. This presentation focuses on RBAC authorization in Apache Pulsar. I will explain the basic RBAC concepts and the principles of Casbin, how to use the Casbin provider to enable RBAC authorization for Pulsar, how to use RBAC to set and manage permissions in Pulsar, and how to use the ZooKeeper adapter for RBAC in Pulsar.
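The user-to-role and role-to-permission mappings described above can be sketched in a few lines of plain Python. This is an illustration of the RBAC idea only; it does not use Casbin's API or Pulsar's authorization provider interface. The `RbacPolicy` class, the user and role names, and the topic name are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RbacPolicy:
    """In-memory RBAC policy (hypothetical sketch, not Casbin or Pulsar code)."""

    # user -> set of role names
    user_roles: dict[str, set[str]] = field(default_factory=dict)
    # role -> set of (resource, action) pairs the role may perform
    role_permissions: dict[str, set[tuple[str, str]]] = field(default_factory=dict)

    def assign_role(self, user: str, role: str) -> None:
        """Record that the user belongs to the role."""
        self.user_roles.setdefault(user, set()).add(role)

    def grant(self, role: str, resource: str, action: str) -> None:
        """Allow the role to perform the action on the resource."""
        self.role_permissions.setdefault(role, set()).add((resource, action))

    def is_allowed(self, user: str, resource: str, action: str) -> bool:
        """A user is allowed if any of their roles holds the permission."""
        return any(
            (resource, action) in self.role_permissions.get(role, set())
            for role in self.user_roles.get(user, set())
        )


# Example with made-up names: one role that may only produce to one topic.
policy = RbacPolicy()
topic = "persistent://tenant-a/ns1/orders"
policy.grant("order-producer", topic, "produce")
policy.assign_role("alice", "order-producer")

print(policy.is_allowed("alice", topic, "produce"))  # True
print(policy.is_allowed("alice", topic, "consume"))  # False
print(policy.is_allowed("bob", topic, "produce"))    # False
```

Because permissions attach to roles rather than to users, changing what a group of users may do means editing one role instead of every user entry.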

Speaker: Yang Zikue is a software engineer at StreamNative who has been involved with the Pulsar community since 2020.

Topic introduction: Huawei Cloud IoT Platform is one of the leading IoT platforms in China and currently manages more than 300 million devices. This talk will cover why Huawei Cloud IoT moved its message queue from Kafka to Pulsar, how Pulsar is used in Huawei Cloud IoT, and the problems encountered along the way and their solutions.

Speaker: He Zhangjian graduated from Xidian University in 2017 and has worked in Huawei’s IoT department since then.

Sign up for ApacheCon Asia 2021

ApacheCon Asia 2021 is now open for registration. To register, visit hdxu.cn/Q7LkI.