This article compares Heron with popular streaming projects, including Storm, Flink, Spark Streaming, and Kafka Streams, and summarizes the key points of system selection. In addition, we walk through a Heron case study and discuss the new features Heron developed over the past year.

In the “Basics” installment this June, we gained a basic understanding of Heron’s design and operation by studying its basic concepts, overall architecture, and core components[1][2][3]. In this “Applications” installment, we compare Heron with other popular real-time stream processing systems (Apache Storm[4][5], Apache Flink[6], Apache Spark Streaming[7], and Apache Kafka Streams[8]). On this basis, we introduce how to select a system in practice. Then we share a simple application case, and we end with a look at Heron’s progress in 2017.

Comparison and selection of real-time stream processing systems

Currently popular real-time stream processing systems include Apache Storm, Apache Flink, Apache Spark Streaming, Apache Kafka Streams, and other projects under the Apache Foundation. Although they all belong, along with Heron, to the real-time stream processing category, each has its own characteristics.

Heron vs. Storm (including Trident)

Inside Twitter, Heron has replaced Storm as the standard system for stream processing.

Data model differences

Heron is compatible with Storm’s data model, or more precisely with Storm’s API, but the implementation behind it is completely different. Their application scenarios are therefore the same: Heron can be used wherever Storm is used, while offering better efficiency, more features, greater stability, and easier maintenance than Storm.
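To make this API compatibility concrete, below is a minimal sketch modeled on the classic exclamation example, written against Storm’s Java API (TestWordSpout ships with Storm’s testing utilities). Because Heron mirrors this API, the same code can be built against Heron’s Storm-compatibility jars instead; treat it as an illustrative sketch rather than the exact example shipped with either project.

import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ExclamationExample {
  // Bolt that appends "!!!" to every incoming word.
  public static class ExclamationBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
      collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
      collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("word", new TestWordSpout(), 2);
    builder.setBolt("exclaim", new ExclamationBolt(), 2).shuffleGrouping("word");
    // The same builder/submitter calls work unchanged on Heron.
    StormSubmitter.submitTopology("exclamation", new Config(), builder.createTopology());
  }
}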

Storm Trident is a project built on top of Storm that provides a high-level API similar to Heron’s functional API. Trident implements exactly-once semantics through checkpoints and rollback; Heron implements effectively-once semantics with the distributed snapshot algorithm invented by Chandy and Lamport.

Application architecture differences

Each Storm worker is a JVM process running multiple threads, and each thread executes multiple tasks. The logs of these tasks are mixed together, making it difficult to debug the performance of individual tasks. Storm’s Nimbus cannot isolate worker resources, so topologies interfere with one another’s resources. Storm also uses ZooKeeper to manage heartbeats, which can easily make ZooKeeper a bottleneck.

Each Heron task is a separate JVM process, which facilitates debugging and resource isolation while saving resources across the whole topology. In Heron, ZooKeeper stores only a small amount of data; heartbeats are managed by the TMaster process and put no pressure on ZooKeeper.

Heron vs. Flink

The Flink framework provides both batch and stream processing capabilities. Flink’s core adopts the stream processing model; its batch mode is implemented by simulating stream processing over blocks of data.

Data model differences

Flink’s API follows the declarative style. Heron provides both a declarative (functional) API and a lower-level compositional API, as well as APIs for Python[9] and C++[10].

Application architecture differences

In terms of operation, Flink can be configured in a variety of ways. In general, it runs multiple tasks as multiple threads within the same JVM, which makes debugging harder. Heron uses a single-task, single-JVM model, which facilitates debugging and resource allocation.

In terms of resource pools, both Flink and Heron can work with multiple resource schedulers, including Mesos/Aurora, YARN, Kubernetes, and more.

Heron vs. Spark Streaming

Spark Streaming processes tuples at micro-batch granularity: the tuples within a time window of half a second to several seconds are submitted to Spark as one micro-batch. Heron processes at the granularity of individual tuples. Because of the time window, the average response latency of Spark Streaming can be considered half the window length; Heron has no such limitation. Heron is therefore low-latency, while Spark Streaming is comparatively high-latency.

Spark recently announced a proposal to add, in the upcoming version 2.3, a new mode that does not use micro-batches for computation.

Data model differences

Semantically, both Spark Streaming and Heron implement exactly-once/effectively-once delivery, and both support stateful processing. Spark Streaming supports SQL, while Heron does not. Both support Java and Python interfaces. Note that Heron’s API is pluggable and supports programming languages beyond Java and Python, such as C++.

Application architecture differences

For task assignment, Spark Streaming uses a single thread per task, and one JVM process may run multiple such threads concurrently. Heron runs each task in a separate heron-instance process. This design eases debugging: when a task fails, only that task’s process needs to be inspected, avoiding interference between task threads sharing a process.

In terms of resource pools, both Spark Streaming and Heron can run on YARN and Mesos. Note that Heron’s resource scheduler is designed as a pluggable interface, which can connect to many resource managers such as Aurora. See [11] for the resource schedulers Heron supports.

Heron vs. Kafka Streams

Kafka Streams is a client library. Using it, applications can read message streams from Kafka for processing.

Data model differences

Kafka Streams is bound to Kafka: it obtains message streams by subscribing to topics, which is completely different from Heron’s DAG model. In DAG-style stream computation, the nodes of the DAG are controlled by the stream processing framework, and user computation logic must be submitted to the framework in DAG form. Kafka Streams has no such constraint, so the user’s computation logic is entirely controlled by the user program and need not follow the DAG model. Kafka Streams also supports back pressure and stateful processing.

Kafka Streams defines two abstractions: KStream and KTable. In a KStream, each key-value pair is an independent record. In a KTable, records with the same key are interpreted as a sequence of updates, with each new value replacing the previous one for that key.
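To make the two abstractions concrete, here is a hedged word-count sketch in Java using the Kafka Streams DSL (the topic names “words-in” and “counts-out” are hypothetical). It shows a KStream of independent records being folded into a KTable that keeps only the latest count per key:

import java.util.Arrays;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class KStreamVsKTable {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    // KStream: every record is an independent key-value fact.
    KStream<String, String> lines = builder.stream("words-in");

    // KTable: records with the same key are interpreted as updates,
    // so count() keeps only the latest count for each word.
    KTable<String, Long> counts = lines
        .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\s+")))
        .groupBy((key, word) -> word)
        .count();

    // Write the table's changelog back out to a topic. Actually running
    // this graph requires wrapping builder.build() in a KafkaStreams
    // instance configured with broker addresses.
    counts.toStream().to("counts-out", Produced.with(Serdes.String(), Serdes.Long()));
  }
}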

Application architecture differences

Kafka Streams is built entirely on Kafka and is very different from Heron and other stream processing systems. The computation logic of Kafka Streams is completely controlled by the user program, which means the stream computation logic does not run inside a Kafka cluster. Kafka Streams can be understood as a connector that reads and writes key-value sequences from a Kafka cluster; the resources required for computation, the task lifecycles, and so on are all managed by the user program. Heron, in contrast, can be understood as a platform: after the user submits a topology, Heron takes care of the rest.

System selection

Based on the comparisons above, we can summarize the following selection guidelines:

  • Storm is suitable for scenarios that require fast response and moderate traffic. Storm and Heron are API-compatible and essentially interchangeable in functionality. Twitter’s move from Storm to Heron indicates that, when choosing between the two, Heron is usually the better choice.

  • Kafka Streams is bound to Kafka. If your existing system is built on Kafka, consider using Kafka Streams to reduce overhead.

  • Spark Streaming is considered to have the highest throughput of these projects, but it also has the highest response latency. It can be used for systems that tolerate slow responses but require high throughput. Taken to the extreme, such a system could use batch Spark directly.

  • Flink uses a stream processing kernel and provides both stream and batch interfaces. Flink is suitable for projects that require both stream and batch processing. At the same time, because it must balance the trade-offs of both modes, it is harder for Flink to make targeted optimizations for either one alone.

Spark Streaming, Kafka Streams, and Flink each fit specific application scenarios; Heron can be used for the remaining general stream processing scenarios.

Heron case study

Let’s try to run a sample topology on an Ubuntu machine, which includes the following steps:

  • Install the Heron client, start a Heron example topology, and try other topology management commands.

  • Install the Heron toolkit, run Heron Tracker, and run Heron UI.

Run the topology

Find Heron’s releases page at github.com/twitter/her… and locate the latest version, 0.16.5. You can see that Heron provides installation files for several platforms, divided into several categories: client, tools, development kit (API), and so on.

Installing the Client

Download the client installation file heron-client-install-0.16.5-ubuntu.sh:

wget https://github.com/twitter/heron/releases/download/0.16.5/heron-client-install-0.16.5-ubuntu.sh

Then execute this file:

chmod +x heron-*.sh
./heron-client-install-0.16.5-ubuntu.sh --user

The --user argument installs the Heron client into the current user’s directory ~/.heron and creates links under ~/bin to the executables under ~/.heron/bin.

The Heron client is a command-line program named heron. The heron command can be accessed directly after running export PATH=~/bin:$PATH. Run the following command to check whether the heron command was installed successfully:

heron version

Run the example topology

Add a localhost entry to /etc/hosts. Heron uses /etc/hosts to resolve the local hostname in standalone mode.

The Heron client installation includes a jar of example topologies in the ~/.heron/examples directory. We can run one of these example topologies:

heron submit local ~/.heron/examples/heron-examples.jar \
  com.twitter.heron.examples.ExclamationTopology ExclamationTopology \
  --deploy-deactivated

The heron submit command submits a topology to Heron to run. The format of heron submit can be viewed with the heron help submit command.

When Heron runs in standalone local mode, it stores information such as running state and logs in the ~/.herondata directory. We can look at the directory of the example topology we just ran:

ls -al ~/.herondata/topologies/local/${USER_NAME}/ExclamationTopology

Topology lifecycle

The lifecycle of a topology consists of the following phases:

  • Submit: submits the topology to the heron-scheduler. The topology does not process tuples yet, but it is ready to be activated;
  • Activate/deactivate: tells the topology to start/stop processing tuples;
  • Restart: restarts the topology so that the resource manager can reallocate its containers;
  • Kill: removes the topology and releases its resources.

These phases are managed through the Heron command-line client. The command formats can be viewed with heron help.

Heron toolkit

The Heron project provides tools to conveniently view the state of topologies running in a data center. We can also try these tools in local mode. They mainly include:

  • Tracker: a server that provides a RESTful API for monitoring the runtime state of each topology.
  • UI: a website that displays topology state by invoking the Tracker’s RESTful API.

A single toolkit deployment in a data center can cover all topologies in that data center.

Installing the toolkit

In a similar way to installing the Heron client, find the installation file and install it:

wget https://github.com/twitter/heron/releases/download/0.16.5/heron-tools-install-0.16.5-ubuntu.sh
chmod +x heron-*.sh
./heron-tools-install-0.16.5-ubuntu.sh --user

The Tracker tool

Start the Tracker server: heron-tracker

Verify the server’s RESTful API: open http://localhost:8888 in a browser

The UI tool

Launch the UI website: heron-ui

Verify the UI website: open http://localhost:8889 in a browser

Heron’s new features

Since Twitter made Heron open source in the summer of 2016, the Heron community has developed many new features. In particular, in 2017 Heron added online dynamic scaling, effectively-once transmission semantics, a functional API, multiple programming language support, and self-regulation.

Online dynamic scaling

In Storm’s data model, the parallelism of a topology is fixed by its author at programming time. In many cases, however, the amount of data a topology must handle changes constantly, and it is difficult for the programmer to predict the right resource configuration in advance. Dynamically adjusting a topology’s resource configuration at runtime is therefore a necessary feature.

Intuitively, changing the parallelism of the nodes in a topology can quickly change its resource usage to cope with changing traffic. Heron implements this dynamic adjustment through the update command. The Heron command-line tool uses a packing algorithm to compute a new packing plan for the topology based on the new parallelism specified by the user, and then increases or decreases the number of containers through the resource pool scheduler. The packing plan is then sent to the TMaster to be merged into a new physical plan, so that all containers have a consistent view of the whole topology. Heron’s dynamic adjustment of parallelism has little impact on the running topology and takes effect quickly.

Effectively once transmission semantics

In addition to the original tuple delivery modes, at most once and at least once, Heron newly adds effectively once. The original modes have drawbacks: at most once can lose some tuples, while at least once can deliver duplicate tuples. Effectively once aims for accurate, reliable results when the computation is deterministic.

Effectively once can be summarized in two aspects:

  • Distributed state checkpoint;
  • Topology state rollback.

The TMaster periodically sends a marker tuple to the spouts. When a node in the topology has collected the marker tuples from all of its upstream nodes, it writes its state to state storage; this process is a checkpoint. When every node in the topology has checkpointed, the state storage holds a snapshot of the entire topology. If the topology encounters a failure, the snapshot can be read back from state storage to restore the topology’s state and resume processing.
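The following self-contained Java sketch illustrates the marker-collection step just described, in the spirit of the Chandy-Lamport algorithm. It is an illustration only, not Heron’s implementation, and every name in it is hypothetical.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CheckpointingNode {
  interface StateStorage { void write(long checkpointId, byte[] state); }
  interface Downstream { void sendMarker(long checkpointId); }

  private final Set<String> upstreams;       // ids of all upstream nodes
  private final Set<String> pendingMarkers;  // upstreams whose marker has not arrived yet
  private final List<Downstream> downstreams;
  private final StateStorage storage;

  CheckpointingNode(Set<String> upstreams, List<Downstream> downstreams, StateStorage storage) {
    this.upstreams = upstreams;
    this.pendingMarkers = new HashSet<>(upstreams);
    this.downstreams = downstreams;
    this.storage = storage;
  }

  // Called when a marker tuple arrives from one upstream node.
  void onMarker(String upstreamId, long checkpointId) {
    pendingMarkers.remove(upstreamId);
    if (pendingMarkers.isEmpty()) {
      // Markers collected from every upstream: persist local state ...
      storage.write(checkpointId, serializeState());
      // ... then forward the marker so downstream nodes checkpoint too.
      for (Downstream d : downstreams) {
        d.sendMarker(checkpointId);
      }
      pendingMarkers.addAll(upstreams); // reset for the next checkpoint round
    }
  }

  private byte[] serializeState() {
    return new byte[0]; // placeholder: a real node serializes its processing state
  }
}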

Functional API

Functional programming has been a hot topic in recent years, and Heron follows this trend by adding a functional API on top of its original API. Heron’s functional API lets topology programmers focus on the application logic of the topology rather than the specifics of topology/spout/bolt mechanics. It is a higher-level API than the original low-level API, and behind the scenes it is still translated into the low-level API to build the topology.

The Heron functional API is built on the concept of a streamlet. A streamlet is an infinite, ordered sequence of tuples. In the data model of the Heron functional API, data processing means transforming one streamlet into another, using common functional operations such as map, flatMap, join, filter, and window.
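As a sketch of this style, the Java example below is modeled on the documentation of Heron’s Streamlet API (Builder, Config, Runner and the newSource/filter/map/log operations come from that API; exact packages and signatures may differ across Heron versions):

import java.util.concurrent.ThreadLocalRandom;

import com.twitter.heron.streamlet.Builder;
import com.twitter.heron.streamlet.Config;
import com.twitter.heron.streamlet.Runner;

public class RandomIntTopology {
  public static void main(String[] args) {
    Builder builder = Builder.newBuilder();

    // Each transformation turns one streamlet into another, so the whole
    // topology is just a chain of streamlet transformations.
    builder.newSource(() -> ThreadLocalRandom.current().nextInt(100))
        .filter(i -> i > 50)  // keep only the larger values
        .map(i -> i * 2)      // double each remaining value
        .log();               // sink: log every tuple

    new Runner().run("random-int-topology", Config.defaultConfig(), builder);
  }
}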

Multiple programming language support

Whereas topology writers previously wrote topologies using the Storm-compatible Java API, Heron now provides Python and C++ APIs so that programmers familiar with those languages can write topologies as well. The Python and C++ APIs are designed similarly to the Java API: they contain the low-level API for constructing DAGs, and in the future they will also provide functional APIs so that topology developers can focus on business logic.

In terms of implementation, the Python and C++ APIs come with their own heron-instance implementations written in Python and C++. They do not piggyback on the Java heron-instance, which avoids cross-language translation overhead and improves efficiency.

Self-regulation

Heron developed the new Health Manager module based on the Dhalion framework, a framework that reads metrics and adjusts or repairs the topology accordingly. The Health Manager consists of two parts: the detector/diagnoser and the resolver. The detector/diagnoser reads metrics to assess the topology state and detect anomalies, and the resolver takes action to restore the topology to normal. The Health Manager module gives Heron a complete feedback loop.

Two common scenarios are as follows: 1. The detector monitors back pressure and the queue lengths inside the stream managers (stmgr) to find containers that are running slowly; the resolver then tells the heron-scheduler to reschedule those nodes onto other hosts. 2. The detector monitors the state of all nodes to determine whether the topology is under resource pressure globally; if so, the resolver calculates the additional resources required and asks the scheduler to allocate them.
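As a conceptual illustration of this feedback loop, the structure looks roughly like the Java sketch below. The interface and method names here are illustrative only, not the actual Dhalion or Health Manager API.

import java.util.List;

interface MetricsSource {
  // e.g., back-pressure time or stmgr queue length for one instance
  double read(String metricName, String instanceId);
}

interface Symptom {
  String description();
}

interface Detector {
  // Read metrics, assess topology state, and report any anomalies found.
  List<Symptom> detect(MetricsSource metrics);
}

interface Resolver {
  // Take corrective action, e.g., ask the scheduler to move a container
  // to another host or to allocate more resources.
  void resolve(List<Symptom> symptoms);
}

final class HealthManagerLoop {
  private final Detector detector;
  private final Resolver resolver;
  private final MetricsSource metrics;

  HealthManagerLoop(Detector detector, Resolver resolver, MetricsSource metrics) {
    this.detector = detector;
    this.resolver = resolver;
    this.metrics = metrics;
  }

  // One iteration of the feedback loop; run periodically.
  void runOnce() {
    List<Symptom> symptoms = detector.detect(metrics);
    if (!symptoms.isEmpty()) {
      resolver.resolve(symptoms);
    }
  }
}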

Conclusion

In this article, we compared Heron with common stream processing projects, including Storm, Flink, Spark Streaming, and Kafka Streams, and summarized the key points of system selection. We also walked through a simple Heron case study. Finally, we discussed the new features Heron developed over the past year.

Finally, we hope this article provides some useful experience with Heron applications, and we welcome your suggestions and help. If you are interested in developing and improving Heron, check out the Heron website (heronstreaming.io) and the code (github.com/twitter/her…).

References

[1] Maosong Fu, Ashvin Agrawal, Avrilia Floratou, Bill Graham, Andrew Jorgensen, Mark Li, Neng Lu, Karthik Ramasamy, Sriram Rao, and Cong Wang. “Twitter Heron: Towards Extensible Streaming Engines.” In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 1165-1172. IEEE, 2017.
[2] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. “Twitter Heron: Stream Processing at Scale.” In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 239-250. ACM, 2015.
[3] Maosong Fu, Sailesh Mittal, Vikas Kedigehalli, Karthik Ramasamy, Michael Barry, Andrew Jorgensen, Christopher Kellogg, Neng Lu, and Bill Graham. “Streaming@Twitter.” IEEE Data Eng. Bull. 38, no. 4 (2015): 15-27.
[4] storm.apache.org/
[5] storm.apache.org/releases/cu…
[6] flink.apache.org/
[7] spark.apache.org/streaming/
[8] kafka.apache.org/documentati…
[9] twitter.github.io/heron/api/p…
[10] github.com/twitter/her…
[11] github.com/twitter/her…

Huijun Wu is a software engineer at Twitter, working on the development of the Heron real-time streaming engine. He graduated from Arizona State University, specializing in big data processing and mobile cloud computing. He has published several academic papers in top international journals and conferences and holds a number of patents.

Neng Lu is a member of Twitter’s real-time compute platform team. He focuses on distributed systems, participated in the development of Twitter’s Manhattan key-value storage system and the Obs monitoring and alerting system, and is currently responsible for Heron development. He has published many academic papers in top international journals and conferences.

Maosong Fu is the head of Twitter’s real-time compute platform team, responsible for Heron, Presto, and other services, and is one of the original authors of Heron. He has published several papers in SIGMOD, ICDE, and other conferences and journals, focusing on distributed systems. He received his bachelor’s degree from Huazhong University of Science and Technology and a graduate degree from Carnegie Mellon University.

This is an original article from Programmer magazine and may not be reproduced without permission.

