Yi Weiping (Ele.me)

Ji Ping (Alibaba Real-time Computing Department)

This article presents the real-time computing work of Ele.me's big data platform and the evolution of its computing engines, so that you can understand the strengths and weaknesses of Storm, Spark, and Flink. How do you choose a suitable real-time computing engine? What advantages made Flink Ele.me's first choice? This article walks through these questions one by one.

Current state of the platform

The following is the current architecture diagram of the Ele.me platform:

Currently, there are about 100 Storm tasks and 50 Spark tasks, while the number of Flink tasks is still relatively small.

At present, the cluster handles about 60 TB of data and roughly 10 billion computations per day, across 400 nodes. Both Spark and Flink run on YARN; Flink on YARN mainly provides JobManager-level isolation between jobs, while Storm runs in standalone mode.

Application scenarios

1. Consistency semantics

Before we cover our application scenarios, let's highlight an important concept in real-time computing: consistency semantics.

  1. At-most-once: fire and forget. We usually write a Java application that does not manage offsets at the source and does not care whether the downstream is idempotent; data simply comes in and is processed once at most, and there is no ACK mechanism for the intermediate state or for the state of the written data.

  2. At-least-once: resends data to ensure that each piece of data is processed at least once.

  3. Exactly-once: coarser-grained checkpoint control is used to implement exactly-once. When we say exactly-once, we mostly mean exactly-once within the computing engine, i.e., whether the internal state of each operator can be replayed and, if the last job failed, whether it can be recovered from the last state. It does not involve the idempotency of output to the sink.

  4. At-least-once + idempotent = exactly-once: if you use a store such as Elasticsearch or Cassandra, you can get upsert semantics on the primary key. Delivery is only at-least-once, but because the write is idempotent the end result is effectively exactly-once.
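As a rough illustration of that last point, here is a minimal sketch (Scala with plain JDBC; the order_counts table, its columns, and the JDBC URL are hypothetical) of an idempotent upsert keyed on a primary key: replaying the same record under at-least-once delivery simply overwrites the previous value, so the result behaves like exactly-once.

```scala
// Minimal sketch of an idempotent sink: the primary key makes replays harmless.
// Table, columns and JDBC URL are hypothetical.
import java.sql.DriverManager

def upsertOrderCount(jdbcUrl: String, orderId: String, cnt: Long): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    val stmt = conn.prepareStatement(
      "INSERT INTO order_counts (order_id, cnt) VALUES (?, ?) " +
      "ON DUPLICATE KEY UPDATE cnt = VALUES(cnt)") // MySQL-style upsert on the primary key
    stmt.setString(1, orderId)
    stmt.setLong(2, cnt)
    stmt.executeUpdate()
  } finally {
    conn.close()
  }
}
```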

2. Storm

Ele.me used Storm in its early days: Storm was adopted in 2016, and Spark Streaming and Structured Streaming were introduced in 2017. Storm was used earlier and has the following characteristics:

  1. The data model is tuple-based.

  2. Millisecond-level latency.

  3. It mainly supports Java, but now also supports Python and Go with Apache Beam.

  4. SQL functionality is still incomplete. We built an internal framework called TYPHON; users only need to extend some of our interfaces to use many of the main functions. Flux is a further improvement for Storm: a Storm task can be described with just a YAML file. This meets some requirements to an extent, but it still requires the user to be an engineer who can write Java, so data analysts cannot use it.

2.1 Summary

  1. Ease of use: its adoption is limited by the high barrier to entry.

  2. State backend: it requires additional external storage for state, such as KV stores like Redis.

  3. Resource allocation: workers and slots are set in advance. In addition, engine throughput is relatively low because there are fewer optimization points.

3. Spark Streaming

One day a business team came to us with a requirement: could they write a piece of SQL and publish a real-time computing task within a few minutes? So we started working on Spark Streaming. Its main concepts are as follows:

  1. Micro-batch: A window is set in advance and data is processed in the window.

  2. Latency is on the order of seconds, at best around 500 ms.

  3. The development languages are Java and Scala.

  4. Streaming SQL: this is mainly our own work; we hope to provide a Streaming SQL platform.

Features:

  1. Spark ecosystem and SparkSQL: this is one of Spark's advantages. The technology stack is unified, and the SQL, graph computing, and machine learning packages can be used interchangeably. Because Spark did batch processing first, unlike Flink, its real-time and offline APIs are naturally unified.

  2. Checkpoint on HDFS.

  3. On YARN: Spark belongs to the Hadoop ecosystem and integrates well with YARN.

  4. High throughput: Because it is micro-batch, throughput is also relatively high.

Here is a rough look at the operation page and the steps our platform users go through to quickly publish a real-time task. Rather than having users write DDL and DML statements by hand, we are talking about how the UI presents them as pages.

In the middle, the user describes the pipeline: the input is one or more Kafka topics, and the output is a selected output table. The SQL above registers the consumed Kafka DStream as a table, and the user then writes a chain of pipeline SQL. Finally, we wrap a number of external sinks for the user (as mentioned, all kinds of storage are supported; as long as the store implements upsert semantics, we can support it).

3.1 Multi-stream Join

In Spark 1.5, by referring to spark-streamingsql, an open-source project, you could register a DStream as a table and then join tables. However, the project only supports older versions (around 1.5) and was deprecated once Structured Streaming arrived in Spark 2.0. We have a somewhat tricky workaround:

3.2 Exactly-once

There is one point to note about exactly-once:

We must require that data be written to the external sink before the offset is committed; whether the offset goes to ZooKeeper or MySQL, it is best to commit it within the same transaction. On the driver side, the Kafka RDD must be generated from the offsets stored after the output was written to the external store, and the executors then consume data according to those Kafka partition offsets. Only when these conditions are met can end-to-end exactly-once be achieved; this is the big precondition.
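A hedged sketch of this pattern with the Kafka 0.10 direct stream is shown below; stream, parse, DB.withTransaction, saveResults, and saveOffsets are assumed helpers rather than library APIs.

```scala
// Sketch only: write the batch output and the consumed offsets in one transaction,
// so a replayed batch overwrites instead of duplicating.
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Offset ranges are read on the driver from the underlying KafkaRDD
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  val results = rdd.map(parse).reduceByKey(_ + _).collect()

  // Hypothetical helper: a single DB transaction covering both writes
  DB.withTransaction { conn =>
    saveResults(conn, results)      // idempotent upsert of the batch output
    saveOffsets(conn, offsetRanges) // offsets committed only after the output is written
  }
}
```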

3.3 Summary

  1. Stateful processing SQL (<2.x: mapWithState, updateStateByKey): stateful computation can be done with these APIs, but you still have to save the state to HDFS or an external store yourself, which makes it harder to implement.

  2. Real multi-stream Join: There is no way to implement the semantics of true multi-stream joins.

  3. End-to-end exactly-once semantics: these are cumbersome to implement, requiring the offset to be committed manually in a transaction after the sink writes to external storage.

4. Structured Streaming

We investigated and then adopted Spark 2.x, which does incremental computation with state. The following image is from the official website:

The architecture of Structured Streaming is shown below:

4.1 Features

  1. Stateful processing SQL & DSL: supports stateful stream computation.

  2. Real multi-stream join: Spark 2.3 allows you to join multiple streams, and multi-stream joins work similarly to Flink's. You define the join condition between the two streams (mainly a time-based condition) and bound how long data is buffered by a field in the schema (usually the event time), so that a true stream-stream join can be implemented (see the sketch below).

  3. End-to-end exactly-once semantics are relatively easy to achieve: you only need to extend the sink interface to support idempotent operations.

In particular, the Structured Streaming API differs from the original Streaming API in that the schema must be specified when the DataFrame/table is created, i.e., you need to declare the schema in advance. Watermarks were not supported in SQL, so we added an extension that lets the whole job be written in SQL and converted from left to right. We wanted the platform to be usable not only by programmers, but also by data analysts who cannot write programs.
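Below is a rough sketch (Scala, Spark 2.3+) of what such a job looks like underneath the SQL: Kafka input parsed with an explicitly declared schema, watermarks on both streams, and a stream-stream join bounded by a time condition. The topics, fields, and the 30-minute bound are illustrative only.

```scala
// Sketch: explicit schema + watermark + time-bounded stream-stream join (Spark 2.3+).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("stream-join-sketch").getOrCreate()
import spark.implicits._

val schema = new StructType()
  .add("order_id", StringType)
  .add("event_time", TimestampType)

def readTopic(topic: String) =
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", topic)
    .load()
    .select(from_json($"value".cast("string"), schema).as("v"))
    .select("v.*")

val orders = readTopic("orders").withWatermark("event_time", "10 minutes")

val payments = readTopic("payments")
  .withColumnRenamed("order_id", "pay_order_id")
  .withColumnRenamed("event_time", "pay_time")
  .withWatermark("pay_time", "10 minutes")

// Join on the key, buffering only payments that arrive within 30 minutes of the order
val joined = orders.join(
  payments,
  expr("order_id = pay_order_id AND " +
       "pay_time BETWEEN event_time AND event_time + interval 30 minutes"))
```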

4.2 Summary

  1. Trigger (processing time, continuous): before 2.3, triggering is mainly based on processing time; after each batch of data is processed, computation of the next batch is triggered immediately. Spark 2.3 introduced a record-by-record continuous processing trigger.

  2. Continuous processing (only map-like operations): currently it only supports map-like operations, and SQL support is limited.

  3. End-to-end guarantees are weak and require some additional extensions of your own. We found that Kafka 0.11 provides transactional functionality; based on this, end-to-end exactly-once from source through engine to sink can be truly realized.

  4. CEP (Drools): one business required complex event processing (CEP) capabilities, which our SQL syntax does not support directly. We let users use the Drools rules engine, running it on each executor and relying on its functionality to implement CEP.

Given the above characteristics and shortcomings of Spark Structured Streaming, we considered using Flink to do these things instead.

5. Flink

Flink's goal is to benchmark against Spark: its streaming part is more advanced and its ambitions are bigger, with graph computation, machine learning, and so on, and it also supports running on YARN, Tez, etc. Official support from the Flink community is good, and relatively speaking it supports more of the storage systems used in the community.

Flink architecture diagram:

In Flink, the JobManager plays the role of the Spark Driver and the TaskManager plays the role of the Executor; tasks in Flink are similar to tasks in Spark. However, Flink's RPC uses Akka, Flink Core has its own memory serialization framework, and tasks do not have to wait for each other stage by stage as Spark tasks do; instead, data is sent downstream as soon as it is processed.

Flink binary data operations

For serialization, Spark users generally use Kryo or Java's default serialization, and the Tungsten project has further optimized Spark at the JVM level with code generation. In contrast, Flink implements its own memory-based serialization framework, which maintains the concepts of keys and pointers. Keys are stored contiguously, which is CPU-friendly, so the probability of a cache miss is very low. When comparing and sorting, the actual data does not need to be compared first: keys are compared first, and only when they are equal is the data deserialized from memory and the actual values compared.

Flink Task Chain:

The operator chain is a nice concept for tasks: if the data distribution between upstream and downstream operators does not need to be reshuffled, for example the Kafka source in the figure followed by a map that does simple data filtering, we can put them into one thread, which reduces the cost of thread context switches.
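For reference, chaining is on by default and can be controlled explicitly; a small sketch (Scala API, with a placeholder socket source):

```scala
// Sketch: operator chaining controls in the DataStream API.
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// env.disableOperatorChaining()          // switch chaining off for the whole job

val cleaned = env
  .socketTextStream("localhost", 9999)    // placeholder source
  .filter(_.nonEmpty)                     // lightweight filter, a good chaining candidate
  .map(_.trim).startNewChain()            // force this map to begin a new chain
  .map(_.toUpperCase).disableChaining()   // or keep an operator out of any chain
```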

Parallelism concept

For example, if there are five tasks, several concurrent threads will run them; if the chained operators are put into one thread, data transmission performance can be improved. Spark's parallelism is a black box in this respect: you cannot set the parallelism for each operator. Flink can set the parallelism per operator, which makes jobs more flexible and resources more efficiently utilized.

In Spark, parallelism is usually adjusted with spark.default.parallelism; where there is a shuffle, it is generally tuned through spark.sql.shuffle.partitions, and for real-time computing it should actually be set a bit smaller. For example, in production we keep the number of partitions roughly the same as the number of Kafka partitions, while for batch processing it is set to 1000. In the left figure we set the parallelism to 2 and the maximum parallelism to 10: the first two run concurrently, and with the key-group concept there are at most 10 groups, so the data can be spread out as much as possible.
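For comparison, here is a minimal Flink sketch (Scala API) of per-operator parallelism; the socket source and word-count logic are placeholders.

```scala
// Sketch: each operator can get its own parallelism, unlike Spark's global setting.
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(2) // default parallelism for the whole job

val counts = env
  .socketTextStream("localhost", 9999)      // placeholder source
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1)).setParallelism(2) // cheap map stays at 2
  .keyBy(_._1)
  .sum(1).setParallelism(10)                // heavier keyed aggregation gets 10

counts.print()
env.execute("per-operator-parallelism-sketch")
```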

State & Checkpoint

Flink processes data record by record, so each record is sent downstream immediately after it is processed, unlike in Spark, where downstream work waits until all tasks of the stage the operator belongs to have completed.

Flink has a coarse-grained checkpoint mechanism that assigns each element to a snapshot at very low cost. Computation is triggered only when all data belonging to that snapshot has arrived, after which the buffered data is sent out. Currently Flink SQL does not provide an interface to control the buffer timeout, that is, how long data sits in the buffer before being sent. When constructing the Flink environment, you can specify a buffer timeout of 0 so that data is sent immediately after processing, without waiting for a threshold to be reached.
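For reference, the buffer timeout is set when building the execution environment (a sketch using the DataStream API rather than Flink SQL):

```scala
// Sketch: 0 flushes every record immediately (lowest latency);
// a positive value buffers for up to that many milliseconds (higher throughput).
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setBufferTimeout(0)
// env.setBufferTimeout(100) // alternative: trade up to 100 ms of latency for throughput
```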

The state backend is maintained in the JobManager's memory by default, but in practice state is mostly written to HDFS. The state of each operator is written to RocksDB and then asynchronously and incrementally synchronized to external storage.
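A sketch of that setup, with a RocksDB state backend and incremental checkpoints to HDFS (the path and interval are illustrative):

```scala
// Sketch: keep operator state in RocksDB and checkpoint incrementally to HDFS.
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(60000) // checkpoint every 60 s
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true)) // true = incremental
```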

Fault tolerance

The red node on the left side of the figure fails over. With at-least-once, the data upstream of that node is simply retransmitted once. With exactly-once, however, every compute node needs to be replayed from the state at the time of the last failure.

Exactly Once Two-Phase Commit

Since Flink 1.4, a two-phase commit has been used to support exactly-once. The idea is that after data is consumed from upstream Kafka, each step records its state as a kind of vote, with the checkpoint barrier acting as the token flowing through the job. Only when the data has finally been written to Kafka (versions after 0.11) does the coordinator in the JobManager notify each step to solidify its state, thereby achieving exactly-once.
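As a sketch, the sink side of this looks roughly like the following: Flink's Kafka 0.11 producer created with EXACTLY_ONCE semantics joins the checkpoint-driven two-phase commit (topic and serialization are illustrative).

```scala
// Sketch: a transactional Kafka 0.11 sink whose transactions are committed
// when the corresponding checkpoint completes.
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper

val props = new Properties()
props.setProperty("bootstrap.servers", "broker:9092")

val exactlyOnceSink = new FlinkKafkaProducer011[String](
  "output-topic",
  new KeyedSerializationSchemaWrapper(new SimpleStringSchema()),
  props,
  FlinkKafkaProducer011.Semantic.EXACTLY_ONCE)

// someStringStream.addSink(exactlyOnceSink)
```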

Savepoints

Another thing Flink does well is implementing savepoints on top of its checkpoints. Different businesses need different recovery points for each application, and being able to specify the version to recover to is very useful. A savepoint restores not only the data but also the computational state.

Features:

  1. Trigger (Processing Time, Event Time, Ingestion Time): by contrast, Flink supports richer streaming semantics; not only processing time but also event time and ingestion time are supported.

  2. Continuous processing & windows: supports pure record-by-record continuous processing, and window support is better than Spark's.

  3. End-to-end guarantees: with two-phase commit, users can choose to sacrifice some throughput, adjusted to business needs, to guarantee end-to-end exactly-once.

  4. CEP: good support.

  5. Savepoints: you can do a kind of version management according to business requirements.

There are also some weak points:

  1. SQL (syntax, functions, parallelism): SQL functionality is not complete. Most users migrate from Hive, and Spark covers more than 99% of Hive SQL, while some SQL functions are not supported in Flink. Also, you currently cannot set the parallelism of a single operator in SQL.

  2. ML, graph, etc.: machine learning, graph computing, and other areas are weaker than in Spark, but the community is working on improving this.

Future plans

Since Ele.me is now part of Alibaba, Flink will be used more in the future, and Blink is also expected to be adopted.

For more information, please visit the Apache Flink Chinese community website