This article compares the Spark and Flink engines. For users, though, the engine is not the only aspect of a data product to consider. Development and operations tooling, runtime environments, technical support, and community all matter when building on top of an engine. Together these make up the product’s ecosystem. The engine sets the upper limit on functionality and performance, so to speak, and the ecosystem determines how much of that capability can actually be put to use.

Overview

Spark is one of the most active Apache projects. It began to gain widespread attention around 2014, and at one point its open source community had thousands of active contributors. The main driving force is Databricks, the company founded by Spark’s original creators. More than 4,000 people attended the Spark+AI Summit in June. After several years of development, Spark is widely regarded as the replacement for the MapReduce engine, thanks to its superior engine and its good integration with the Hadoop ecosystem.

Flink is also a top-level Apache project, and its founders started the company Data Artisans. Its community is not yet as large as Spark’s, but it has a good reputation in the industry, especially in stream processing. Several US companies that are at the forefront of large-scale streaming and have the strongest demand for it, including Netflix, LinkedIn, Uber, and Lyft, have either adopted Flink as their streaming engine or invested heavily in it, with the exception of LinkedIn, which has its own Samza.

Alibaba Group also has a strong influence in the Flink community. In the recent Flink 1.3 to 1.5 releases, several major features were developed by Alibaba, either in collaboration with Data Artisans or independently. Alibaba also runs what is probably the world’s largest stream computing cluster, which is based on Flink.

Unified analytics platform

At the recent Spark+AI Summit, Databricks’ main theme was the Unified Analytics Platform. Its three new releases, Databricks Delta, Databricks Runtime for ML, and MLflow, all revolve around this theme. With machine learning (including deep learning) becoming a bigger part of data processing in recent years, Databricks has once again got its finger on the pulse.

The unified analytics platform echoes Spark’s original goal: letting users meet most of their big data needs in one system. After several years of exploration, a relatively concrete solution to that original problem has taken shape.

It is interesting, though, to see the shift in Databricks’ approach to AI. Before deep learning became popular, Spark’s built-in MLlib should have been adequate in terms of functionality, yet it was not adopted as widely as expected, perhaps for compatibility reasons.

For TensorFlow, the new favorite in deep learning, Spark previously released TensorFrames to provide some integration with the Spark engine. It does not appear to have been very successful, and its impact is probably smaller than that of TensorFlowOnSpark, which Yahoo built from outside the project.

Given this, Spark has shifted to an integration strategy. Databricks Runtime for ML essentially comes with various machine learning frameworks pre-installed and supports starting a framework’s own cluster, such as a TensorFlow cluster, inside a Spark job. The main Spark engine improvement is gang scheduling, which lets a job request multiple executors at once so that a TensorFlow cluster can start properly.
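To make the idea of gang scheduling concrete, here is a minimal sketch of all-or-nothing allocation. It is not Spark’s scheduler; the `Job` class, the `gang_schedule` function, and the slot counts are all hypothetical and only illustrate why a job that needs several executors at once is either placed completely or left waiting.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    executors_needed: int  # all of these must be granted together


def gang_schedule(jobs, free_slots):
    """All-or-nothing allocation: a job gets every executor it asks for, or none."""
    placed, waiting = [], []
    for job in jobs:
        if job.executors_needed <= free_slots:
            free_slots -= job.executors_needed
            placed.append(job.name)
        else:
            # No partial start: a framework cluster that is missing workers
            # could not come up properly, so the job waits instead.
            waiting.append(job.name)
    return placed, waiting, free_slots


# Hypothetical workload: two distributed training jobs and one ETL job.
jobs = [Job("tf-training", 4), Job("etl", 2), Job("tf-tuning", 3)]
placed, waiting, remaining = gang_schedule(jobs, free_slots=6)
print(placed, waiting, remaining)
```

A scheduler that handed out executors one at a time could instead give the last job part of what it needs and leave it stuck; the all-or-nothing check is what avoids that.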

MLflow is not tied to the Spark engine. As a workflow tool, it aims to make data scientists more productive. Its main function is to record and manage machine learning experiments on a per-project basis and to support sharing them. Reproducibility and easy-to-use support for a variety of tools are its key design points. It looks as though Spark will not be doing much as an AI engine for the time being.

Flink’s goal is similar to Spark’s: a unified platform that includes AI is also Flink’s direction. Technically, Flink is capable of supporting good machine learning integration and end-to-end pipelines, and there are already some examples of large-scale online learning built on it. At this stage, however, Flink does not seem to be as far along as Spark in becoming a platform. It is worth mentioning that Flink may be able to support online learning better because of the advantages of its stream processing engine.
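As an illustration of why online learning fits a stream processing model, the sketch below updates a model once per incoming event instead of retraining on a batch. It is plain Python rather than Flink code; the `online_sgd` and `events` functions, the learning rate, and the synthetic data (generated from y = 2x + 1) are assumptions made for the example.

```python
def online_sgd(stream, learning_rate=0.05):
    """Update a one-feature linear model y ~ w*x + b one event at a time."""
    w, b = 0.0, 0.0
    for x, y in stream:
        error = (w * x + b) - y
        # Gradient step on the squared error of this single event.
        w -= learning_rate * error * x
        b -= learning_rate * error
        yield w, b  # the model is usable after every event


def events(n):
    """Hypothetical unbounded-style source: noiseless samples of y = 2x + 1."""
    for i in range(n):
        x = (i % 10) / 10.0
        yield x, 2.0 * x + 1.0


w, b = 0.0, 0.0
for w, b in online_sgd(events(5000)):
    pass  # in a real pipeline each intermediate model could serve predictions
print(round(w, 3), round(b, 3))
```

The point of the generator form is that the model state travels with the stream: there is no separate training phase, which is the property a streaming engine can exploit.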

Data users

Products and ecosystems exist, in the end, to solve the problems of big data users and to generate value from data. Understanding the users of data and their needs gives us a clearer context for discussing the various aspects of an ecosystem.

People who work with data can be broadly divided into the following roles. In practice, several of these roles are likely to be filled by the same people within an organization, and the roles themselves have no universal, clear-cut definition.

  • Data acquisition: Generate or collect data from appropriate locations in products and systems and send it to the data platform.

  • Platform: Provide data import, storage, the computing environment, tools, and so on.

  • Data engineer: Use the data platform to process raw data into data sets that can be used efficiently downstream. Turn the metrics, models, etc. created by analysts and data scientists into efficient and reliable automated processes.

  • Data analysts and data scientists (there is much discussion about the similarities and differences between the two; interested readers can search for it, for example www.jianshu.com/p/cfd94d9e4…): Give meaning to data and find value in it. Where the text below does not distinguish between the two, they are referred to collectively as data analysis.

  • Product managers, management, and decision makers: adjust product and organizational behavior based on the data generated above.

Together these form a complete loop. The order above is the direction in which data flows, while demand is driven in the opposite direction.

The Spark and Flink ecosystems discussed in this article mainly correspond to the data platform layer. Their direct users are mainly data engineers, data analysts, and data scientists. A good ecosystem can greatly simplify the work of data platform teams and data engineers, and makes data analysts and data scientists more autonomous and efficient.

Development environment

API

In terms of APIs, Spark and Flink cover roughly the same functional areas, although the degree of support in each area differs. Overall, Spark’s API has gone through several iterations and is ahead in ease of use, especially in machine learning integration. Flink is somewhat more mature in stream processing.

API              | Spark                                        | Flink
Low-level API    | RDD                                          | Process Function
Core API         | DataFrame / Dataset / Structured Streaming   | DataStream / DataSet / Table API
SQL              | Yes                                          | Yes
Machine learning | MLlib                                        | FlinkML
Graph processing | GraphX                                       | Gelly
Other            |                                              | CEP

The supported languages are also roughly equivalent. Spark has the edge because it has been around longer, especially in Python and R, which are commonly used for data analysis.

Supported language | Spark | Flink
Java               | Yes   | Yes
Scala              | Yes   | Yes
Python             | Yes   | Beta
R                  | Yes   | Third party
SQL                | Yes   | Yes

Connectors

With the API in place, all that is missing is the data. Both Spark and Flink can connect to most common systems. If the system you need is not supported, you can write your own connector.

databricks.com/spark/about

www.slideshare.net/chobeat/dat…

Integrated development tools

The requirements of data engineers and data analysts are a little different in this regard.

Data analysis is exploratory by nature, with more emphasis on interaction and sharing. Notebooks meet these needs well, which makes them an ideal development tool and a great demo tool. Popular notebooks include Apache Zeppelin, Jupyter, and so on. Databricks developed its own Databricks Notebook as the main entry point to its service. Zeppelin supports both Spark and Flink, while Jupyter supports only Spark.

The work of data engineers leans toward well-defined data production, and being able to write code quickly is only one part of it. There are also project management, version control, testing, configuration, debugging, deployment, monitoring, and so on, so the requirements are similar to those for traditional integrated development tools. It is also common to reuse an existing code base of business logic. Notebooks do not meet some of these needs very well. An ideal development tool would be something like IntelliJ plus Spark/Flink, plus plugins that can submit jobs directly to a cluster for debugging, together with workflow management such as Apache Oozie. I have not seen anything in the open source community that puts all of this together, although I have seen some commercial products that come close. Spark and Flink are similar in this respect.

Runtime environment

Deployment mode / cluster management / open source vs. closed source

After the application is developed, it is submitted to the runtime environment. Spark and Flink do a good job of supporting a wide range of mainstream deployment environments.

Deployment environment | Spark | Flink
Standalone program     | Yes   | Yes
Standalone cluster     | Yes   | Yes
YARN                   | Yes   | Yes
Mesos                  | Yes   | Yes
Kubernetes             | Yes   | Yes

Enterprise platform

Since Spark and Flink support a variety of deployment modes, can an enterprise quickly build a platform that supports Spark or Flink using open source code?

It depends on what you want to achieve. The simplest pattern might be a separate cluster for each job, or a separate cluster for each small team. This can be set up quickly, but with a large number of users the cost of centralized operations and maintenance (O&M) may become too high, and users end up having to take part in O&M themselves. Another disadvantage is that the resource allocation is fixed while the load varies, which results in poor resource utilization. Ideally, a large shared multi-tenant cluster improves O&M efficiency and maximizes resource utilization. That takes considerable work, such as supporting different job submission methods, data security, and isolation. For some businesses, using hosted services, including cloud services, may be a worthwhile way to get started.
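The utilization argument can be checked with a little arithmetic. The sketch below uses made-up hourly demand figures for three teams (all numbers and names are hypothetical) and compares dedicated clusters, each sized for its own team’s peak, with one shared cluster sized for the peak of the combined load.

```python
# Hypothetical demand per hour, in executor slots, for three teams
# whose peaks fall at different times.
demand = {
    "team_a": [8, 2, 1, 1],
    "team_b": [1, 8, 2, 1],
    "team_c": [1, 1, 8, 2],
}
hours = 4

# Total slot-hours actually used across all teams.
total_used = sum(sum(load) for load in demand.values())

# Dedicated clusters: each one has a fixed size equal to its team's peak.
dedicated_capacity = sum(max(load) for load in demand.values())

# Shared cluster: sized for the peak of the summed load in any one hour.
combined_load = [sum(hour) for hour in zip(*demand.values())]
shared_capacity = max(combined_load)

dedicated_util = total_used / (dedicated_capacity * hours)
shared_util = total_used / (shared_capacity * hours)

print(f"dedicated: {dedicated_capacity} slots, utilization {dedicated_util:.1%}")
print(f"shared:    {shared_capacity} slots, utilization {shared_util:.1%}")
```

Because the peaks do not coincide, the shared cluster needs far fewer slots for the same work. With loads that peak at the same time the gap would shrink, so the benefit depends on how the tenants’ workloads overlap.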

Community

After years of growth, the Spark community is well ahead in size and activity. As a German company, Data Artisans will also find it harder to expand in the United States. Still, the Flink community has a steady following and has reached a sustainable scale.

Things may be different in China. Compared with American companies, Chinese companies move faster and are more willing to experiment with new technologies. Some of China’s innovative use cases also have higher real-time requirements. All of this makes the environment a little friendlier to Flink.

Flink’s Chinese community has recently launched a series of activities, which are a good opportunity to learn about Flink.

The Spark documentation in Chinese is at www.apachecn.org/bigdata/spa.

The Chinese community of Flink is at Flink-china.org/.

In addition, the Flink Chinese community will hold the Flink Forward China conference in Beijing at the end of this year.

Future development trends

A clear trend in the past two years is the increasing proportion of machine learning in data processing. Spark and Flink both support machine learning and other data processing in one system. Whoever can do better will have the upper hand.

Another trend, perhaps less obvious, is that demand for real-time processing will grow as IoT expands and as computing resources and networks continue to evolve. At present, not many businesses truly need low latency, which is why the same few companies show up every time a new stream processing technology appears. As new application scenarios emerge and the competitive environment evolves, real-time processing is likely to become increasingly important. Flink currently leads in this area, and if it plays its cards well this can become a core advantage.

It is also worth mentioning that open source has become an important consideration for users choosing data products, because they do not want to be locked in to a vendor or to worry about continued support. Without a decisive advantage, closed source products find it increasingly difficult to compete with products based on open source technologies.

Summary

Spark and Flink are both general-purpose, open source, large-scale processing engines that aim to support all data processing in one system in order to improve efficiency. Both have relatively mature ecosystems, and they are the strongest contenders for the next generation of big data engines. Spark’s ecosystem is somewhat more polished overall and, for now, leads in machine learning integration and ease of use. Flink has a clear advantage in stream processing, and its core architecture and model are more thorough and flexible. Both still have plenty of room to improve in ease of use. Going forward, whichever fills in its gaps fastest while playing to its strengths will have more opportunities.

For more information, please visit the Apache Flink Chinese community website