In today's data era, huge amounts of data are generated all the time, across every field and business scenario. How to understand big data and process it effectively has become a problem faced by many enterprises and research institutions. This article starts with the basic features of big data, then explains the idea of divide-and-conquer processing, and finally introduces some popular big data technologies and components. After reading it, you should understand the core concepts of big data, how it is processed, and the popular technologies in the field.

What is big data?

Big data, as the name suggests, means having a large amount of data. As for what exactly big data is, how to define it and how to use it, people from different fields and backgrounds have different understandings. IBM summarizes big data with five V's, which cover most of its features.

  • Volume: A large amount of data, measured in TB (1,024 GB), PB (1,024 TB), EB (1,024 PB), ZB (1,024 EB) or even YB (1,024 ZB). The New York Stock Exchange generates on the order of terabytes of trading data a day, and the Large Hadron Collider near Geneva, Switzerland, generates on the order of petabytes of data a year. The global total is now at the ZB level; one ZB is 1,000,000 PB, or the more familiar one billion TB. With larger data sets, we can build a more complete picture of a subject's history, present and future.

  • Velocity: Data is generated quickly and must be processed with high speed and timeliness, because time is money. Trading data in financial markets must be processed in seconds, and search and recommendation engines need to push real-time news to users within minutes. Faster data processing allows us to make more real-time decisions based on the latest data.

  • Variety: Data comes in a variety of forms, such as numbers, text, pictures and video, and from a variety of sources, such as social networks, video sites, wearable devices and sensors. It can be highly structured, like a table in Excel, or unstructured, like images and videos.

  • Veracity: Data authenticity. On the one hand, data is not naturally accurate and of high quality; outliers can creep in through statistical bias, human emotion, weather, economic factors or even deliberately false reporting. On the other hand, data comes from many different and heterogeneous sources. Connecting, matching, cleaning and transforming such diverse data into data we can trust is a very challenging task.

  • Value: The value of the data. The ultimate purpose of researching and using big data is to provide more valuable decision support. Building on the four V's above, we can explore the deeper value hidden in big data.

In the field of data analysis, all the individuals of a research subject are known as the population, which contains a large amount of data, possibly an infinite amount. Take, for example, the honesty of the citizens of 15 countries. In many cases it is not possible to collect and analyze data for the entire population, so researchers usually work with a subset of it. A sample is a set of individuals selected from the population, a subset of the research object. By investigating and analyzing the sample, researchers can infer the situation of the population as a whole. In the honesty survey, we can take a portion of the citizens of each country as a sample to estimate the honesty level of that country's citizens.

Before big data technology matured, sample sizes were relatively small because of the limits of data collection, storage and analysis capabilities. With the emergence of big data technology, storage and analysis are no longer the bottleneck, and researchers can analyze data at larger scale and higher speed. But data is not inherently valuable; turning it into gold is the challenge. In the honesty survey, if we simply asked our sample, "Have you lied about your assets and those of your family to get a bigger loan?", nine times out of ten we would not get a truthful answer. We can, however, approach the question through a variety of channels, for example by combining the respondents' work histories with their credit records.

It can be seen that big data is characterized by larger volumes, faster speeds and more varied types of data. On a foundation of reasonable data veracity, big data technology ultimately serves the value behind the data.

As big data technology develops, data keeps getting more complex, and some people have proposed additions to the five V's. For example, Vitality has been added to emphasize the dynamic nature of the whole data system; Visualization has been added to emphasize presenting data explicitly; Validity has been added to emphasize lawful and valid data collection and use, especially the responsible use of personal and private data; and Online has been added to emphasize that data is always online and can be queried and computed on at any time.

Distributed computing: divide and conquer

Since the birth of the computer, data processing has generally been done on a single machine. After the arrival of the big data era, traditional single-machine methods could no longer meet the requirements of big data processing, so the engineering practice of organizing a group of computers into a cluster and using the cluster's combined power to process big data gradually became the mainstream approach. This way of computing on a cluster is called distributed computing, and almost all big data systems today perform distributed computing on clusters.

The concept of distributed computing sounds sophisticated, but the idea behind it is simple: divide and conquer. The divide-and-conquer approach splits the original problem into sub-problems, solves them on multiple machines, and then obtains the final result by summarizing the sub-results, with the help of whatever data exchange and merging strategies are necessary. Different distributed computing systems use different algorithms and strategies depending on the problems they are trying to solve, but they all basically split up the computation, place the sub-problems on multiple machines, and solve them in a divide-and-conquer fashion. Each machine (physical or virtual) in a distributed computation is also called a node.
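
To make the idea concrete, here is a minimal Python sketch of divide and conquer on a single machine, with worker processes standing in for cluster nodes. It is a toy illustration, not any particular framework's API: the original problem, summing a large list, is split into chunks, each chunk is summed by a separate worker, and the partial results are merged.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    """Conquer: solve one sub-problem by summing a chunk of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))       # the "original problem"
    n_workers = 4
    chunk_size = len(data) // n_workers

    # Divide: split the original problem into sub-problems.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Solve the sub-problems in parallel, one worker per chunk.
    with Pool(n_workers) as pool:
        partial_results = pool.map(partial_sum, chunks)

    # Merge: combine the sub-results into the final answer.
    print(sum(partial_results))         # 499999500000
```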

Distributed computing has many mature solutions in research and industry, among which Message Passing Interface (MPI) and MapReduce are the best known.

MPI

MPI is a long-established distributed computing framework that mainly solves the problem of data communication between nodes. In the pre-MapReduce era, MPI was the industry standard for distributed computing, and MPI programs are still widely used in supercomputing centers, universities, and government and military research institutes around the world. Much of the large-scale distributed computing in physics, biology, chemistry, energy, aerospace and other basic disciplines relies on MPI.

The divide-and-conquer approach splits a problem into sub-problems that are solved on different nodes, and in most cases the data on those nodes must be exchanged and synchronized during intermediate computation and the final merge. MPI provides the solution for this data communication between processes and nodes.
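
As a rough sketch of what MPI programming feels like, the following example uses the mpi4py Python bindings (assuming mpi4py and an MPI runtime are installed) to compute a sum in parallel: each process works on its own slice of the data, and an explicit reduce call communicates the partial results back to rank 0. Note that the programmer must reason directly about process ranks and communication.

```python
# Run with, for example: mpiexec -n 4 python mpi_sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()     # this process's id within the communicator
size = comm.Get_size()     # total number of processes

# Each process sums its own slice of 0..n-1 (stride partitioning).
n = 1_000_000
local_sum = sum(range(rank, n, size))

# Explicit communication: combine the partial sums on rank 0.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)

if rank == 0:
    print("total =", total)            # 499999500000
```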

MPI's ability to control data communication at a fine-grained level is both its strength and its weakness. Fine-grained control means the programmer must manually handle everything from the divide-and-conquer algorithm design to the data communication to the merging of results. Experienced programmers can optimize a program at a very low level and achieve dramatic speedups. But for programmers without much experience in computer systems and distributed computing, the cost of coding, debugging and running an MPI program is very high, and they still have to deal with problems such as data imbalance between nodes and communication delays; a single node failure can cause the whole program to fail. As a result, for most programmers MPI is simply a nightmare.

Not all programmers are proficient in MPI, and the time cost of a program includes not only the time it takes to run, but also the time it takes to learn, develop and debug it. Just as C is extremely fast but Python is far more popular, MPI offers extremely fast distributed computing, but it is not very approachable.

MapReduce

To address the high cost of learning and using distributed computing, researchers proposed the simpler, easier-to-use MapReduce programming model. MapReduce is a programming paradigm proposed by Google in 2004. Unlike MPI, which leaves everything in the programmer's hands, the MapReduce model only requires the programmer to define two operations: Map and Reduce.

(Figure: in the MapReduce model, data flows through map, shuffle and reduce stages.)
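
The classic illustration of the model is word counting. The following Python sketch simulates it on a single machine: the programmer supplies only the map function (emit a (word, 1) pair per word) and the reduce function (sum the counts for each word); the shuffle step that groups intermediate pairs by key, simulated here with a dictionary, is what a real framework performs across the network between nodes.

```python
from collections import defaultdict

def map_func(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield word, 1

def reduce_func(word, counts):
    """Reduce: aggregate all counts that share the same key."""
    return word, sum(counts)

lines = ["big data is big", "data is valuable"]

# Shuffle: group the intermediate (key, value) pairs by key.
# In a real framework this grouping happens across the network between nodes.
groups = defaultdict(list)
for line in lines:
    for word, count in map_func(line):
        groups[word].append(count)

results = [reduce_func(word, counts) for word, counts in groups.items()]
print(results)   # [('big', 2), ('data', 2), ('is', 2), ('valuable', 1)]
```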

Different teams have implemented their own big data frameworks based on the MapReduce programming model: Hadoop was the original open source implementation and became the industry benchmark for big data, followed by Spark and Flink. These frameworks provide programming interfaces and APIs that help programmers store, process and analyze big data.

Compared with MPI, the MapReduce programming model encapsulates far more of the intermediate process. The programmer only needs to express the original problem with higher-level APIs; details such as how the problem is split into smaller sub-problems, how intermediate data is transmitted and exchanged, and how the computation scales out to multiple nodes are left to the big data framework. MapReduce is therefore relatively easy to learn, easier to use, and faster to program with.

Batch and stream processing

Data and data flow

In the five V's of big data we already mentioned that data volume is large and data is produced quickly. Seen over time, data is generated continuously, forming an unbounded data stream. For example, our movement data accumulates on our phone sensors at every moment, financial transactions happen anytime and anywhere, and sensors constantly monitor and generate data. A bounded portion of a data stream constitutes a data set, and when we talk about analyzing "a piece of data" we usually mean analyzing such a data set. As data is generated faster and faster from more and more sources, and as people care more and more about timeliness, how to process data streams has become a pressing question.

Batch processing

Batch processing means processing data in batches, and examples of batch computing can be found everywhere around us. The simplest examples: WeChat Sport runs a batch job every night that counts the steps a user's friends took during the day, generates a ranking and pushes it to the user; a bank's credit card center runs a batch job on the monthly billing day that totals the month's spending and generates the monthly bill for each user; the National Bureau of Statistics aggregates economic data each quarter and publishes quarterly GDP growth. As these examples show, batch jobs generally run after data has accumulated over a period of time, and a toy sketch of such a job follows below. For applications with a large amount of data, such as WeChat Sport or bank credit cards, the total data accumulated over a period is very large and the computation is time-consuming.
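
As a toy sketch of such a nightly batch job (the data and names below are made up for illustration, not WeChat's actual system), the whole day's accumulated records are read at once, aggregated, sorted and output:

```python
# One day's accumulated step records, gathered before the job runs: (user, steps).
records = [("alice", 8200), ("bob", 12100), ("carol", 560)]

# The batch job runs once, after the day's data has been collected:
# sort users by step count and produce the ranking to be pushed out.
ranking = sorted(records, key=lambda r: r[1], reverse=True)
for position, (user, steps) in enumerate(ranking, start=1):
    print(position, user, steps)
```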

The history of batch computing goes back to the 1960s, when computers were just getting started. Today the most widespread use of batch computing is ETL (Extract, Transform, Load) work in data warehouses, for example in commercial data warehouses represented by Oracle and open source warehouses represented by Hadoop/Spark.

Stream processing

As mentioned above, data is actually produced continuously, as a stream, and stream processing means processing these data streams. Time is money, and analyzing data streams to capture the value of real-time data is becoming more and more important. For an individual user, seeing a WeChat Sport ranking each night feels like a comfortable rhythm, but in the financial sector the cost of time is measured in millions or even billions. During an e-commerce promotion like "Double Eleven", managers want to see real-time sales figures, inventory information and comparisons with competing products within seconds, to gain more time for decisions; stock trading reacts to new information in milliseconds; risk control must handle each fraudulent transaction quickly to reduce unnecessary losses; and network operators need to detect network and data center failures as fast as possible. In each of these scenarios, a failure that delays the service causes incalculable losses, so the faster the response, the smaller the losses and the greater the revenue. The rise of the Internet of Things and 5G will provide an even better technical foundation for data generation: massive data will be collected on IoT devices and transmitted to servers over faster 5G channels, an even larger real-time data flow will surge in, and the demand for stream processing will explode.

Representative big data technologies

As mentioned above, the MapReduce programming model opened the door to big data analysis and processing, and big data frameworks such as Hadoop, Spark and Flink followed.

Hadoop

In 2004, the founders of Hadoop were inspired by a series of Google papers, including the one on the MapReduce programming model, and set out to implement the ideas described there. Hadoop takes its name from a toy elephant belonging to the son of its founder, Doug Cutting. Hadoop is often referred to as Yahoo's open source big data framework, because Doug Cutting joined Yahoo at the time and Yahoo supported much of Hadoop's development during that period. Today, Hadoop is not only a pioneer and leader in the big data field; an entire ecosystem has also formed around it, and Hadoop and its ecosystem are the preferred big data solution for most enterprises.

Although there are many components in the Hadoop ecosystem, there are three main core components:

  • Hadoop MapReduce: Hadoop version of the MapReduce programming model, which can process massive data, mainly for batch processing.
  • HDFS: The Hadoop Distributed File System, Hadoop's storage layer, with high scalability and fault tolerance.
  • YARN: Yet Another Resource Negotiator, the component that manages a Hadoop cluster and allocates computing resources to the various types of big data jobs running on it.

Of the three core components, HDFS stores the data, MapReduce computes on it, and YARN manages the cluster's resources. Beyond these three, the Hadoop ecosystem includes many other well-known components:

  • Hive: With Hive, users can write SQL statements to query structured data stored in HDFS; the SQL is converted into MapReduce jobs and executed.
  • HBase: A distributed database built on top of HDFS that provides users with millisecond-level real-time query services.
  • Storm: Storm is a real-time computing framework for stream processing.
  • Zookeeper: Many components of the Hadoop ecosystem are named after animals, forming a large zoo. Zookeeper is the manager of the zoo and is responsible for coordinating the distributed environment.

Spark

Spark was born in 2009 at the University of California, Berkeley, and was donated to the Apache Foundation in 2013. Spark is a big data computing framework designed to improve the programming model and execution speed of Hadoop MapReduce. Compared to Hadoop, Spark has two major improvements:

  • Ease of use: The MapReduce model is friendlier than MPI, but it is still not convenient, because not all computing tasks can be divided neatly into a Map and a Reduce. Solving one problem may require designing multiple interdependent MapReduce jobs, making the whole program very complex and the code hard to read. Spark provides easy-to-use interfaces and APIs in Java, Scala, Python and R, and supports SQL, machine learning and graph computing, covering most big data computing scenarios.
  • Fast: Hadoop writes the intermediate results of its map and reduce stages to disk, while Spark tries to keep most of its calculations in memory; together with Spark's directed acyclic graph (DAG) optimization, this made Spark over 100 times faster than Hadoop in official benchmarks.

Spark focuses on computing: it improves on Hadoop MapReduce's computation and provides rich computing services, such as APIs in several commonly used data science languages and support for SQL, machine learning and graph computing, all ultimately oriented toward computation. Spark is not a complete replacement for Hadoop; in fact it fits into the Hadoop ecosystem as an important member. A Spark job may request computing resources from YARN, depend on data stored in HDFS, and write its output to HBase. Of course, Spark can also run independently of these Hadoop components.

Spark is mainly aimed at batch processing, and thanks to its excellent performance and easy-to-use interfaces it has become the dominant batch framework. Spark Streaming adds stream processing on top of it, based mainly on the idea of mini-batches: the input stream is divided into many small batches, and each batch is processed with the batch engine. Spark is therefore a framework for both batch and stream computing.
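
For comparison with the hand-rolled MapReduce sketch earlier, here is a minimal PySpark word count (assuming a local Spark installation; exact builder options may vary by version). The same map, shuffle and reduce steps are expressed with a few RDD method calls, and Spark keeps the intermediate data in memory where possible.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.sparkContext.parallelize(["big data is big", "data is valuable"])
counts = (lines.flatMap(lambda line: line.split())   # map: split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # the framework shuffles and reduces

print(counts.collect())   # e.g. [('big', 2), ('data', 2), ('is', 2), ('valuable', 1)]
spark.stop()
```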

Flink

Flink began as an academic project initiated by several German universities and grew into an Apache top-level project in late 2014. Flink is stream-oriented: if Spark is the king of batch processing, Flink is the rising star of stream processing. Before Flink there were already streaming engines such as Storm and Spark Streaming, but they fell short on some features.

Flink is a big data engine that supports stateful computation over bounded and unbounded data streams. It is event-driven and supports SQL, state, watermarks and other features. It supports exactly-once semantics, meaning each event is reflected in the results once and only once, no more and no less, which improves data accuracy. Compared with Storm, it offers higher throughput, lower latency and guaranteed accuracy; compared with Spark Streaming, it performs true event-by-event real-time computation and needs relatively fewer computing resources.

As mentioned earlier, data is generated as a stream, and streams can be divided into bounded and unbounded ones. Batch processing operates on a bounded data stream and is therefore a special case of stream processing. Based on this idea, Flink has evolved into a big data framework that supports both stream and batch processing.

Over the years, Flink's API has matured to support Java, Scala and Python, as well as SQL. Flink's Scala API is very similar to Spark's, and programmers with Spark experience can get familiar with the Flink API in an hour.
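
As a rough sketch of Flink's high-level API, the following example uses the PyFlink Table API (assuming Flink 1.13 or later with pyflink installed; details differ across versions) to express a word count declaratively in SQL, over a small in-memory table standing in for a real stream source such as Kafka.

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Create a table environment in streaming mode (batch mode is also available).
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A tiny in-memory source; in practice this would be Kafka, files and so on.
words = t_env.from_elements([("big",), ("data",), ("is",), ("big",)], ["word"])
t_env.create_temporary_view("words", words)

# Declarative aggregation over the (in general unbounded) stream of words.
result = t_env.sql_query("SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")
result.execute().print()
```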

Like Spark, Flink is currently computation-oriented and highly integrated with the Hadoop ecosystem. Spark and Flink have different strengths and are learning from each other as they compete. It remains to be seen which one will prevail.

Summary

Big data is generally processed with distributed computing based on the idea of divide and conquer. After more than a decade of development, a large number of excellent components and frameworks have emerged in the big data ecosystem. These components encapsulate the underlying technical details and expose easy-to-use APIs to programmers. In big data analysis and processing, Hadoop has developed into a very mature ecosystem covering many of the field's basic services, while Spark and Flink focus on big data computation and have established their advantages in batch processing and stream processing respectively.