Hadoop
Let’s start by looking at the problem Hadoop solves: the reliable storage and processing of big data, data too big for one machine to store and too big for one machine to process in the required time.
HDFS provides highly reliable file storage on a cluster of commodity PCs. It stores multiple replicas of each block so that data survives server or hard-disk failures.
MapReduce, through its simple Mapper and Reducer abstractions, provides a programming model that can process large data sets concurrently and in a distributed fashion on an unreliable cluster of tens or hundreds of PCs, while hiding the computational details of concurrency, distribution (such as inter-machine communication), and failure recovery. The Mapper and Reducer abstractions are the basic elements into which all kinds of complex data processing can be decomposed. Complex data processing is thus broken down into a directed acyclic graph (DAG) of multiple jobs (each consisting of a Mapper and a Reducer); each Mapper and Reducer is then executed on the Hadoop cluster to produce the result.
Take WordCount as an example (see WordCount on the Hadoop Wiki). If you are not familiar with MapReduce, this example will help you understand it.
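For reference, here is a minimal sketch of such a job, written in Scala against the Hadoop MapReduce API (the Wiki's version is in Java; treat this as an illustrative outline rather than a tuned job, with the input and output directories taken from the command line):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: emit (word, 1) for every token of every input line.
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
  }
}

// Reducer: sum the counts that the Shuffle delivered for each word.
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[IntSumReducer])   // local pre-aggregation before the Shuffle
    job.setReducerClass(classOf[IntSumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args(1)))  // output directory
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}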
Shuffle is a crucial step in MapReduce. It is precisely because Shuffle is invisible that developers writing data processing on MapReduce can remain completely unaware of distribution and concurrency.
Broadly defined, Shuffle refers to the sequence of steps between Map and Reduce shown in the diagram.
Limitations and shortcomings of Hadoop
However, MapReduce has the following limitations that make it difficult to use.
Low level of abstraction: everything must be hand-coded, which makes it hard to get started.
It provides only two operations, Map and Reduce, which limits its expressive power.
A Job has only two phases, Map and Reduce. Complex computations require a large number of Jobs, and the dependencies between Jobs are left for developers to manage themselves.
The processing logic is buried in the details of the code; there is no view of the overall logic.
Intermediate results are also stored in HDFS.
A Reduce task cannot start until all Map tasks have completed.
Latency is high; it suits only batch processing, with inadequate support for interactive and real-time data processing.
Performance is poor for iterative data processing.
For example, joining two tables using MapReduce is a tricky process, as shown below:
As a result, a number of related technologies have emerged since Hadoop's launch to address these limitations, such as Pig, Cascading, JAQL, Oozie, Tez, and Spark.
Apache Pig
Apache Pig is also part of the Hadoop ecosystem. Pig provides an SQL-like language (Pig Latin) for processing large-scale semi-structured data through MapReduce. Pig Latin is a higher-level procedural language that abstracts the design patterns of MapReduce into operations such as Filter, GroupBy, Join, and OrderBy, which can be combined into directed acyclic graphs (DAGs). For example, the following program:
describes the entire data-processing flow.
Pig Latin is in turn compiled into MapReduce and executed on a Hadoop cluster. After compilation, the above program produces the Map and Reduce stages shown in the following figure:
Apache Pig addresses MapReduce's problems of excessive hand-written code, hidden semantics, and a limited set of operations. Similar projects include Cascading and JAQL.
Apache Tez
Apache Tez is part of Hortonworks' Stinger Initiative. As an execution engine, Tez also provides directed acyclic graphs (DAGs). A DAG consists of Vertices and Edges, where an Edge is an abstraction of data movement and can be One-To-One, Broadcast, or Scatter-Gather; a Shuffle is performed only for Scatter-Gather edges.
Take the following SQL as an example:
In the figure, the blue squares represent Map, the green squares represent Reduce, and the clouds represent write barriers (which here can be understood as forced persistent writes of intermediate results). Tez's optimizations are mainly reflected in removing the write barrier between two consecutive jobs and removing the redundant Map stages within a workflow.
By providing DAG semantics and operations, and thus a view of the overall logic, Tez improves data-processing performance by eliminating unnecessary operations.
Apache Spark
Apache Spark is an emerging engine for big-data processing. It provides a distributed in-memory abstraction across the cluster to support applications that need a working set.
That abstraction is the Resilient Distributed Dataset (RDD), an immutable, partitioned collection of records; RDDs are also Spark's programming model. Spark provides two kinds of operations on RDDs: transformations and actions. A transformation defines a new RDD and includes map, flatMap, filter, union, sample, join, groupByKey, cogroup, reduceByKey, cross, sortByKey, mapValues, and so on; an action returns a result and includes collect, reduce, count, save, lookup(key), and others.
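To make the distinction concrete, here is a minimal sketch (assuming an existing SparkContext sc); transformations only describe new RDDs, while actions trigger computation and return results:

val rdd      = sc.parallelize(1 to 100)   // create an RDD from a local collection
val evens    = rdd.filter(_ % 2 == 0)     // transformation: defines a new RDD, nothing runs yet
val doubled  = evens.map(_ * 2)           // transformation: still lazy
val total    = doubled.reduce(_ + _)      // action: runs the job and returns a value
val firstTen = doubled.take(10)           // action: returns the first ten elements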
Spark’s API is very simple to use. An example of Spark’s WordCount is shown below:
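A minimal sketch of that WordCount, matching the description that follows (the HDFS paths are placeholders and sc is an existing SparkContext):

val file   = sc.textFile("hdfs://...")             // RDD of the lines of the input files
val counts = file.flatMap(line => line.split(" ")) // RDD of words
                 .map(word => (word, 1))           // RDD of (word, 1) pairs
                 .reduceByKey(_ + _)               // RDD of (word, count) pairs
counts.saveAsTextFile("hdfs://...")                // action: writes the result back to HDFS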
Here file is the RDD created from the files in HDFS, and each subsequent flatMap, map, and reduceByKey creates a new RDD. A very short program can thus perform many transformations and actions.
In Spark, all RDD transformations are lazily evaluated. A transformation produces a new RDD whose data depends on the data of the original RDD, and each RDD consists of multiple partitions. A program therefore essentially constructs a directed acyclic graph (DAG) of interdependent RDDs, and performing an action on an RDD submits this DAG to Spark as a Job.
For example, the WordCount program above generates the following DAG:
Spark schedules the DAG Job: it determines stages, partitions, pipelines, tasks, and caches, performs optimizations, and runs the Job on the Spark cluster. Dependencies between RDDs are classified as wide dependencies (a partition depends on multiple parent partitions) and narrow dependencies (a partition depends on only one parent partition); stage boundaries are drawn at wide dependencies, and tasks are divided by partition.
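A hedged illustration (assuming a SparkContext sc): the map steps below are narrow dependencies and can be pipelined within one stage, while reduceByKey introduces a wide dependency and therefore a new stage:

val nums    = sc.parallelize(1 to 1000, numSlices = 4)
val squared = nums.map(x => x * x)            // narrow: each output partition reads one input partition
val keyed   = squared.map(x => (x % 10, x))   // narrow: same partitioning, pipelined with the step above
val summed  = keyed.reduceByKey(_ + _)        // wide: needs a shuffle, so a stage boundary is placed here
println(summed.toDebugString)                 // prints the lineage; shuffle boundaries mark the stages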
Spark supports two fault-recovery methods: Lineage, which rebuilds lost data by re-running the earlier transformations recorded in the RDD's lineage, and Checkpoint, which stores the data set in persistent storage.
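A brief sketch of both mechanisms (the paths are placeholders): without the checkpoint, a lost partition of errors would be recomputed from its lineage; with it, the data is reloaded from persistent storage:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")                   // placeholder checkpoint location
val errors = sc.textFile("hdfs:///logs").filter(_.contains("ERROR"))
errors.checkpoint()   // materialize this RDD to reliable storage and truncate its lineage
errors.count()        // the action that actually triggers the computation (and the checkpoint)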
Spark also supports iterative data processing much better: the data used in each iteration can be kept in memory instead of being written out to files.
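As a hedged illustration, a toy iterative computation might look like the following (a one-dimensional gradient descent; the input path is a placeholder). The call to cache() keeps the parsed data in memory, so the ten passes reuse it instead of re-reading it from HDFS:

val xs = sc.textFile("hdfs:///data/points.txt").map(_.toDouble).cache()
val n  = xs.count()

var w = 0.0                                      // 1-D model: converge toward the mean of the data
for (_ <- 1 to 10) {
  val gradient = xs.map(x => w - x).sum() / n    // each pass reads the cached RDD, not HDFS
  w -= 0.1 * gradient
}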
Spark's performance is significantly better than Hadoop's. In October 2014 Spark completed the Daytona GraySort category of the Sort Benchmark, with all sorting done on disk; the results, compared with Hadoop's earlier record, are shown in the table below:
As the table shows, Spark sorted 100 TB of data (one trillion records) using only about 1/10 of the computing resources Hadoop used, and in only about 1/3 of the time.
Spark provides a unified data-processing platform for batch processing (Spark Core), interactive queries (Spark SQL), stream processing (Spark Streaming), machine learning (MLlib), and graph computation (GraphX). This is a big advantage over using Hadoop.
"One Stack To Rule Them All," as Databricks puts it.
This matters especially when you need to do some ETL work, then train a machine-learning model on the result, and finally run some queries. With Spark, all three parts can be written in a single program, forming one large directed acyclic graph (DAG), and Spark can optimize that DAG as a whole.
For example, the following program:
The first line of the program selects some points using Spark SQL, the second line trains a model on those points with MLlib's k-means algorithm, and the third line uses Spark Streaming to apply the trained model to messages arriving on a stream.
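A hedged sketch of those three steps, expanded slightly for readability (the table name, columns, and socket source are placeholders; sqlContext and ssc are an existing SQLContext and StreamingContext):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// 1. Query the training points with Spark SQL.
val points = sqlContext.sql("SELECT x, y FROM events")
  .rdd.map(row => Vectors.dense(row.getDouble(0), row.getDouble(1)))

// 2. Train a k-means model on those points with MLlib.
val model = KMeans.train(points, 10, 20)   // 10 clusters, 20 iterations

// 3. Apply the trained model to messages arriving on a stream.
ssc.socketTextStream("localhost", 9999)
   .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
   .map(v => model.predict(v))             // assign each incoming point to a cluster
   .print()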
Lambda Architecture
Lambda Architecture is a reference model for a big data processing platform, as shown in the figure below:
It has three layers: the Batch Layer, the Speed Layer, and the Serving Layer. Because the Batch Layer and the Speed Layer implement the same data-processing logic, using Hadoop as the Batch Layer and Storm as the Speed Layer means maintaining two codebases written with different technologies.
Spark can serve as an integrated solution for the Lambda Architecture, roughly as follows:
Batch Layer (HDFS + Spark Core): incremental data is appended to HDFS in real time, and Spark Core batch-processes the full data set to produce views of the full data.
Speed Layer (Spark Streaming): processes the real-time incremental data and generates low-latency views of the real-time data.
Serving Layer (HDFS + Spark SQL, and perhaps BlinkDB): stores the views output by the Batch Layer and the Speed Layer, serves low-latency ad-hoc queries, and merges the batch views with the real-time views.
Conclusion
If MapReduce is seen as a low-level abstraction for distributed data processing, analogous to the AND, OR, and NOT gates among logic gates, then Spark's RDD is a high-level abstraction for distributed big-data processing, analogous to an encoder or decoder in logic circuits.
An RDD is a distributed collection of data. Operating on this collection is as intuitive and simple as operating on an in-memory collection in functional programming, yet behind the scenes each operation is decomposed into a series of tasks and dispatched to a cluster of dozens or hundreds of servers. Apache Flink, a recently launched big-data processing framework, likewise uses data sets and the operations on them as its programming model.
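For instance, the same logic reads almost identically on a local collection and on an RDD (a small sketch, assuming a SparkContext sc); only the RDD version is decomposed into tasks and executed on the cluster:

val local       = List(1, 2, 3, 4).map(_ * 2).filter(_ > 4).sum                   // runs in the local JVM
val distributed = sc.parallelize(Seq(1, 2, 3, 4)).map(_ * 2).filter(_ > 4).sum()  // runs on the cluster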
The directed acyclic graph (DAG) of RDDs is generated and optimized by the scheduler and then executed on the Spark cluster. Spark also provides an execution engine similar to MapReduce's, but one that relies more on memory and less on disk, for better performance.
So which of Hadoop's problems does Spark solve?
Low level of abstraction: everything must be hand-coded, which makes it hard to get started.
=> With the RDD abstraction, the code for the actual data-processing logic is very short.
It provides only two operations, Map and Reduce, which limits its expressive power.
=> Spark provides many transformations and actions; many basic operations, such as join and groupBy, are already implemented as RDD transformations and actions.
A Job has only two phases, Map and Reduce. Complex computations require a large number of Jobs, and the dependencies between Jobs are left for developers to manage themselves.
=> A Job can contain many RDD transformations, from which the scheduler generates multiple stages; if the partitioning is unchanged across several map operations, they can be pipelined within the same Task.
The processing logic is buried in the details of the code; there is no view of the overall logic.
=> In Scala, RDD transformations support a fluent, chained API through anonymous functions and higher-order functions, which gives a holistic view of the processing logic; because the code does not contain the implementation details of the individual operations, the logic is clearer.
Intermediate results are also stored in HDFS.
=> Intermediate results are kept in memory; if they do not fit in memory, they are written to local disk rather than to HDFS.
A Reduce task cannot start until all Map tasks have completed.
=> Transformations that preserve the partitioning are pipelined and run within a single Task; transformations that change the partitioning require a Shuffle and are split into separate stages, which can start only after the preceding stages have completed.
Latency is high; it suits only batch processing, with inadequate support for interactive and real-time data processing.
=> Spark Streaming provides Discretized Streams (DStreams), which process stream data by splitting the stream into small batches; see the sketch after this list.
Performance is poor for iterative data processing.
=> Caching data in memory improves the performance of iterative computation.
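As referenced above, a minimal sketch of the DStream model: the stream is split into one-second micro-batches, and the familiar word-count logic runs on each batch (the socket source and port are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc  = new StreamingContext(conf, Seconds(1))            // 1-second micro-batches

val lines  = ssc.socketTextStream("localhost", 9999)         // placeholder stream source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                               // print each batch's counts

ssc.start()
ssc.awaitTermination()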
The replacement of Hadoop MapReduce by a new generation of big-data processing platforms is therefore a clear technology trend, and among that new generation Spark is currently the most widely recognized and supported. This can be seen from the various Spark-based platforms announced by the vendors participating in Spark Summit 2014.