Welcome to “Spark: From Getting Started to Mastering” by the Meitu Data Technology team. This series of articles introduces Spark from the basics all the way down to its underlying architecture; whatever your starting point, there should be something here for you.
What is Spark?
Spark is an open-source, general-purpose parallel framework similar to Hadoop MapReduce, developed by UC Berkeley's AMP Lab. It is a fast, general big data processing engine and a lightweight unified platform designed for large-scale data processing.
When we talk about Spark, we may mean different things: a Spark application that replaces MapReduce, that is, a big data batch program running on Yarn with its data stored in HDFS; components such as Spark SQL and Spark Streaming; or even a unified big data platform built together with Tachyon and Mesos, in other words the Spark ecosystem.
Figure 1
Today, Spark is far more than an alternative to MapReduce: it has evolved into an ecosystem with many sub-projects. As shown in Figure 1, the Spark ecosystem can be divided into four layers:
- Data storage layer: distributed file systems or databases, represented by HDFS and Tachyon;
- Resource management layer: resource managers such as Yarn and Mesos;
- Data processing engine: Spark itself;
- Application layer: the many projects built on top of Spark.
Spark SQL provides an API for interacting with Spark through HiveQL (Hive Query Language, the SQL dialect of Apache Hive). Each database table is treated as an RDD, and Spark SQL queries are translated into Spark operations. Spark Streaming processes real-time data streams, allowing programs to handle real-time data much like ordinary RDDs.
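As an illustration, here is a minimal sketch of querying a Hive table through Spark SQL, assuming a Spark 1.x-era HiveContext API and a hypothetical table named `access_log`:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("SparkSQLExample"))
val hiveContext = new HiveContext(sc)

// Each Hive table is treated as a data set Spark can query;
// the HiveQL statement below is translated into ordinary Spark operations.
val topUsers = hiveContext.sql(
  "SELECT user_id, COUNT(*) AS cnt FROM access_log GROUP BY user_id ORDER BY cnt DESC LIMIT 10")
topUsers.show()
```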
The next series of articles will describe the other modules and sub-projects in the Spark ecosystem in detail, and then explain the characteristics and principles of the data processing engine Spark by comparing it with MapReduce.
The characteristics of Spark
According to Google and Baidu search trends, interest in Spark has caught up with or even surpassed Hadoop, suggesting that Spark has become the de facto standard for big data computing and that anyone working with big data technology can hardly bypass it.
Among big data storage, computing, and resource scheduling, Spark mainly solves the computing problem, that is, it mainly replaces MapReduce, while many companies still rely on HDFS for underlying storage and Yarn for resource scheduling. Why do enterprises choose Spark as the processing engine in the Hadoop ecosystem? Let's take a closer look at what it offers.
1. Fast. Spark performs most computation in memory (and some on disk).
2. Easy to develop. Compared with Hadoop's Map-Reduce computing model, Spark's RDD-based model makes it easier to understand and implement complex functionality, such as secondary sorting and topN operations.
3. Highly versatile. Spark provides Spark RDD, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX and other components, so common big data tasks such as offline batch processing, interactive queries, streaming computation, machine learning, and graph computation can all be handled in one stack.
4. Well integrated with Hadoop. Spark fits neatly into the Hadoop ecosystem: HDFS, Hive, and HBase remain popular big data solutions, Yarn handles resource scheduling, and Spark handles big data computation.
5. Extremely active. Spark is a top-level Apache Foundation project with a large number of outstanding engineers around the world as its committers, and it is used at scale by many of the world's top IT companies.
Now look at MapReduce, which also addresses the computing problem. Figure 2 shows how MapReduce computes WordCount.
Figure 2
MapReduce addresses a variety of scenarios in big data processing, but its limitations are clear:
- MapReduce provides only Map and Reduce operations, so it lacks expressive power and complex computations require a large number of jobs.
- Intermediate results are written to HDFS, which makes iterative computation inefficient.
- It is designed for batch processing; support for real-time and interactive data processing is weak.
- It requires a lot of low-level code to get started: the WordCount program shown above needs at least three Java classes (Map, Reduce, and Job), which are not listed here.
Many projects have tried to improve on these limitations (Tez, for example). Now take a look at how Spark handles the same job in Figure 3:
Figure 3
First of all, Spark provides rich operators (textFile, flatMap, map, reduceByKey, etc.) and does not need to write intermediate results to HDFS. For the WordCount program above, Spark needs only the following line of code:
sc.textFile(s"${path}").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).saveAsTextFile("hdfs://xxx")
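For readability, the same job can be written step by step; this is only a sketch, assuming an existing SparkContext `sc` and placeholder HDFS paths:

```scala
// The same word count, step by step (paths are placeholders; `sc` is the application's SparkContext)
val lines  = sc.textFile("hdfs://xxx/input")   // read the input file as an RDD of lines
val words  = lines.flatMap(_.split(" "))       // split each line into words
val pairs  = words.map(word => (word, 1))      // pair each word with a count of 1
val counts = pairs.reduceByKey(_ + _)          // sum the counts per word (this step triggers a shuffle)
counts.saveAsTextFile("hdfs://xxx/output")     // write the result back to HDFS
```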
Figure 4 compares Spark and MapReduce as data processing engines. The scale of data processing deserves a special mention: after Spark was born there was a lot of skepticism in the community about how much data it could handle, so an official experiment sorting one petabyte of data was published, and the processing time broke the record of the day. However, we should not forget that in real production we are not running a single program or task. In a shared cluster, poorly optimized Spark programs waste a lot of memory and force other programs to wait in the queue, so in practice the data volume Spark handles may end up smaller than what MapReduce handles. (We'll cover Spark memory tuning later in this series.)
Figure 4
As for the last point, fault tolerance: MapReduce saves the result of every operation to disk, so when a computation fails it can easily recover from disk. Spark instead recomputes data based on the lineage information recorded in the RDD, which consumes some resources. Spark provides two recovery mechanisms: recomputing the lost data from its upstream dependencies, and checkpointing, which persists a data set to stable storage. In theory, if you add a checkpoint after every small step, Spark's fault tolerance can be as robust as MapReduce's; of course, very few people do this.
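A minimal checkpointing sketch, assuming an existing SparkContext `sc` and placeholder paths:

```scala
// A checkpointing sketch (`sc` is the application's SparkContext; the directory is a placeholder)
sc.setCheckpointDir("hdfs://xxx/checkpoints")     // where checkpointed RDDs are persisted

val cleaned = sc.textFile("hdfs://xxx/input")
  .map(_.trim)
  .filter(_.nonEmpty)

cleaned.cache()        // keep the RDD in memory so checkpointing does not recompute it from scratch
cleaned.checkpoint()   // cut the lineage: recovery reads the checkpointed data instead of recomputing it
cleaned.count()        // the first action materializes the RDD and writes the checkpoint
```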
So far we have compared Spark with MapReduce to see how Spark improves on MapReduce's limitations and why it is fast and general. Next, let's look at Spark's design philosophy and execution process to explain why it can do these things.
Basic principles of Spark
Figure 5
As shown in Figure 5, in a Spark cluster one node acts as the driver and creates the SparkContext, the entry point of a Spark application, which is responsible for requesting computing resources and coordinating the executors on the Worker nodes. Workers are started according to the parameters supplied by the user, and a Worker node runs several executors. An executor is a process that runs its own tasks, and every task executes the same piece of code over different data.
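A sketch of how the driver side is set up; the resource settings below are illustrative only, and their exact effect depends on the cluster manager:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative resource settings for the executors the application will request
val conf = new SparkConf()
  .setAppName("WordCount")
  .set("spark.executor.instances", "4")   // how many executor processes to request (e.g. on Yarn)
  .set("spark.executor.cores", "2")       // cores per executor, i.e. concurrent tasks per executor
  .set("spark.executor.memory", "4g")     // memory per executor process

val sc = new SparkContext(conf)           // the driver creates the SparkContext, the application's entry point
```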
Figure 6
Figure 6 shows the concrete execution flow of Spark. The client submits a job, the main function of the user code is invoked via reflection, and then SparkContext and CoarseGrainedExecutorBackend are initialized.
SparkContext initialization includes creating the monitoring UI (SparkUI), the execution environment (SparkEnv), the SecurityManager, the DAGScheduler that divides the job into stages, the task scheduler TaskSchedulerImpl, and the driver-side CoarseGrainedSchedulerBackend that communicates with the executors.
The DAGScheduler divides the job into stages and submits the TaskSet of each stage in turn to TaskSchedulerImpl. TaskSchedulerImpl hands the TaskSet to the driver-side CoarseGrainedSchedulerBackend, which then sends LaunchTask messages one by one to the remote CoarseGrainedExecutorBackend. When a CoarseGrainedExecutorBackend receives a LaunchTask event, it asks its Executor to execute the task, and the task ultimately runs in TaskRunner's run method.
So how does the DAGScheduler divide the job in step 4? How are the stages and tasks that executors execute generated? Let's walk through an example of job division and execution.
Figure 7
Figure 7 shows a Spark program that reads data from HDFS into RDD-A and flatMaps it to RDD-B, reads another data set into RDD-C, maps it to RDD-D, and aggregates RDD-D into RDD-E. RDD-B and RDD-E are then joined into RDD-F, and the result is saved back to HDFS.
Based on the different RDD dependencies, Spark divides this program into four stages. Stage0 and Stage1 have no dependency on each other and can run in parallel, while Stage2 has to wait for Stage1 to finish. Aggregating RDD-D into RDD-E, and joining the RDD-B and RDD-E produced by Stage0 and Stage2 into RDD-F, both produce a shuffle. Stages without dependencies can run in parallel, but jobs themselves are executed sequentially by Spark; if you want to run jobs in parallel, you can submit them from multiple threads in the driver.
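A sketch of running independent jobs in parallel from the driver using Scala Futures, assuming an existing SparkContext `sc` and placeholder paths:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each count() below is an independent job; submitting them from separate threads
// lets Spark schedule their (independent) stages at the same time.
val jobA = Future { sc.textFile("hdfs://xxx/partA").count() }
val jobB = Future { sc.textFile("hdfs://xxx/partB").count() }

val countA = Await.result(jobA, Duration.Inf)
val countB = Await.result(jobB, Duration.Inf)
```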
With this DAG, Spark has full knowledge of the relationships between the data sets, so when some tasks fail, the lost RDD partitions can be recomputed from those relationships.
Wide dependencies and narrow dependencies
A narrow dependency means that each partition of the parent RDD is used by at most one partition of the child RDD, and a child RDD partition typically corresponds to a constant number of parent RDD partitions.
A wide dependency means that each partition of the parent RDD may be used by multiple child RDD partitions, and a child RDD partition typically depends on all partitions of the parent RDD. This concept is covered in the following examples.
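For intuition, here is a small sketch (assuming an existing SparkContext `sc`): map and filter produce narrow dependencies and can be pipelined, while reduceByKey introduces a wide dependency and therefore a shuffle.

```scala
// `sc` is the application's SparkContext; the numbers are arbitrary sample data.
val nums = sc.parallelize(1 to 100, 4)

// Narrow dependencies: each output partition depends on exactly one input partition,
// so filter and map can be pipelined inside a single stage.
val squared = nums.filter(_ % 2 == 0).map(n => (n % 10, n * n))

// Wide dependency: every output partition may read from all input partitions,
// so reduceByKey introduces a shuffle and a new stage boundary.
val sums = squared.reduceByKey(_ + _)
sums.collect().foreach(println)
```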
Spark provides a rich set of operators and common operations, but how does this way of dividing jobs and computing in parallel let Spark deliver the fast results promised by in-memory computing? Spark is said to be good at iterative computation, so let's compare it with MapReduce on a classic iterative problem, the PageRank algorithm.
Figure 8, via http://www.jos.org.cn/jos/ch/reader/create_pdf.aspx?file_no=5557&journal_id=jos
Figure 8 shows an iteration of the PageRank algorithm in MapReduce. Note that the gray parts are the data that must be written to disk.
Figure 9, via http://www.jos.org.cn/jos/ch/reader/create_pdf.aspx?file_no=5557&journal_id=jos
Figure 9 shows an iteration of the PageRank algorithm executed by Spark, with several improvements over MapReduce. First, when memory is sufficient, Spark allows users to cache commonly used data in memory, which speeds up execution. Second, Spark tracks data dependencies precisely and schedules tasks according to wide and narrow dependencies, enabling pipelined execution and improving the system's flexibility.
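A simplified PageRank sketch along these lines, close to the well-known Spark example (assuming an existing SparkContext `sc`, an edge-list input, a fixed number of iterations, and the usual 0.85 damping factor; paths are placeholders):

```scala
// Input lines look like "pageA pageB", meaning pageA links to pageB
val links = sc.textFile("hdfs://xxx/edges")
  .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
  .groupByKey()
  .cache()                               // the link structure is reused in every iteration, so keep it in memory

var ranks = links.mapValues(_ => 1.0)    // start every page with the same rank

for (_ <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (dests, rank) => dests.map(dest => (dest, rank / dests.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile("hdfs://xxx/ranks")
```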
Figure 10: the second iteration of the PageRank algorithm in MapReduce, via http://www.jos.org.cn/jos/ch/reader/create_pdf.aspx?file_no=5557&journal_id=jos
Figure 11: the second iteration of the PageRank algorithm in Spark, via http://www.jos.org.cn/jos/ch/reader/create_pdf.aspx?file_no=5557&journal_id=jos
As the figures show, Spark can assign RDD partitions connected by narrow dependencies to a single task and pipeline them; data inside a task does not need to travel over the network, and tasks do not interfere with each other. As a result, Spark performs only three shuffles across the two iterations.
Within a single iteration, the performance of MapReduce and Spark may not differ much, but as the number of iterations grows the gap becomes obvious. Spark's dependency-based task scheduling significantly reduces the number of shuffles compared with MapReduce, which is why Spark is well suited to iterative workloads.
This article introduces Spark from the perspectives of concept, features, and principles. The next article will introduce the operation process and mechanism of Spark on Yarn in detail.
Attached: Glossary of Spark