First, let’s look at the framework of big data technology from a macro perspective:

  


Figure 1. Big data technology framework

As Figure 1 shows, data source, data collection, data storage, and resource management form the foundation for data analysis and processing. The computing frameworks in the figure cover batch processing, interactive analysis, and stream processing:

Batch computing: no strict latency requirements, high throughput

Interactive computing: supports SQL-like languages for fast data analysis

Stream computing: data flows into the system continuously and must be processed and analyzed in real time

Moving from batch to interactive to streaming, the real-time requirements grow progressively stricter. Spark sits at the fourth layer of this stack, the computing-framework layer, and it can serve all three workloads well, which is one of the reasons it is so popular. Once the data has been computed, we can analyze it, train machine learning models, and so on, then visualize the results for users or build intelligent services on top of them to extract business value.

Spark Ecosystem

  


Figure 2 Spark ecosystem

Spark data is stored in distributed storage systems such as HDFS and HBase

HDFS splits files into equally sized blocks (128 MB by default) and stores them across multiple machines, behaving like one large, highly fault-tolerant disk. The usual architecture is a NameNode (which stores metadata) plus multiple DataNodes; to guard against NameNode failure there is also a Standby NameNode.
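
As a small illustration of the block layout described above, here is a sketch that uses the Hadoop FileSystem API from Scala to list a file's blocks and the DataNodes holding them; the NameNode address and file path are placeholders rather than values from this article.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://nn:8020")            // placeholder NameNode address
val fs = FileSystem.get(conf)
val status = fs.getFileStatus(new Path("/input"))     // placeholder file path
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
blocks.foreach { b =>
  // one entry per (up to) 128 MB block, with the DataNodes that hold its replicas
  println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
}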

  


Figure 3 HDFS architecture

Spark uses YARN for resource management and scheduling: Spark can run on YARN, and YARN can centrally manage and schedule many kinds of applications. Spark has four main modules: Spark SQL, Spark Streaming, GraphX (graph-parallel computation), and MLlib.

Spark was created to address the limitations of the MapReduce framework, although it does not replace MapReduce in every case.

Limitations of the MapReduce framework

Only Map and Reduce operations are supported, whereas Spark offers a rich set of Transformation and Action APIs

Low processing efficiency

When the computation logic is complex, it is translated into multiple MapReduce jobs, and each job repeatedly reads from and writes to disk. Disk I/O is expensive: Map writes its intermediate results to local disk, Reduce writes to HDFS, and successive MapReduce jobs exchange data by reading it back from HDFS. (Why disk? MapReduce appeared around 2004, when memory was expensive, so disks were the practical way to store large volumes of data; today memory costs roughly what disk did then, and disk roughly what tape did.)

Task scheduling and startup overhead is high

Memory cannot be fully utilized

Sorting is required on both the Map and Reduce ends

Not suitable for iterative computing (e.g. machine learning and graph computation), interactive processing (data mining), or stream processing (click-log analysis)

MapReduce programming is not flexible enough, so it is worth trying the Scala functional programming style instead, as the short sketch below illustrates
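
To see why the functional style helps, here is a word count written with nothing but Scala collections (the input lines are made up); the same flatMap/map/group pattern carries over almost verbatim to Spark RDDs.

val lines = Seq("spark makes big data simple", "big data needs big tools")   // made-up sample input
val counts = lines
  .flatMap(_.split("\\s+"))                                    // split each line into words
  .groupBy(identity)                                           // group identical words
  .map { case (word, occurrences) => word -> occurrences.size }
counts.foreach { case (word, n) => println(s"$word: $n") }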

Moreover, there are many big data computing frameworks, each specialized for its own kind of workload:

Batch processing: MapReduce, Hive, Pig

Stream processing: Storm

Interactive computing: Impala

Spark, on the other hand, can handle batch processing, stream computing, and interactive computing in a single engine, which reduces users’ learning costs.
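
To make that concrete, here is a minimal sketch in which a single application runs both a batch-style RDD job and an interactive-style SQL query; it assumes the SparkSession entry point from the Spark SQL module, and the input path is a placeholder. A streaming job could attach to the same application in a similar way.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("unified-example").master("local[*]").getOrCreate()
import spark.implicits._

// Batch-style: word count with the RDD API (placeholder path)
val counts = spark.sparkContext.textFile("hdfs://nn:8020/input")
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)

// Interactive-style: query the same result with Spark SQL
counts.toDF("word", "cnt").createOrReplaceTempView("word_counts")
spark.sql("SELECT word, cnt FROM word_counts ORDER BY cnt DESC LIMIT 10").show()

spark.stop()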

Characteristics of Spark

Efficient (10 to 100 times faster than MapReduce)

The in-memory computing engine provides a cache mechanism to support iterative computation and sharing data across multiple jobs, reducing the I/O cost of re-reading data. Note that Spark does not force all data into memory: by default data stays on disk, but it can be explicitly cached in memory and computed there.

The DAG (directed acyclic graph) execution engine avoids writing intermediate results to HDFS between successive computation steps

The multi-threaded pool model reduces task startup overhead, and unnecessary sort operations during shuffle are avoided, cutting disk I/O

Easy to use

Provides rich APIs in four languages: Java, Scala, Python, and R

Two to five times less code than MapReduce

Perfect integration with Hadoop

Reads and writes HDFS and HBase

Can be integrated with YARN

Spark Core Concepts

RDD: Resilient Distributed Dataset

Distributed: to the user an RDD looks like a single collection, but under the hood it is a read-only set of objects distributed across the cluster (made up of multiple partitions)

Resilient (flexible): the data can be stored either on disk or in memory (multiple storage levels)

  


Figure 4 RDD
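
A minimal sketch of the "distributed" half of the definition above, assuming an existing SparkContext named sc: the collection is split into partitions, and each operation runs as one task per partition.

val rdd = sc.parallelize(1 to 100000, numSlices = 8)   // a read-only collection split into 8 partitions
println(rdd.getNumPartitions)                          // 8: the partitions are spread across the cluster
println(rdd.map(_ * 2).count())                        // runs in parallel, one task per partition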

There are two basic kinds of operations on an RDD: Transformations and Actions

Transformation: builds an RDD from a Scala collection or a Hadoop data set, or derives a new RDD from an existing one, e.g. map, filter, groupBy, reduceByKey. Transformations are lazy: Spark only records the transformation relationships (lineage) between RDDs and does not trigger any real computation until an Action is encountered.

Action: computes one value or a set of values from an RDD, e.g. count, reduce, saveAsTextFile. An Action triggers distributed execution of the program (see Figure 5 and the sketch after it)

  


Figure 5 RDD lazy execution
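
Here is a minimal sketch of that lazy behaviour, assuming an existing SparkContext sc and placeholder HDFS paths: every transformation merely records lineage, and nothing executes until the final action.

val lines = sc.textFile("hdfs://nn:8020/input")             // transformation: lineage recorded, no I/O yet
val errors = lines.filter(_.contains("ERROR"))              // transformation: still nothing computed
val pairs = errors.map(line => (line.split("\t")(0), 1))    // transformation
val counts = pairs.reduceByKey(_ + _)                       // transformation
counts.saveAsTextFile("hdfs://nn:8020/output")              // action: the whole chain runs now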

Spark RDD Cache

Two of Spark's main characteristics come back to caching: its efficiency relies on the cache mechanism for iterative computation, and the resilience of RDDs comes from the fact that data can live either in memory or on disk. The Spark RDD cache lets an RDD be kept in memory or on disk so it can be reused:

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs://nn:8020/input")
data.cache()                                   // equivalent to data.persist(StorageLevel.MEMORY_ONLY)
// data.persist(StorageLevel.DISK_ONLY_2)      // alternative: two replicas, on disk only
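
Building on that snippet, here is a hedged sketch of why the cache matters: the first action materializes the cached partitions, and later actions reuse them instead of re-reading HDFS. The ERROR/timeout filters are made-up examples.

val errors = data.filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)             // spill to disk if memory is tight
println(errors.count())                                  // first action: reads HDFS and fills the cache
println(errors.filter(_.contains("timeout")).count())    // reuses the cached partitions
errors.unpersist()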

Spark Program Architecture

Looking next at the architecture of a Spark program: every program is made up of two kinds of components, a Driver and Executors. The main function runs in the Driver; the Driver turns the program into multiple Tasks, and each Task is scheduled onto some Executor.

  


Figure 6 Spark program architecture
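
As a minimal sketch of this split (the object name and numbers are made up): the main function below runs in the Driver, while the work inside the map and reduce closures is shipped to the Executors as tasks, one per partition.

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("simple-app")     // the master is supplied at submit time
    val sc = new SparkContext(conf)                         // this process is the Driver
    val sum = sc.parallelize(1 to 100, 4).map(_ * 2).reduce(_ + _)   // 4 partitions -> 4 tasks on Executors
    println(s"sum = $sum")                                  // printed by the Driver
    sc.stop()
  }
}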

Running modes of Spark

Spark has three running modes; you select one by setting spark.master in $SPARK_HOME/conf/spark-defaults.conf (see the example after this list).

Local mode: the Spark application runs locally in multiple threads, which is convenient for debugging

local: starts only one executor

local[K]: starts K executors

local[*]: starts as many executors as there are CPU cores

Standalone mode: Spark runs on its own cluster with a master-slave architecture

YARN/Mesos: runs on a resource management system

yarn-client: the Driver runs locally while the Executors run on YARN; if the Driver goes down you must restart it yourself, so fault tolerance is weak

yarn-cluster: the Driver runs inside the cluster (on a NodeManager); if the Driver goes down, another NodeManager is automatically chosen to restart it, so yarn-cluster is fault-tolerant
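
As mentioned above, the mode is selected through spark.master. A hedged example of what the relevant lines in $SPARK_HOME/conf/spark-defaults.conf might look like (the standalone host and port are placeholders; keep exactly one line uncommented):

# Local mode, with as many worker threads as CPU cores
spark.master   local[*]

# Standalone cluster (placeholder host and port)
# spark.master   spark://master-host:7077

# YARN (client vs. cluster deploy mode is chosen when the job is submitted)
# spark.master   yarn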

Yarn-client internal processing logic

In yarn-client mode the Driver runs on the client: the Spark program is submitted from the same machine the Driver runs on, i.e. the machine where the main function executes. What happens when you submit the program to YARN?

Assume a YARN cluster of four servers: one Resource Manager and three Node Managers. When a Node Manager starts, it registers itself with the Resource Manager. With the Hadoop environment deployed locally, you can submit the Spark application (1); the Resource Manager picks a Node Manager (2), which launches your Application Master (3). The Application Master knows what resources your program needs, so once started it communicates with the Resource Manager to request resources for the executors (4). If an executor dies, the Application Master asks the Resource Manager for a replacement and starts it on some Node Manager; if a whole Node Manager goes down, the Application Master requests the same number of executors from the Resource Manager and starts them elsewhere. Once the resources are granted, the Application Master talks to the Node Managers to launch the executors (5, 6). Each executor then communicates with the Driver on the client to fetch its tasks (dotted line).

  


Figure 7 Program running mode: YARN distributed mode (yarn-client)

Resource Manager: the central administrative service that decides which applications may start executor processes, and when and where they start

Node Manager: a slave service running on every node; it actually starts the Executor processes and monitors whether they are alive and how many resources they consume

Application Master: in YARN every application has an Application Master process, which runs in the application's first container. It requests resources from the Resource Manager and, once resources are allocated, instructs the Node Managers to start the containers.
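
To tie the walkthrough together, here is a hedged example of submitting an application in yarn-client mode with spark-submit; the jar name, class name, and resource numbers are placeholders. The executor count, memory, and cores are what the Application Master requests from the Resource Manager in steps (4) to (6).

spark-submit \
  --master yarn \
  --deploy-mode client \
  --class SimpleApp \
  --num-executors 3 \
  --executor-memory 2g \
  --executor-cores 2 \
  simple-app.jar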