1: What is Spark

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It was developed by the Algorithms, Machines, and People Lab (AMPLab) at the University of California, Berkeley, and can be used to build large-scale, low-latency data analysis applications. In June 2013, Spark entered the Apache incubator; eight months later it became an Apache Top-Level Project, a remarkably fast promotion. With its advanced design, Spark quickly became a popular project in the community.

Spark is implemented in Scala, an object-oriented, functional programming language that can manipulate distributed data sets as easily as local collection objects. Scala provides a concurrency model based on actors: rather than sharing data, actors send and receive asynchronous messages through their mailboxes, an approach known as the shared-nothing model.



Official website:

http://spark.apache.org/


Official documentation:

http://spark.apache.org/docs/latest/index.html


Source code address:

https://github.com/apache/spark

2: Spark features



1: Fast running speed


Spark has a DAG execution engine that supports iterative computation on data held in memory. When the data is read from disk, Spark is more than 10 times faster than Hadoop MapReduce; when it is read from memory, it can be more than 100 times faster.

2: Easy to use


Spark supports application programming not only in Scala but also in Java and Python. Scala is an efficient and extensible language that can express complex processing tasks in concise code.
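
As an illustration of how concise this can be, here is a minimal word-count sketch in Scala; the application name and input path are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch: a complete distributed word count in a few lines of Scala.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCountSketch").setMaster("local[*]"))
    sc.textFile("data/input.txt")        // hypothetical input file
      .flatMap(_.split("\\s+"))          // split lines into words
      .map(word => (word, 1))            // pair each word with a count of 1
      .reduceByKey(_ + _)                // sum the counts per word
      .collect()
      .foreach(println)
    sc.stop()
  }
}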

3: Versatility


Spark provides a number of libraries, including SQL, DataFrames, MLlib, GraphX, and Spark Streaming. Developers can combine these libraries seamlessly in the same application.





4: Supports multiple resource managers


Spark supports Hadoop YARN, Apache Mesos, and its own built-in Standalone cluster manager.

3: Introduction to the Spark ecosystem



The Spark ecosystem, also known as BDAS (Berkeley Data Analytics Stack), is a platform created by Berkeley's AMPLab to showcase big data applications through large-scale integration among Algorithms, Machines, and People. The ecosystem is shown in the figure below.

1: Spark Core

Spark Core implements Spark's basic functionality, including task scheduling, memory management, fault recovery, and interaction with storage systems. It also defines the API for Resilient Distributed Datasets (RDDs), which represent collections of elements distributed across multiple compute nodes that can be operated on in parallel; RDDs are Spark's main programming abstraction.
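
To make the abstraction concrete, here is a minimal RDD sketch in Scala; the object name and data are purely illustrative.

import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch of the RDD abstraction: a collection partitioned across nodes,
// operated on in parallel through transformations and actions.
object RddBasicsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddBasicsSketch").setMaster("local[*]"))
    val numbers = sc.parallelize(1 to 100, 4)   // an RDD split into 4 partitions
    // filter and map are transformations; reduce is an action that triggers the computation
    val evenSum = numbers.filter(_ % 2 == 0).map(_ * 10).reduce(_ + _)
    println(s"sum of even numbers x 10 = $evenSum")
    sc.stop()
  }
}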

2: Spark SQL

In essence, Spark SQL parses Hive HQL and translates it into RDD operations on Spark, using Hive metadata to obtain table information from the database; data and files in HDFS are read by Shark (Spark SQL's predecessor) and computed on Spark. Spark SQL supports multiple data sources, such as Hive and JSON.
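
A minimal sketch of querying a data source with Spark SQL follows, assuming the Spark 2.x SparkSession API; the JSON path and column names are hypothetical.

import org.apache.spark.sql.SparkSession

// A minimal Spark SQL sketch: load JSON into a DataFrame and query it with SQL.
object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSqlSketch").master("local[*]").getOrCreate()
    val people = spark.read.json("data/people.json")   // hypothetical JSON data source
    people.createOrReplaceTempView("people")            // expose the DataFrame to SQL
    // The SQL query is planned and executed as distributed operations on Spark
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()
    spark.stop()
  }
}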

3: Spark Streaming

Spark Streaming is a high-throughput, fault-tolerant system for processing real-time data streams. It can perform complex operations such as map, reduce, and join on data from sources like Kafka, Flume, Twitter, ZeroMQ, and TCP sockets, then save the results to an external file system or database, or feed them to a real-time dashboard. Spark Streaming provides an API for manipulating data streams that closely mirrors the RDD API, greatly reducing the barrier and cost of learning and development.

Spark Streaming decomposes a streaming computation into a series of short batch jobs, with Spark Core as the batch engine: the input data is divided into a discretized stream (DStream) according to the batch size (e.g., 1 second), and each batch becomes a Resilient Distributed Dataset (RDD) in Spark. Transformations on the DStream in Spark Streaming are then turned into transformations on the underlying RDDs in Spark, and intermediate RDD results are kept in memory. The overall streaming computation can accumulate intermediate results or store them in external systems, depending on the needs of the business.
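
The sketch below illustrates this model with a 1-second batch size and a TCP socket source; the host and port are placeholders, and other sources (Kafka, Flume, etc.) follow the same pattern.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A minimal Spark Streaming sketch: the stream is cut into 1-second batches,
// and each DStream transformation becomes an RDD transformation on every batch.
object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))       // batch size: 1 second
    val lines = ssc.socketTextStream("localhost", 9999)    // hypothetical TCP source
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()                                         // output each batch's result
    ssc.start()
    ssc.awaitTermination()
  }
}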







4: MLlib


MLlib is a machine learning library that provides a variety of algorithms for classification, regression, clustering, collaborative filtering, and more, all running on a cluster. Some of these algorithms can also be applied to streaming data, such as linear regression with ordinary least squares or K-means clustering (and more). Apache Mahout, a machine learning library for Hadoop, has likewise moved its computation from MapReduce to Spark.
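
As an example of the kind of algorithm MLlib provides, here is a minimal K-means clustering sketch using the RDD-based MLlib API; the input file and its format (one space-separated point per line) are assumptions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// A minimal MLlib sketch: K-means clustering over an RDD of dense vectors.
object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansSketch").setMaster("local[*]"))
    // Hypothetical input: one point per line, e.g. "1.0 2.0 3.0"
    val points = sc.textFile("data/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()
    val model = KMeans.train(points, 3, 20)   // k = 3 clusters, 20 iterations
    model.clusterCenters.foreach(println)
    sc.stop()
  }
}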

5: GraphX


GraphX is Spark's API for graphs and parallel graph computation; it can be viewed as a rewrite and optimization of GraphLab (C++) and Pregel (C++) on Spark (Scala). Compared with other distributed graph computing frameworks, GraphX's biggest contribution is that Spark provides a one-stack data solution, making it convenient and efficient to complete a full pipeline of graph computation jobs. GraphX started as a distributed graph computing framework project at Berkeley AMPLab and was later integrated into Spark as a core component.
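
A minimal GraphX sketch follows; the toy vertices, edges, and PageRank tolerance are purely illustrative.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// A minimal GraphX sketch: build a tiny property graph and run PageRank on it.
object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphXSketch").setMaster("local[*]"))
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
    val graph = Graph(vertices, edges)
    graph.pageRank(0.001).vertices.collect().foreach(println)   // (vertexId, rank) pairs
    sc.stop()
  }
}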

4: Application scenarios of Spark



1: Complex batch data processing, where the focus is on the ability to process massive amounts of data and the tolerable processing time typically ranges from tens of minutes to hours (similar to Hadoop MapReduce computation).


2: Scenarios where the data volume is not particularly large but real-time statistical analysis is required (real-time computing).


3: Spark is a memory-based iterative computing framework, suitable for applications that operate on the same data set multiple times. The more often the data is reused and the larger the amount of data read, the greater the benefit.


The officially recommended usage model is shown in the figure below.

5: Running modes of Spark



1: Local mode, commonly used for local development and testing. It is further divided into single-threaded Local mode and multi-threaded local-cluster mode.



2: Standalone mode, a typical master/slave architecture. The Master is clearly a single point of failure; Spark supports ZooKeeper-based HA to address this.



3: On YARN, a cluster mode that runs on the YARN resource manager framework: YARN is responsible for resource management, while Spark handles task scheduling and computation.



4: On Mesos, a cluster mode that runs on the Mesos resource manager framework: Mesos is responsible for resource management, while Spark is responsible for task scheduling and computation.



6: Basic principles of Spark



The Spark running framework is shown in the following figure: the Cluster Manager and the Worker Nodes run the job; each application has a Driver, its task-control node, and each Worker Node runs Executors that carry out the specific tasks.


First, the Driver program starts multiple Workers. The Workers load data from the file system and build RDDs from it (that is, the data is put into RDDs, which are data structures), caching the different partitions in memory.
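
A minimal driver-side sketch of that flow; the HDFS path, partition count, and master setting are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch of the driver-side flow: load data into a partitioned RDD,
// cache the partitions in memory, and reuse them across actions.
object DriverSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DriverSketch").setMaster("local[*]"))
    val lines = sc.textFile("hdfs:///data/input.txt", 4).cache()   // hypothetical path, 4 partitions
    println(lines.count())                                          // first action loads and caches
    println(lines.filter(_.contains("ERROR")).count())              // second action reuses the cache
    sc.stop()
  }
}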

7: Spark RDD introduction



RDDs are one of the core concepts of Spark (before version 2.0). RDD stands for Resilient Distributed Dataset: a dataset held in memory. An RDD is read-only and partitioned; all or part of it can be cached in memory and reused across multiple computations. The "resilient" part refers to the ability to spill to disk when memory is insufficient. Another key feature of RDDs is in-memory computing, that is, keeping data in memory. In addition, Spark gives us maximum freedom in dealing with memory capacity limits: for every dataset we can decide whether to cache it and how to cache it.
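
The caching choices mentioned above look roughly like this in code; the dataset and storage level are illustrative, with MEMORY_AND_DISK allowing partitions to spill to disk when memory runs short.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// A minimal sketch of controlling whether and how an RDD is cached.
object RddCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddCacheSketch").setMaster("local[*]"))
    val squares = sc.parallelize(1 to 1000000, 8).map(n => n.toLong * n)
    squares.persist(StorageLevel.MEMORY_AND_DISK)   // cache in memory, spill to disk if needed
    println(squares.reduce(_ + _))                  // first action computes and caches
    println(squares.count())                        // later actions reuse the cached partitions
    sc.stop()
  }
}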

8: Spark task submission



A spark-submit invocation can specify various parameters:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

The parameters are described as follows:


--class: the entry point of the application, i.e. the class containing the main method (e.g., org.apache.spark.examples.SparkPi)


--master: the master URL of the cluster, e.g., spark://127.0.0.1:7077


--deploy-mode: the deployment mode, either cluster or client. The default is client.


--conf: additional configuration properties in key=value format


application-jar: the path to the application JAR, which must be visible from everywhere in the cluster


application-arguments: arguments passed to the main method



More detailed parameter descriptions are available in the official documentation:

http://spark.apache.org/docs/latest/submitting-applications.html
