1. Spark Overview
1.1 Background
Spark is a big data parallel computing framework based on in-memory computing that can be used to build large-scale, low-latency data analysis applications.
It is one of the three most important open-source distributed computing projects of the Apache Software Foundation (Hadoop, Spark, Storm).
1.2 Features
- Fast execution
Spark uses a DAG execution engine and supports iterative computation on data held in memory. Official figures claim it can be more than 10 times faster than Hadoop MapReduce when reading from disk, and more than 100 times faster when reading from memory.
- Easy to use
Spark supports applications written not only in Scala but also in languages such as Java and Python. Scala in particular is an efficient and extensible language that can express complex processing tasks with concise code.
- Highly general
- Runs everywhere
1.3 Usage Trend
Figure: Google Trends for big data analytics applications.
2. Spark ecosystem
2.1 Comparison between Spark and Hadoop
2.2 Jobs
Hadoop: a MapReduce program is a Job. A Job contains one or more Tasks, which are divided into Map tasks and Reduce tasks.
Spark: the concept of a Job differs from Hadoop's. An Application is associated with one SparkContext, and each Application can run one or more Jobs, in parallel or serially. A Job is triggered by an Action and contains multiple Stages, which are divided at Shuffle boundaries; each Stage contains a Task Set made up of multiple Tasks.
2.3 Fault Tolerance
- Spark has better fault tolerance than Hadoop:
Spark introduces the abstraction of the Resilient Distributed Dataset (RDD). These datasets are resilient: if part of a dataset is lost, it can be reconstructed from its lineage, that is, from the record of how the data was derived.
- In addition, fault tolerance during RDD computation can be achieved with checkpoints, in two ways:
checkpointing the data (CheckPoint Data) or logging the updates (Logging The Updates); the user controls which of the two methods is used (see the sketch below).
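To make the two mechanisms concrete, here is a minimal Scala sketch, assuming a local SparkContext and a made-up checkpoint directory, that builds an RDD lineage and also checkpoints an intermediate RDD to stable storage:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CheckpointSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Placeholder directory (local or HDFS) where checkpoint data is materialized.
    sc.setCheckpointDir("/tmp/spark-checkpoint")

    val base    = sc.parallelize(1 to 1000)            // lineage starts here
    val derived = base.map(_ * 2).filter(_ % 3 == 0)   // recomputable from its lineage

    // Cut the lineage: if a partition of `derived` is lost later, it is reloaded
    // from the checkpoint instead of being recomputed from `base`.
    derived.checkpoint()

    println(derived.count())                           // action triggers computation + checkpoint
    sc.stop()
  }
}
```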
2.4 Generality
Spark is also more general than Hadoop:
- Hadoop provides only the Map and Reduce operations.
- Spark provides a rich set of dataset operations, divided into Transformations and Actions (a short sketch follows this list):
- Transformations include map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, and partitionBy.
- Actions include count, collect, reduce, lookup, and save.
- In addition, unlike Hadoop, the communication model between processing nodes is no longer limited to Shuffle. Users can name, materialize, and control the storage and partitioning of intermediate results.
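As a rough illustration (assuming a spark-shell session where a SparkContext named sc already exists), the sketch below contrasts lazy Transformations, which only describe new RDDs, with Actions, which actually trigger a job:

```scala
val nums = sc.parallelize(1 to 10)            // create an RDD

// Transformations: return new RDDs, nothing is computed yet
val doubled = nums.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)
val pairs   = evens.map(n => (n % 3, n))
val summed  = pairs.reduceByKey(_ + _)

// Actions: return values to the driver (or write data out), triggering execution
val all   = summed.collect()
val total = evens.reduce(_ + _)
val n     = nums.count()
```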
2.5 Practical Application
In practical applications, big data processing mainly includes the following three types:
- Complex batch data processing: typically time spans from tens of minutes to hours
- Interactive queries based on historical data: typically time spans from tens of seconds to minutes
- Data processing based on real-time data streams: typically time spans from hundreds of milliseconds to several seconds
At present, there are mature processing frameworks for the above three scenarios.
- For the first scenario, Hadoop MapReduce can be used to process massive amounts of data in batch.
- For the second scenario, Impala supports interactive queries.
- For the third scenario, Storm, a distributed processing framework, handles real-time streaming data.
Cost issues:
- The three frameworks above are relatively independent of one another, and each has a high maintenance cost, which may bring some problems:
- Input and output data cannot be seamlessly shared between different scenarios, and data format conversion is usually required
- Different software requires different development and maintenance teams, resulting in higher use costs
- It is difficult to coordinate and allocate resources uniformly for each system in the same cluster
Spark has emerged as a one-stop platform to meet these needs
2.6 Application scenarios of Spark Ecosystem Components
Application scenario | Time span | Other frameworks | Spark ecosystem component |
---|---|---|---|
Complex batch data processing | Hours | MapReduce, Hive | Spark |
Interactive queries on historical data | Seconds to minutes | Impala, Dremel, Drill | Spark SQL |
Processing of real-time data streams | Milliseconds to seconds | Storm, S4 | Spark Streaming |
Data mining on historical data | – | Mahout | MLlib |
Processing of graph-structured data | – | Pregel, Hama | GraphX |
2.7 Spark Components
2.7.1 Spark Core
- Contains Spark's basic functionality, including task scheduling, memory management, and fault tolerance.
- RDDs (Resilient Distributed Datasets) are defined in Spark Core. An RDD represents a dataset distributed across many worker nodes that can be processed in parallel.
- Spark Core provides many APIs for creating and manipulating these RDD collections.
2.7.2 Spark SQL
- A Spark library for processing structured data. It supports querying data through SQL, much like Hive SQL (HQL), and supports many data sources, such as Hive tables and JSON.
- Shark was an older Spark-based SQL project modified from Hive; it has since been replaced by Spark SQL.
2.7.3 Spark Streaming
- Real-time data stream processing component, similar to Storm
- Spark Streaming provides APIs for manipulating live data streams.
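A minimal word-count sketch of the Spark Streaming API is shown below. The local master and the text source on localhost:9999 (which could be started with a tool such as netcat) are illustrative assumptions, not part of the original text:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // Two local threads: one to receive data, one to process it.
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))    // 5-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999) // assumed text source
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()                                       // print each batch's word counts

    ssc.start()
    ssc.awaitTermination()
  }
}
```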
2.7.4 MLlib
- Spark has a general machine learning package called MLlib(Machine Learning Lib).
- MLlib includes classification, clustering, regression, and collaborative filtering algorithms, as well as model evaluation and data import.
- It also provides some low-level machine learning primitives, including general gradient descent optimization algorithms.
- In addition, scale-out across clusters is supported.
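As a small illustration of the RDD-based MLlib API, the sketch below clusters a few made-up two-dimensional points with K-means; it assumes an existing SparkContext named sc (for example, inside spark-shell):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Made-up sample points: two loose groups near (0,0) and (9,9).
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
  Vectors.dense(9.0, 9.1), Vectors.dense(8.9, 9.3)
))

// Cluster into k = 2 groups with at most 20 iterations.
val model = KMeans.train(points, 2, 20)

model.clusterCenters.foreach(println)               // learned cluster centers
println(model.predict(Vectors.dense(0.05, 0.1)))    // cluster assignment for a new point
```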
2.7.5 GraphX
- A library for processing graphs and performing parallel graph computation. Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API and allows the creation of directed graphs.
- GraphX provides various graph operations, such as subgraph and mapVertices, as well as common graph algorithms, such as PageRank.
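The sketch below shows roughly what building a small directed graph and running subgraph and PageRank looks like with GraphX. The vertex and edge data are invented for illustration, and an existing SparkContext named sc is assumed:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical vertices (id, name) and directed "follows" edges.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
))

val graph = Graph(vertices, edges)

// subgraph keeps only the part of the graph matching a predicate.
val withoutBob = graph.subgraph(vpred = (id, name) => name != "bob")

// Built-in PageRank, iterated until scores change by less than 0.001.
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)
```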
2.7.6 Cluster Managers
- Spark can run on many cluster managers, including Hadoop YARN, Apache Mesos, and Spark's own standalone scheduler.
- If you already have a Hadoop YARN or Mesos cluster, Spark's support for these cluster managers lets Spark applications run on them.
3. Spark running architecture
3.1 Basic Concepts
- RDD: short for Resilient Distributed Dataset, an abstraction of distributed memory that provides a highly restricted shared-memory model
- DAG: short for Directed Acyclic Graph; it reflects the dependencies between RDDs
- Executor: a process running on a Worker Node that runs Tasks
- Application: a Spark application written by the user
- Task: a unit of work that runs on an Executor
- Job: a Job contains multiple RDDs and the various operations applied to them
- Stage: the basic scheduling unit of a Job. A Job is divided into multiple groups of Tasks; each group is called a Stage (or TaskSet) and represents a set of associated tasks with no Shuffle dependencies among them
3.2 Architecture Design
- The Spark running architecture includes the Cluster Manager, the Worker Nodes that run job tasks, a Driver for each application, and the Executors that execute specific tasks on each Worker Node.
- The resource manager can be Mesos or YARN.
- Compared with the Hadoop MapReduce computing framework, Spark's Executor has two advantages:
- First, it uses multithreading to execute tasks, reducing task startup overhead.
- Second, each Executor has a BlockManager storage module that uses both memory and disk as storage devices, effectively reducing I/O overhead.
- An Application consists of one Driver and several Jobs, a Job consists of multiple Stages, and a Stage consists of multiple Tasks with no Shuffle dependencies among them.
- When an Application is executed, the Driver requests resources from the cluster manager, starts Executors, and sends application code and files to them; Tasks then run on the Executors. When execution is complete, the results are returned to the Driver or written to HDFS or another storage system.
3.3 Spark Basic Process
- First, the basic running environment is built for the application: the Driver creates a SparkContext, which applies for resources, assigns tasks, and monitors them.
- The resource manager allocates resources to the Executors and starts the Executor processes.
- SparkContext builds a DAG from the dependencies between RDDs and submits it to the DAGScheduler, which parses it into Stages. Each TaskSet is then submitted to the underlying TaskScheduler. Executors apply to SparkContext for Tasks, and the TaskScheduler dispatches Tasks to the Executors together with the application code.
- Tasks run on the Executors and report back to the TaskScheduler and then to the DAGScheduler; finally the data is written out and all resources are released.
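To relate this process to user code, here is a minimal driver-side sketch; the master URL and the data are placeholders. The SparkContext created in main() is the Driver's handle to the cluster: the action triggers Tasks on the Executors, and the result is returned to the Driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The master URL determines the cluster manager: "yarn", "mesos://...",
    // "spark://master:7077", or "local[*]" for testing (placeholder here).
    val conf = new SparkConf().setAppName("DriverSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)      // the Driver requests Executors via the cluster manager

    val data = sc.parallelize(1 to 100, numSlices = 4)  // 4 partitions -> 4 tasks per stage
    val sum  = data.map(_ * 2).reduce(_ + _)            // action: tasks run on the Executors

    println(sum)                           // the result comes back to the Driver
    sc.stop()                              // release Executors and other resources
  }
}
```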
Features:
- Each Application has its own dedicated Executor processes, which stay up for the duration of the Application and run Tasks in a multi-threaded manner.
- The Spark running process is independent of the resource manager: it only needs to be able to acquire Executor processes and keep communicating with them.
- Tasks use optimization mechanisms such as data locality and speculative execution.
3.4 Working Principles of Spark
3.4.1 RDDs
- Resilient Distributed Datasets (RDDs)
- RDDs are Spark's basic abstraction for distributing data and computation, and a core Spark concept.
- In Spark, all computation is expressed through the creation, transformation, and operation of RDDs.
- Every RDD has a lineage graph describing how it was derived.
- An RDD is an immutable distributed collection object made up of many partitions; each partition contains a subset of the data, and these partitions can be computed on different nodes in the cluster.
- The partition is Spark's unit of parallel processing.
3.4.2 Working Principle of the RDD
- The RDD provides a rich set of operations to support common data processing, divided into two types: Transformations and Actions.
- The transformation interfaces provided by the RDD are simple, coarse-grained operations such as map, filter, groupBy, and join, rather than fine-grained modifications to individual data items (so RDDs are not suitable for applications that need fine-grained updates, such as web crawlers).
- On the surface RDDs seem limited and not powerful enough, but RDDs have been shown to efficiently express the programming models of many frameworks, such as MapReduce, SQL, and Pregel.
- Spark implements the RDD API in Scala, and programmers can call this API to perform the various RDD operations.
The typical RDD execution sequence is as follows:
- RDD is created by reading an external data source
- The RDD undergoes a series of transformations. Each Transformation generates a different RDD for the next Transformation
- The last RDD is processed by an Action operation and the result is output to an external data source
- This sequence of processing is called a Lineage and corresponds to a topological ordering of the DAG
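A spark-shell style sketch of this create, transform, act sequence might look as follows; the HDFS paths and the log format are placeholders rather than anything from the original text:

```scala
val logs = sc.textFile("hdfs:///data/app.log")              // 1. create from an external source

val errors   = logs.filter(_.contains("ERROR"))             // 2. a chain of transformations,
val byModule = errors.map(line => (line.split(" ")(0), 1))  //    each producing a new RDD
val counts   = byModule.reduceByKey(_ + _)

counts.saveAsTextFile("hdfs:///data/error-counts")          // 3. an action triggers the whole lineage
```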
RDD running process:
- Create an RDD object.
- SparkContext is responsible for computing the dependencies between RDDs and building the DAG
- DAGScheduler is responsible for dividing the DAG into multiple Stages, each containing multiple Tasks; each Task is distributed by the TaskScheduler to an Executor on a Worker Node for execution.
3.4.3 Scala
Scala is a modern multi-paradigm programming language that runs on the Java platform (the JVM, Java Virtual Machine) and is compatible with existing Java programs.
Scala features:
- Scala's strong support for concurrency and functional programming makes it well suited to distributed systems.
- Scala has a concise syntax and provides an elegant API.
- Scala is compatible with Java, runs fast, and integrates into the Hadoop ecosystem.
- Scala is Spark's primary programming language, but Spark also supports Java, Python, and R.
- One advantage of Scala is its REPL (Read-Eval-Print Loop, an interactive interpreter), which improves development efficiency.
4. SparkSQL
At the Hive-compatibility level, Spark SQL relies on Hive only for HiveQL parsing and Hive metadata. In other words, once HiveQL is parsed into an abstract syntax tree (AST), everything from that point on is handled by Spark SQL, with Catalyst responsible for execution-plan generation and optimization.
- Spark SQL adds the SchemaRDD (an RDD with schema information) so that users can execute SQL statements in Spark SQL. Data can come from RDDs, from external data sources such as Hive, HDFS, and Cassandra, or from JSON files.
- Spark SQL supports Scala, Java, and Python, and supports the SQL-92 specification.
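In current Spark releases the SchemaRDD idea has evolved into the DataFrame/Dataset API. The sketch below, with a placeholder JSON path, shows the general shape of loading a structured data source and querying it with SQL through a SparkSession:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlSketch")
      .master("local[*]")                        // placeholder master for local testing
      .getOrCreate()

    // Placeholder path; the schema is inferred from the JSON records.
    val people = spark.read.json("people.json")

    people.createOrReplaceTempView("people")     // register the DataFrame for SQL queries
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()

    spark.stop()
  }
}
```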
5. Spark programming practices
5.1 Programming Environment
- Operating system: Linux (Ubuntu18.04 or Ubuntu16.04 recommended);
- Hadoop version: 3.1.3 or 2.7.1
- JDK version: 1.8;
- Hadoop pseudo-distributed configuration
- Spark 2.4.8 or self-compiled version
- Scala 2.11.8 or 2.8.0
5.2 Experimental Steps:
5.2.1 Spark Environment configuration
- Check the Java environment and the Hadoop environment.
- Download the installation packages.
Scala: www.scala-lang.org/download/al…
Spark: spark.apache.org/downloads.h…
Choose a package type on the Spark download page:
- Source Code: the Spark source code, which can be used only after compilation; you can freely set compilation options.
- Pre-built with user-provided Hadoop: the "Hadoop free" version, which can be used with any Hadoop version.
- Pre-built for Hadoop 2.7 and Pre-built for Hadoop 2.6: pre-compiled against Hadoop 2.7 and 2.6 respectively; they must match the Hadoop version installed on the local machine.
- Pre-built with Scala 2.12 and user-provided Apache Hadoop: a pre-compiled version using Scala 2.12 that can be used with any Hadoop version.
- Install Scala
Decompress the installation package (sudo tar -zxvf scala-2.11.8.tgz -C /usr/local/) and change the owning user and group of the scala directory to the current user and its group.
Configure the environment variables: add the corresponding bin directory to the PATH variable.
Make the environment variables take effect.
Check whether the installation succeeded. Done!
- Install Spark
Decompress the installation package (sudo tar -zxvf spark-2.4.8-bin-without-hadoop.tgz -C /usr/local/), change the owning user and user group, and rename the directory to spark-2.4.8 to simplify later configuration.
Change the owning user and user group.
Rename the directory to spark-2.4.8.
Configure the environment variables: add the SPARK_HOME variable and add the corresponding bin directory to the PATH variable:
export SPARK_HOME=/usr/local/spark-2.4.8
export PATH=$PATH:$SPARK_HOME/bin
Spark configuration file configuration:
Copy the spark-env.sh.template file to the spark-env.sh file:
The configuration is as follows:
Start Spark: Before starting Spark, start the HDFS
After the Spark workers are started, access Master:8080 to view the status of Spark workers.
Run spark-shell to start the Spark shell.
You may see an error like the following:
There is no need to panic: it does not affect using Scala, and it can be fixed by adding a system environment variable: export TERM=xterm-color
After that, the error no longer appears.
Run one of Spark's bundled examples with the spark-submit command. The examples are provided in $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.8.jar:
spark-submit --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.4.8.jar
Note: When running a SparkPi instance, many run logs are generated. You can filter the logs by adding the grep command to display the information you care about:
5.2.2 Writing Scala Code in the Spark Shell
(1) Create RDD_1, RDD_2, and RDD_3 from a local file, a file in HDFS, and the parallelize() method of SparkContext, respectively. The local file must contain multiple words per line, separated by spaces; the HDFS file contains one word per line, that is, the words are separated by newlines. Each RDD must contain your student number or name at least once.
1.1 Create in.txt locally, write the contents, and load it in Spark.
1.2 Create in0.txt locally, write the data, upload it to HDFS, check that the upload succeeded, and load it in Spark.
1.3 Create an RDD in Spark with parallelize(). Creation successful!
(2) Output the first line of RDD_1, all contents of RDD_2, and the maximum value of RDD_3.
2.1 First line of RDD_1
2.2 All contents of RDD_2
2.3 Maximum value of RDD_3
(3) Count the occurrences of the name pinyin and the student number in RDD_1.
Results: ZQC appears 6 times and 031904102 appears 4 times.
(4) Subtract RDD_2 from the deduplicated RDD_1.
(5) Union the above result with RDD_3, and write it to both the local file system and HDFS. Check whether the write succeeded.
(6) Write Scala code that writes arbitrary content to HDFS. The file path is user-defined and the file is named "student ID - name pinyin.txt". Create the file, then check HDFS.
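A rough spark-shell sketch of these steps is given below. The file paths, the HDFS address, the student number 031904102, and the name pinyin zqc are placeholders taken from the exercise description, not exact commands from the original screenshots:

```scala
// (1) create the three RDDs
val rdd1 = sc.textFile("file:///home/hadoop/in.txt")                 // local file, space-separated words
val rdd2 = sc.textFile("hdfs://localhost:9000/user/hadoop/in0.txt")  // HDFS file, one word per line
val rdd3 = sc.parallelize(List("zqc", "031904102", "spark", "hadoop"))

// (2) first line, all contents, maximum value
println(rdd1.first())
rdd2.collect().foreach(println)
println(rdd3.max())

// (3) count occurrences of the name pinyin and the student number in rdd1
val words = rdd1.flatMap(_.split(" "))
println(words.filter(_ == "zqc").count())
println(words.filter(_ == "031904102").count())

// (4) subtract rdd2 from the deduplicated rdd1
val diff = words.distinct().subtract(rdd2)

// (5) union with rdd3 and save to the local file system and to HDFS
val merged = diff.union(rdd3)
merged.saveAsTextFile("file:///home/hadoop/out_local")
merged.saveAsTextFile("hdfs://localhost:9000/user/hadoop/out_hdfs")
```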
5.2.3 Writing a Standalone Scala Application
Spark programs written in Scala need to be compiled and packaged with sbt. Spark does not ship with sbt, so it must be installed separately; you can download the latest sbt installation file from the official website.
After the download completes:
Create a directory
Run the following command to install SBT in the /usr/local/sbt directory:
Copy sbt-launch.jar from the bin directory to the SBT installation directory
Create a new file and write the following
#!/bin/bash
SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256M"
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"
After the file is saved, add execute permission to the shell script. Then you can run sbt sbtVersion to view the sbt version information. Done, although it is a little slow!
(1) Implement a wordCount program and write the result to a local file (a sketch of such a program follows this list). Create a directory locally.
Create the source file.
Write the data file.
Check the directory structure.
(2) Package the program with sbt.
(3) Run the generated JAR with spark-submit.
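A minimal WordCount program for this exercise might look like the following sketch; the object name and the input and output paths are placeholders. It would be compiled and packaged with sbt (together with a simple .sbt build definition declaring the Spark dependency) and then submitted with spark-submit:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // No master is set here; spark-submit supplies it at launch time.
    val conf = new SparkConf().setAppName("WordCount")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("file:///home/hadoop/sparkapp/word.txt")  // placeholder input
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("file:///home/hadoop/sparkapp/wordcount_output") // placeholder output
    sc.stop()
  }
}
```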
- Writing a Scala standalone application (packaged with Maven):
- (1) Generate an arbitrary RDD and write the result to a file.
Rename and set the permission group
On the terminal, run the following command to create a folder spark_zqc_maven_scala as the application root directory:
Write the following
(2) Package the above program with Maven. The program depends on the Spark API, so we need Maven to compile and package it. Create a new file pom.xml in the ./spark_zqc_maven_scala directory and add the following contents to declare the information about this standalone application and its dependency on Spark:
To make sure Maven works properly, first check the file structure of the entire application by executing the following command:
Next, we can package the entire application into a JAR with the following command (note: the computer must stay connected to the network; Maven automatically downloads the dependencies the first time you run the package command, which takes a few minutes):
(3) Execute the generated JAR through spark-submit.
That's all for this post.
Xiao Sheng Fan Yi, looking forward to your follow.