1. Spark Overview
1.1 Background
Spark is a big data parallel computing framework based on in-memory computing that can be used to build large-scale, low-latency data analysis applications.
It is one of the three most important open-source distributed computing projects of the Apache Software Foundation (Hadoop, Spark, Storm).
1.2 Features
- Fast execution
Spark uses a DAG execution engine and supports iterative computation on data held in memory. Official figures claim it can be more than 10 times faster than Hadoop MapReduce when reading from disk, and more than 100 times faster when reading from memory.
- Easy to use
Spark supports applications written not only in Scala but also in languages such as Java and Python. Scala in particular is an efficient and extensible language that can express complex processing tasks with concise code.
- Highly general
- Runs everywhere
1.3 Usage Trend
Figure: Google Trends for big data analytics applications.
2. Spark ecosystem
2.1 Comparison between Spark and Hadoop
2.2 Jobs
Hadoop: a MapReduce program is a Job. A Job contains one or more Tasks, which are divided into Map tasks and Reduce tasks.
Spark: the concept of a Job differs from Hadoop's. An Application is associated with one SparkContext, and each Application can run one or more Jobs, in parallel or serially. A Job is triggered by an Action and contains multiple Stages, which are divided at Shuffle boundaries; each Stage contains a Task Set made up of multiple Tasks.
2.3 Fault Tolerance
- Spark has better fault tolerance than Hadoop:
Spark introduces the abstraction of the Resilient Distributed Dataset (RDD). These datasets are resilient: if part of a dataset is lost, it can be reconstructed from its lineage, that is, from the record of how the data was derived.
- In addition, fault tolerance during RDD computation can be achieved with checkpoints, in two ways:
checkpointing the data (CheckPoint Data) or logging the updates (Logging The Updates); the user controls which of the two methods is used (see the sketch below).
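To make the two mechanisms concrete, here is a minimal Scala sketch, assuming a local SparkContext and a made-up checkpoint directory, that builds an RDD lineage and also checkpoints an intermediate RDD to stable storage:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CheckpointSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Placeholder directory (local or HDFS) where checkpoint data is materialized.
    sc.setCheckpointDir("/tmp/spark-checkpoint")

    val base    = sc.parallelize(1 to 1000)            // lineage starts here
    val derived = base.map(_ * 2).filter(_ % 3 == 0)   // recomputable from its lineage

    // Cut the lineage: if a partition of `derived` is lost later, it is reloaded
    // from the checkpoint instead of being recomputed from `base`.
    derived.checkpoint()

    println(derived.count())                           // action triggers computation + checkpoint
    sc.stop()
  }
}
```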
2.4 Generality
Spark is also more general than Hadoop:
- Hadoop provides only the Map and Reduce operations.
- Spark provides a rich set of dataset operations, divided into Transformations and Actions (a short sketch follows this list):
- Transformations include map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, and partitionBy.
- Actions include count, collect, reduce, lookup, and save.
- In addition, unlike Hadoop, the communication model between processing nodes is no longer limited to Shuffle. Users can name, materialize, and control the storage and partitioning of intermediate results.
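As a rough illustration (assuming a spark-shell session where a SparkContext named sc already exists), the sketch below contrasts lazy Transformations, which only describe new RDDs, with Actions, which actually trigger a job:

```scala
val nums = sc.parallelize(1 to 10)            // create an RDD

// Transformations: return new RDDs, nothing is computed yet
val doubled = nums.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)
val pairs   = evens.map(n => (n % 3, n))
val summed  = pairs.reduceByKey(_ + _)

// Actions: return values to the driver (or write data out), triggering execution
val all   = summed.collect()
val total = evens.reduce(_ + _)
val n     = nums.count()
```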
2.5 Practical Application
In practical applications, big data processing mainly includes the following three types:
- Complex batch data processing: typically time spans from tens of minutes to hours
- Interactive queries based on historical data: typically time spans from tens of seconds to minutes
- Data processing based on real-time data streams: typically time spans from hundreds of milliseconds to several seconds
At present, there are mature processing frameworks for the above three scenarios.
- For the first scenario, Hadoop MapReduce can be used to process massive amounts of data in batch.
- For the second scenario, Impala supports interactive queries.
- For the third scenario, Storm, a distributed processing framework, handles real-time streaming data.
Cost issues:
- The three frameworks above are relatively independent of one another, and each has a high maintenance cost, which may bring some problems:
- Input and output data cannot be seamlessly shared between different scenarios, and data format conversion is usually required
- Different software requires different development and maintenance teams, resulting in higher use costs
- It is difficult to coordinate and allocate resources uniformly for each system in the same cluster
Spark has emerged as a one-stop platform to meet these needs
2.6 Application scenarios of Spark Ecosystem Components
Application scenario | Time span | Other frameworks | Spark ecosystem component |
---|---|---|---|
Complex batch data processing | Hours | MapReduce, Hive | Spark |
Interactive queries on historical data | Seconds to minutes | Impala, Dremel, Drill | Spark SQL |
Processing of real-time data streams | Milliseconds to seconds | Storm, S4 | Spark Streaming |
Data mining on historical data | – | Mahout | MLlib |
Processing of graph-structured data | – | Pregel, Hama | GraphX |
2.7 Spark Components
2.7.1 Spark Core
- Contains Spark's basic functionality, including task scheduling, memory management, and fault tolerance.
- RDDs (Resilient Distributed Datasets) are defined in Spark Core. An RDD represents a dataset distributed across many worker nodes that can be processed in parallel.
- Spark Core provides many APIs for creating and manipulating these RDD collections.
2.7.2 Spark SQL
- A Spark library for processing structured data. It supports querying data through SQL, much like Hive SQL (HQL), and supports many data sources, such as Hive tables and JSON.
- Shark was an older Spark-based SQL project modified from Hive; it has since been replaced by Spark SQL.
2.7.3 Spark Streaming
- Real-time data stream processing component, similar to Storm
- Spark Streaming provides APIs for manipulating live data streams.
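A minimal word-count sketch of the Spark Streaming API is shown below. The local master and the text source on localhost:9999 (which could be started with a tool such as netcat) are illustrative assumptions, not part of the original text:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // Two local threads: one to receive data, one to process it.
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))    // 5-second micro-batches

    val lines  = ssc.socketTextStream("localhost", 9999) // assumed text source
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()                                       // print each batch's word counts

    ssc.start()
    ssc.awaitTermination()
  }
}
```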
2.7.4 MLlib
- Spark has a general machine learning package called MLlib(Machine Learning Lib).
- MLlib includes classification, clustering, regression, and collaborative filtering algorithms, as well as model evaluation and data import.
- It also provides some low-level machine learning primitives, including general gradient descent optimization algorithms.
- In addition, scale-out across clusters is supported.
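As a small illustration of the RDD-based MLlib API, the sketch below clusters a few made-up two-dimensional points with K-means; it assumes an existing SparkContext named sc (for example, inside spark-shell):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Made-up sample points: two loose groups near (0,0) and (9,9).
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2),
  Vectors.dense(9.0, 9.1), Vectors.dense(8.9, 9.3)
))

// Cluster into k = 2 groups with at most 20 iterations.
val model = KMeans.train(points, 2, 20)

model.clusterCenters.foreach(println)               // learned cluster centers
println(model.predict(Vectors.dense(0.05, 0.1)))    // cluster assignment for a new point
```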
2.7.5 GraphX
- A library for processing graphs and performing parallel graph computation. Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API and allows the creation of directed graphs.
- GraphX provides various graph operations, such as subgraph and mapVertices, as well as common graph algorithms, such as PageRank.
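The sketch below shows roughly what building a small directed graph and running subgraph and PageRank looks like with GraphX. The vertex and edge data are invented for illustration, and an existing SparkContext named sc is assumed:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical vertices (id, name) and directed "follows" edges.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
))

val graph = Graph(vertices, edges)

// subgraph keeps only the part of the graph matching a predicate.
val withoutBob = graph.subgraph(vpred = (id, name) => name != "bob")

// Built-in PageRank, iterated until scores change by less than 0.001.
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)
```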
2.7.6 Cluster Managers
- Spark can run on many cluster managers, including Hadoop YARN, Apache Mesos, and Spark's own standalone scheduler.
- If you already have a Hadoop YARN or Mesos cluster, Spark's support for these cluster managers lets Spark applications run on them.
3. Spark running architecture
3.1 Basic Concepts
- RDD: short for Resilient Distributed Dataset, an abstraction of distributed memory that provides a highly restricted shared-memory model
- DAG: short for Directed Acyclic Graph; it reflects the dependencies between RDDs
- Executor: a process running on a Worker Node that runs Tasks
- Application: a Spark application written by the user
- Task: a unit of work that runs on an Executor
- Job: a Job contains multiple RDDs and the various operations applied to them
- Stage: the basic scheduling unit of a Job. A Job is divided into multiple groups of Tasks; each group is called a Stage (or TaskSet) and represents a set of associated tasks with no Shuffle dependencies among them
3.2 Architecture Design
- The Spark running architecture includes the Cluster Manager, the Worker Nodes that run job tasks, a Driver for each application, and the Executors that execute specific tasks on each Worker Node.
- The resource manager can be Mesos or YARN.
- Compared with the Hadoop MapReduce computing framework, Spark's Executor has two advantages:
- First, it uses multithreading to execute tasks, reducing task startup overhead.
- Second, each Executor has a BlockManager storage module that uses both memory and disk as storage devices, effectively reducing I/O overhead.
- An Application consists of one Driver and several Jobs, a Job consists of multiple Stages, and a Stage consists of multiple Tasks with no Shuffle dependencies among them.
- When an Application is executed, the Driver requests resources from the cluster manager, starts Executors, and sends application code and files to them; Tasks then run on the Executors. When execution is complete, the results are returned to the Driver or written to HDFS or another storage system.
3.3 Spark Basic Process
- First, the basic running environment is built for the application: the Driver creates a SparkContext, which applies for resources, assigns tasks, and monitors them.
- The resource manager allocates resources to the Executors and starts the Executor processes.
- SparkContext builds a DAG from the dependencies between RDDs and submits it to the DAGScheduler, which parses it into Stages. Each TaskSet is then submitted to the underlying TaskScheduler. Executors apply to SparkContext for Tasks, and the TaskScheduler dispatches Tasks to the Executors together with the application code.
- Tasks run on the Executors and report back to the TaskScheduler and then to the DAGScheduler; finally the data is written out and all resources are released.
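To relate this process to user code, here is a minimal driver-side sketch; the master URL and the data are placeholders. The SparkContext created in main() is the Driver's handle to the cluster: the action triggers Tasks on the Executors, and the result is returned to the Driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The master URL determines the cluster manager: "yarn", "mesos://...",
    // "spark://master:7077", or "local[*]" for testing (placeholder here).
    val conf = new SparkConf().setAppName("DriverSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)      // the Driver requests Executors via the cluster manager

    val data = sc.parallelize(1 to 100, numSlices = 4)  // 4 partitions -> 4 tasks per stage
    val sum  = data.map(_ * 2).reduce(_ + _)            // action: tasks run on the Executors

    println(sum)                           // the result comes back to the Driver
    sc.stop()                              // release Executors and other resources
  }
}
```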
Features:
- Each Application has its own dedicated Executor processes, which stay up for the duration of the Application and run Tasks in a multi-threaded manner.
- The Spark running process is independent of the resource manager: it only needs to be able to acquire Executor processes and keep communicating with them.
- Tasks use optimization mechanisms such as data locality and speculative execution.
3.4 Working Principles of Spark
3.4.1 RDDs
- Resilient Distributed Datasets (RDDs)
- RDDs are Spark's basic abstraction for distributing data and computation, and a core Spark concept.
- In Spark, all computation is expressed through the creation, transformation, and operation of RDDs.
- Every RDD has a lineage graph describing how it was derived.
- An RDD is an immutable distributed collection object made up of many partitions; each partition contains a subset of the data, and these partitions can be computed on different nodes in the cluster.
- The partition is Spark's unit of parallel processing.
3.4.2 Working Principle of the RDD
- The RDD provides a rich set of operations to support common data processing, divided into two types: Transformations and Actions.
- The transformation interfaces provided by the RDD are simple, coarse-grained operations such as map, filter, groupBy, and join, rather than fine-grained modifications to individual data items (so RDDs are not suitable for applications that need fine-grained updates, such as web crawlers).
- On the surface RDDs seem limited and not powerful enough, but RDDs have been shown to efficiently express the programming models of many frameworks, such as MapReduce, SQL, and Pregel.
- Spark implements the RDD API in Scala, and programmers can call this API to perform the various RDD operations.
The typical RDD execution sequence is as follows:
- RDD is created by reading an external data source
- The RDD undergoes a series of transformations. Each Transformation generates a different RDD for the next Transformation
- The last RDD is processed by an Action operation and the result is output to an external data source
- This sequence of processing is called a Lineage and corresponds to a topological ordering of the DAG
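A spark-shell style sketch of this create, transform, act sequence might look as follows; the HDFS paths and the log format are placeholders rather than anything from the original text:

```scala
val logs = sc.textFile("hdfs:///data/app.log")              // 1. create from an external source

val errors   = logs.filter(_.contains("ERROR"))             // 2. a chain of transformations,
val byModule = errors.map(line => (line.split(" ")(0), 1))  //    each producing a new RDD
val counts   = byModule.reduceByKey(_ + _)

counts.saveAsTextFile("hdfs:///data/error-counts")          // 3. an action triggers the whole lineage
```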
RDD running process:
- Create an RDD object.
- SparkContext is responsible for computing the dependencies between RDDs and building the DAG
- DAGScheduler is responsible for dividing the DAG into multiple Stages, each containing multiple Tasks; each Task is distributed by the TaskScheduler to an Executor on a Worker Node for execution.
3.4.3 Scala
Scala is a modern multi-paradigm programming language that runs on the Java platform (the JVM, Java Virtual Machine) and is compatible with existing Java programs.
Scala features:
- Scala's strong support for concurrency and functional programming makes it well suited to distributed systems.
- Scala has a concise syntax and provides an elegant API.
- Scala is compatible with Java, runs fast, and integrates into the Hadoop ecosystem.
- Scala is Spark's primary programming language, but Spark also supports Java, Python, and R.
- One advantage of Scala is its REPL (Read-Eval-Print Loop, an interactive interpreter), which improves development efficiency.
4. SparkSQL
At the Hive-compatibility level, Spark SQL relies on Hive only for HiveQL parsing and Hive metadata. In other words, once HiveQL is parsed into an abstract syntax tree (AST), everything from that point on is handled by Spark SQL, with Catalyst responsible for execution-plan generation and optimization.
- Spark SQL adds the SchemaRDD (an RDD with schema information) so that users can execute SQL statements in Spark SQL. Data can come from RDDs, from external data sources such as Hive, HDFS, and Cassandra, or from JSON files.
- Spark SQL supports Scala, Java, and Python, and supports the SQL-92 specification.
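In current Spark releases the SchemaRDD idea has evolved into the DataFrame/Dataset API. The sketch below, with a placeholder JSON path, shows the general shape of loading a structured data source and querying it with SQL through a SparkSession:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlSketch")
      .master("local[*]")                        // placeholder master for local testing
      .getOrCreate()

    // Placeholder path; the schema is inferred from the JSON records.
    val people = spark.read.json("people.json")

    people.createOrReplaceTempView("people")     // register the DataFrame for SQL queries
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()

    spark.stop()
  }
}
```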
5. Spark programming practices
5.1 Programming Environment
- Operating system: Linux (Ubuntu18.04 or Ubuntu16.04 recommended);
- Hadoop version: 3.1.3 or 2.7.1
- JDK version: 1.8;
- Hadoop pseudo-distributed configuration
- Spark 2.4.8 or self-compiled version
- Scala 2.11.8 or 2.8.0
5.2 Experimental Steps:
5.2.1 Spark Environment configuration
- Check the Java environment and the Hadoop environment.
- Download the installation packages.
Scala: www.scala-lang.org/download/al…
Spark: spark.apache.org/downloads.h…
Choose a package type on the Spark download page:
- Source Code: the Spark source code, which can be used only after compilation; you can freely set compilation options.
- Pre-built with user-provided Hadoop: the "Hadoop free" version, which can be used with any Hadoop version.
- Pre-built for Hadoop 2.7 and Pre-built for Hadoop 2.6: pre-compiled against Hadoop 2.7 and 2.6 respectively; they must match the Hadoop version installed on the local machine.
- Pre-built with Scala 2.12 and user-provided Apache Hadoop: a pre-compiled version using Scala 2.12 that can be used with any Hadoop version.
- Install Scala
Decompress the installation package (sudo tar -zxvf scala-2.11.8.tgz -C /usr/local/) and change the owning user and group of the scala directory to the current user and its group.
Configure the environment variables: add the corresponding bin directory to the PATH variable.
Make the environment variables take effect.
Check whether the installation succeeded. Done!
- Install Spark
Decompress the installation package (sudo tar -zxvf spark-2.4.8-bin-without-hadoop.tgz -C /usr/local/), change the owning user and user group, and rename the directory to spark-2.4.8 to simplify later configuration.
Change the owning user and user group.
Rename the directory to spark-2.4.8.
Configure the environment variables: add the SPARK_HOME variable and add the corresponding bin directory to the PATH variable:
export SPARK_HOME=/usr/local/spark-2.4.8
export PATH=$PATH:$SPARK_HOME/bin
Spark configuration file configuration:
Copy the spark-env.sh.template file to the spark-env.sh file:
The configuration is as follows:
Start Spark: Before starting Spark, start the HDFS
After the Spark workers are started, access Master:8080 to view the status of Spark workers.
Run spark-shell to start the Spark shell.
You may see an error like the following:
There is no need to panic: it does not affect using Scala, and it can be fixed by adding a system environment variable: export TERM=xterm-color
After that, the error no longer appears.
Run one of Spark's bundled examples with the spark-submit command. The examples are provided in $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.8.jar:
spark-submit --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.4.8.jar
Note: When running a SparkPi instance, many run logs are generated. You can filter the logs by adding the grep command to display the information you care about:
5.2.2 Writing Scala Code in the Spark Shell
(1) Create RDD_1, RDD_2, and RDD_3 from a local file, a file in HDFS, and the parallelize() method of SparkContext, respectively. The local file must contain multiple words per line, separated by spaces; the HDFS file contains one word per line, that is, the words are separated by newlines. Each RDD must contain your student number or name at least once.
1.1 Create in.txt locally, write the contents, and load it in Spark.
1.2 Create in0.txt locally, write the data, upload it to HDFS, check that the upload succeeded, and load it in Spark.
1.3 Create an RDD in Spark with parallelize(). Creation successful!
(2) Output the first line of RDD_1, all contents of RDD_2, and the maximum value of RDD_3.
2.1 First line of RDD_1
2.2 All contents of RDD_2
2.3 Maximum value of RDD_3
(3) Count the occurrences of the name pinyin and the student number in RDD_1.
Results: ZQC appears 6 times and 031904102 appears 4 times.
(4) Subtract RDD_2 from the deduplicated RDD_1.
(5) Union the above result with RDD_3, and write it to both the local file system and HDFS. Check whether the write succeeded.
(6) Write Scala code that writes arbitrary content to HDFS. The file path is user-defined and the file is named "student ID - name pinyin.txt". Create the file, then check HDFS.
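A rough spark-shell sketch of these steps is given below. The file paths, the HDFS address, the student number 031904102, and the name pinyin zqc are placeholders taken from the exercise description, not exact commands from the original screenshots:

```scala
// (1) create the three RDDs
val rdd1 = sc.textFile("file:///home/hadoop/in.txt")                 // local file, space-separated words
val rdd2 = sc.textFile("hdfs://localhost:9000/user/hadoop/in0.txt")  // HDFS file, one word per line
val rdd3 = sc.parallelize(List("zqc", "031904102", "spark", "hadoop"))

// (2) first line, all contents, maximum value
println(rdd1.first())
rdd2.collect().foreach(println)
println(rdd3.max())

// (3) count occurrences of the name pinyin and the student number in rdd1
val words = rdd1.flatMap(_.split(" "))
println(words.filter(_ == "zqc").count())
println(words.filter(_ == "031904102").count())

// (4) subtract rdd2 from the deduplicated rdd1
val diff = words.distinct().subtract(rdd2)

// (5) union with rdd3 and save to the local file system and to HDFS
val merged = diff.union(rdd3)
merged.saveAsTextFile("file:///home/hadoop/out_local")
merged.saveAsTextFile("hdfs://localhost:9000/user/hadoop/out_hdfs")
```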
5.2.3 Writing a Standalone Scala Application
Spark programs written in Scala need to be compiled and packaged with sbt. Spark does not ship with sbt, so it must be installed separately; you can download the latest sbt installation file from the official website.
After the download completes:
Create a directory
Run the following command to install SBT in the /usr/local/sbt directory:
Copy sbt-launch.jar from the bin directory to the SBT installation directory
Create a new file and write the following
#!/bin/bash
SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256M"
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"
After the file is saved, add execute permission to the shell script. Then you can run sbt sbtVersion to view the sbt version information. Done, although it is a little slow!
(1) Implement a wordCount program and write the result to a local file (a sketch of such a program follows this list). Create a directory locally.
Create the source file.
Write the data file.
Check the directory structure.
(2) Package the program with sbt.
(3) Run the generated JAR with spark-submit.
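A minimal WordCount program for this exercise might look like the following sketch; the object name and the input and output paths are placeholders. It would be compiled and packaged with sbt (together with a simple .sbt build definition declaring the Spark dependency) and then submitted with spark-submit:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // No master is set here; spark-submit supplies it at launch time.
    val conf = new SparkConf().setAppName("WordCount")
    val sc   = new SparkContext(conf)

    val counts = sc.textFile("file:///home/hadoop/sparkapp/word.txt")  // placeholder input
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("file:///home/hadoop/sparkapp/wordcount_output") // placeholder output
    sc.stop()
  }
}
```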
- Writing a Scala standalone application (packaged with Maven):
- (1) Generate an arbitrary RDD and write the result to a file.
Rename and set the permission group
On the terminal, run the following command to create a folder spark_zqc_maven_scala as the application root directory:
Write the following
(2) Package the above program with Maven. The program depends on the Spark API, so we need Maven to compile and package it. Create a new file pom.xml in the ./spark_zqc_maven_scala directory and add the following contents to declare the information about this standalone application and its dependency on Spark:
To make sure Maven works properly, first check the file structure of the entire application by executing the following command:
Next, we can package the entire application into a JAR with the following command (note: the computer must stay connected to the network; Maven automatically downloads the dependencies the first time you run the package command, which takes a few minutes):
(3) Execute the generated JAR through spark-submit.
That's all for this post.
Xiao Sheng Fan Yi, looking forward to your follow.