I. Overview of Spark
1. Introduction to Spark
Spark is a memory-based, general-purpose, scalable cluster computing engine designed for large-scale data processing. It implements an efficient DAG execution engine that processes data in memory, giving it significantly higher computing speed than MapReduce.
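As a rough illustration of the memory-based model, a chain of transformations only builds the DAG; once the result is cached, later actions reuse the in-memory data instead of re-reading the source. A minimal Java sketch (the class name CacheDemo and the path /tmp/input.txt are only illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CacheDemo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Transformations only describe the DAG; nothing runs until an action is called.
            JavaRDD<String> errors = sc.textFile("/tmp/input.txt")
                                       .filter(line -> line.contains("ERROR"))
                                       .cache(); // keep the filtered data in memory
            // Both actions below reuse the cached result instead of re-reading the file.
            System.out.println("error lines: " + errors.count());
            System.out.println("distinct error lines: " + errors.distinct().count());
        }
    }
}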
2. Runtime architecture
Driver
Runs the main() function of the Spark Application and creates the SparkContext. The SparkContext communicates with the ClusterManager to apply for resources, allocate tasks, and monitor execution.
ClusterManager
Applies for and manages the resources required to run the application on the worker nodes. It can scale computation efficiently from one node to thousands of nodes, and includes Spark's native standalone ClusterManager, Apache Mesos, and Hadoop YARN.
Executor
A process running on a WorkerNode that is responsible for running tasks and keeping data in memory or on disk. Each Application has its own independent set of Executors.
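A rough sketch of how these pieces fit together from the application side: the Driver declares which ClusterManager to contact and what Executor resources it needs through SparkConf, and creating the SparkContext triggers the resource request (the master URL and resource values below are only examples):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DriverSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("DriverSketch")
                // Which ClusterManager to ask for resources: Spark standalone here;
                // "yarn" or "mesos://..." would target Hadoop YARN or Apache Mesos instead.
                .setMaster("spark://hop01:7077")
                // Resources requested for each Executor on the worker nodes (example values).
                .set("spark.executor.memory", "1g")
                .set("spark.executor.cores", "1");
        // Creating the SparkContext in the Driver sends the resource request
        // to the ClusterManager, which launches Executors on the worker nodes.
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            System.out.println("application id: " + sc.sc().applicationId());
        }
    }
}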
II. Environment deployment
1. Scala environment
Installation Package Management
[root@hop01 opt]# tar -zxvf scala-2.12.2.tgz
[root@hop01 opt]# mv scala-2.12.2 scala2.12
Configure environment variables
[root@hop01 opt]# vim /etc/profile
export SCALA_HOME=/opt/scala2.12
export PATH=$PATH:$SCALA_HOME/bin
[root@hop01 opt]# source /etc/profile
Check the version
[root@hop01 opt]# scala -version
The Scala environment needs to be deployed on the relevant service nodes that Spark runs on.
2. Spark basic environment
Installation Package Management
[root@hop01 opt]# tar -zxvf spark-2.1.1-bin-hadoop2.7.tgz
[root@hop01 opt]# mv spark-2.1.1-bin-hadoop2.7 spark2.1
Copy the code
Configure environment variables
[root@hop01 opt]# vim /etc/profile
export SPARK_HOME=/opt/spark2.1
export PATH=$PATH:$SPARK_HOME/bin
[root@hop01 opt]# source /etc/profile
Check the version
[root@hop01 opt]# spark-shell
3. Configure the Spark cluster
Service nodes
[root@hop01 opt]# cd /opt/spark2.1/conf/
[root@hop01 conf]# cp slaves.template slaves
[root@hop01 conf]# vim slaves
hop01
hop02
hop03
Environment configuration
[root@hop01 conf]# cp spark-env.sh.template spark-env.sh
[root@hop01 conf]# vim spark-env.sh
export JAVA_HOME=/opt/jdk1.8
export SCALA_HOME=/opt/scala2.12
export SPARK_MASTER_IP=hop01
export SPARK_LOCAL_IP=IP address of the installation node
export SPARK_WORKER_MEMORY=1g
export HADOOP_CONF_DIR=/opt/hadoop2.7/etc/hadoop
Note that SPARK_LOCAL_IP must be set to the IP address of the node where the configuration is deployed.
4. Start Spark
Spark depends on the Hadoop environment, so start Hadoop first.
Start: /opt/spark2.1/sbin/start-all.sh
Stop:  /opt/spark2.1/sbin/stop-all.sh
Two processes are started on the master node, Master and Worker, while the other nodes start only a Worker process.
5. Access the Spark cluster
The default port number is 8080.
http://hop01:8080/
Run a basic example:
[root@hop01 spark2.1]# cd /opt/spark2.1/
[root@hop01 spark2.1]# bin/spark-submit --class org.apache.spark.examples.SparkPi --master local examples/jars/spark-examples_2.11-2.1.1.jar
Run result:
Pi is roughly 3.1455357276786384
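SparkPi estimates Pi with a Monte Carlo simulation: random points are generated in parallel and the fraction that falls inside the unit circle approximates Pi/4. A simplified Java sketch of the same idea (the class name and sample count are illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PiSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("PiSketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            int samples = 1_000_000; // illustrative sample count
            List<Integer> seeds = new ArrayList<>();
            for (int i = 0; i < samples; i++) {
                seeds.add(i);
            }
            // Count random points that fall inside the unit circle.
            long inside = sc.parallelize(seeds)
                    .filter(i -> {
                        double x = Math.random() * 2 - 1;
                        double y = Math.random() * 2 - 1;
                        return x * x + y * y <= 1;
                    })
                    .count();
            System.out.println("Pi is roughly " + 4.0 * inside / samples);
        }
    }
}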
III. Development case
1. Core dependencies
Depend on Spark 2.1.1:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
Introduce the Scala compiler plugin:
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
2. Case code development
Read the file at a specified location and output word-count statistics for the file contents.
import java.io.Serializable;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

import scala.Tuple2;

@RestController
public class WordWeb implements Serializable {

    @GetMapping("/word/web")
    public String getWeb() {
        // 1. Create the Spark configuration object
        SparkConf sparkConf = new SparkConf().setAppName("LocalCount")
                                             .setMaster("local[*]");
        // 2. Create the SparkContext object
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        sc.setLogLevel("WARN");
        // 3. Read the test file
        JavaRDD<String> lineRdd = sc.textFile("/var/spark/test/word.txt");
        // 4. Split each line into words
        JavaRDD<String> wordsRdd = lineRdd.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String line) throws Exception {
                String[] words = line.split(",");
                return Arrays.asList(words).iterator();
            }
        });
        // 5. Mark each word with a count of 1
        JavaPairRDD<String, Integer> wordAndOneRdd = wordsRdd.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<>(word, 1);
            }
        });
        // 6. Count the occurrences of each word
        JavaPairRDD<String, Integer> wordAndCountRdd = wordAndOneRdd.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer count1, Integer count2) throws Exception {
                return count1 + count2;
            }
        });
        // 7. Sort the result by key
        JavaPairRDD<String, Integer> sortedRdd = wordAndCountRdd.sortByKey();
        List<Tuple2<String, Integer>> finalResult = sortedRdd.collect();
        // 8. Print the result
        for (Tuple2<String, Integer> tuple2 : finalResult) {
            System.out.println(tuple2._1 + "===>" + tuple2._2);
        }
        // 9. Save the statistics
        sortedRdd.saveAsTextFile("/var/spark/output");
        sc.stop();
        return "success";
    }
}
Package and run the application, then view the file output:
[root@hop01 output]# vim /var/spark/output/part-00000
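Since the Spark Java API functions used above (FlatMapFunction, PairFunction, Function2) are functional interfaces, the anonymous inner classes can also be written as Java 8 lambdas. A compact, standalone sketch of the same word-count pipeline (the class name WordCountLambda is illustrative; the file path is the one used above):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCountLambda {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LambdaCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            sc.setLogLevel("WARN");
            JavaPairRDD<String, Integer> counts = sc.textFile("/var/spark/test/word.txt")
                    .flatMap(line -> Arrays.asList(line.split(",")).iterator()) // split lines into words
                    .mapToPair(word -> new Tuple2<>(word, 1))                   // mark each word with 1
                    .reduceByKey(Integer::sum)                                  // count occurrences
                    .sortByKey();                                               // sort by word
            counts.collect().forEach(t -> System.out.println(t._1 + "===>" + t._2));
        }
    }
}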
Source code address
GitHub: https://github.com/cicadasmile/big-data-parent
Gitee: https://gitee.com/cicadasmile/big-data-parent