0 Related source code

1 Spark environment installation

◆ Spark is written in Scala, provides APIs for several languages, and needs a JVM to run

◆ Official prebuilt Spark packages are available, so there is no need to compile it yourself

◆ Spark is easy to install and configure, and it does not require a Hadoop environment

  • Download

  • Unpack the archive

tar zxvf spark-2.4.1-bin-hadoop2.7.tgz
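As a rough sketch of the unpack step, the snippet below wraps the same tar call in a small helper that extracts into a chosen directory and warns when no JVM is on the PATH. The helper name unpack_spark and the INSTALL_DIR default are assumptions made for illustration; only the archive name comes from the command above.

```shell
# Archive name as used in this article; INSTALL_DIR is an assumed location.
SPARK_ARCHIVE="spark-2.4.1-bin-hadoop2.7.tgz"
INSTALL_DIR="${INSTALL_DIR:-$HOME/opt}"

# Hypothetical helper: extract archive $1 into directory $2, creating it if needed.
unpack_spark() {
  mkdir -p "$2"
  # z = gunzip, x = extract, f = read from file; -C extracts inside the target
  tar -zxf "$1" -C "$2"
}

# Spark runs on the JVM, so warn early if no java binary is on the PATH.
if ! command -v java >/dev/null 2>&1; then
  echo "warning: java not found on PATH" >&2
fi

# Unpack only when the downloaded archive is in the current directory.
if [ -f "$SPARK_ARCHIVE" ]; then
  unpack_spark "$SPARK_ARCHIVE" "$INSTALL_DIR"
  ls "$INSTALL_DIR/spark-2.4.1-bin-hadoop2.7"
fi
```

Extracting into a dedicated directory keeps the installation path predictable for the later configuration steps.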

2 Spark configuration

Before configuring anything, read the official documentation first instead of going straight to configuration tutorials found online

◆ Set the amount of memory each node may use; otherwise node utilization may be low

◆ Set Spark's IP address and port number to avoid UnknownHostException
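For illustration, these two points might translate into a few lines of conf/spark-env.sh like the ones below. The variable names are standard Spark environment variables; the values (2g, 127.0.0.1, 7077) are assumptions for a small single-machine setup and should be adapted to the actual environment.

```shell
# conf/spark-env.sh (sketch; all values are assumed examples)

# Memory a standalone worker may hand out to executors on this node
export SPARK_WORKER_MEMORY=2g

# Address Spark binds to on this machine; a hostname that cannot be
# resolved is a typical trigger for UnknownHostException
export SPARK_LOCAL_IP=127.0.0.1

# Host and port of the standalone master (7077 is Spark's default port)
export SPARK_MASTER_HOST=127.0.0.1
export SPARK_MASTER_PORT=7077
```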

Configuration on the official website

  • Apply the default configuration

  • The configuration files

  • Copy the two templates to enable your own configuration
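The template step could be scripted roughly as follows. This is a sketch under the assumption that the two templates meant here are spark-env.sh.template and spark-defaults.conf.template, both of which ship in the conf directory of the Spark distribution; the helper name enable_spark_conf and the SPARK_HOME fallback are hypothetical.

```shell
# Hypothetical helper: activate the two configuration templates in directory $1.
enable_spark_conf() {
  conf_dir="$1"
  for name in spark-env.sh spark-defaults.conf; do
    # Copy only when the template exists and no active file would be overwritten
    if [ -f "$conf_dir/$name.template" ] && [ ! -f "$conf_dir/$name" ]; then
      cp "$conf_dir/$name.template" "$conf_dir/$name"
    fi
  done
}

# Assumes SPARK_HOME points at the unpacked distribution (falls back to ".")
enable_spark_conf "${SPARK_HOME:-.}/conf"
```

Copying rather than renaming keeps the original templates around as a reference for the available settings.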

Single-machine environment configuration

  • The local IP

Shell verification

bin/spark-shell

3 Spark shell

◆ The Spark shell is a bash script in the ./bin directory

◆ The Spark shell creates the context and the session for us

  • The context instance

  • The session instance

  • UI
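One way to confirm that the context and session are really there is to pipe a few Scala statements into the shell, as sketched below. Inside spark-shell the context is available as sc and the session as spark; the SPARK_HOME default, the local[2] master and the choice of README.md as a sample input are assumptions.

```shell
# Assumed install location; adjust to where the archive was unpacked.
SPARK_HOME="${SPARK_HOME:-$HOME/opt/spark-2.4.1-bin-hadoop2.7}"

# Scala statements to run inside the shell; sc and spark already exist there.
SHELL_CHECK='println(sc.master)
println(spark.version)
println(sc.textFile("README.md").count())'

if [ -x "$SPARK_HOME/bin/spark-shell" ]; then
  # README.md is resolved relative to the working directory, so run from SPARK_HOME
  (cd "$SPARK_HOME" && printf '%s\n' "$SHELL_CHECK" | bin/spark-shell --master "local[2]")
else
  echo "spark-shell not found under $SPARK_HOME" >&2
fi
```

In an interactive session the web UI is served on port 4040 by default, so it can usually be opened at http://localhost:4040 while the shell is running.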

4 WordCount in practice

4.1 Introduction to WordCount

◆ WordCount, or word frequency counting, is the most basic task in big data analysis.

First, extract all the words in the file; then merge identical words and count how often each one occurs

  • Implementation diagram
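The two steps can be tried out without Spark at all. The sketch below is only an analogy built from standard Unix tools, not the implementation used in this article: splitting the text into words plays the role of Spark's flatMap, and sort followed by uniq -c plays the role of reduceByKey. The function name word_count and the sample text are made up for illustration.

```shell
# word_count FILE: print one "word<TAB>count" line per distinct word.
word_count() {
  # Step 1 (extract): replace every run of non-letters with a newline,
  # giving one word per line, then lowercase and drop empty lines.
  # Step 2 (merge): sort puts identical words next to each other,
  # and uniq -c collapses each group into one line with its count.
  tr -cs '[:alpha:]' '\n' < "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | grep -v '^$' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Usage with a small sample file created on the fly
sample=$(mktemp)
printf 'Spark makes big data simple\nbig data, big clusters\n' > "$sample"
word_count "$sample"
rm -f "$sample"
```

For the sample text this reports big with a count of 3, data with 2, and every other word with 1.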

Project setup

  • Add the Spark JAR packages

  • To select the JAR packages, left-click the first one, hold Shift, then left-click the last one to select them all.

  • Create a new class

  • The test file

`pwd`/`ls |grep L`

  • Write the function

  • Successful run

  • Packaging

  • Remove these extra JAR packages

  • Build

  • Put the JAR package into the spark/bin directory and run it using spark-submit
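The final step might look roughly like the following. The --class and --master options are standard spark-submit options, but the JAR name wordcount.jar, the main class WordCount, the SPARK_HOME default and the local[2] master are all hypothetical placeholders, since the article does not name them.

```shell
# Assumed install location; adjust to where the archive was unpacked.
SPARK_HOME="${SPARK_HOME:-$HOME/opt/spark-2.4.1-bin-hadoop2.7}"

# Hypothetical artifact and entry point produced by the packaging step
APP_JAR="$SPARK_HOME/bin/wordcount.jar"
MAIN_CLASS="WordCount"

if [ -x "$SPARK_HOME/bin/spark-submit" ] && [ -f "$APP_JAR" ]; then
  # --class names the main class inside the JAR, --master selects where to run
  "$SPARK_HOME/bin/spark-submit" --class "$MAIN_CLASS" --master "local[2]" "$APP_JAR"
else
  echo "spark-submit or $APP_JAR not found; adjust the paths above" >&2
fi
```

If the class lives in a package, MAIN_CLASS has to be the fully qualified name.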

Spark machine learning practice series

  • Spark-based machine learning practice (I) – Introduction to machine learning
  • Spark-based machine learning practice (II) – Introduction to MLlib
  • Spark-based machine learning practice (III) – Environment setup in practice