A previous article, "Run your first Hadoop program quickly!", covered Hadoop. This article introduces Spark, another computing engine framework: we will set up the development environment, write and run code, and use a simple WordCount case to show how a Spark program is built and run.

Spark Environment Construction

First, build the environment

Prerequisite: the JDK is already installed (I am using JDK 1.8, since most companies are still on it). It was installed earlier on my machine, so the steps are not repeated here.

To do a good job, you must first sharpen your tools; the environment matters! When code that is written correctly still refuses to run, the cause is most often an environment problem (a version mismatch, for example), so environment installation is especially important. The versions installed here are Scala 2.12 and Spark 3.0.0; keep them the same as mine, otherwise unexpected errors may occur.

1. Install Scala 2.12.11

Step 1: Check out the installable version of Scala in BREW

brew search scala 

Step 2: Install the specified version of Scala

brew install scala@2.12

Installation path:

/usr/local/Cellar/scala@2.12/2.12.12

Step 3: Configure the Scala environment variables by executing the following command:

echo 'export PATH="/usr/local/opt/scala@2.12/bin:$PATH"' >> /Users/zxb/.bash_profile
source ~/.bash_profile   # make the above configuration take effect

Note: /Users/zxb/.bash_profile is the absolute path of your environment variable configuration file; adjust it for your own machine.

  • Verify Scala

After installing Scala as above, run scala in the terminal to verify that it is installed successfully.
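The original screenshot is not reproduced here; roughly, a successful launch drops you into the Scala REPL and looks like the following sketch (the JVM name and Java build number will match your own machine, 1.8.0_xxx is just a placeholder):

$ scala
Welcome to Scala 2.12.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_xxx).
Type in expressions for evaluation. Or try :help.

scala>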

If the Scala REPL prompt appears as above, the installation and configuration of Scala 2.12 is complete!

2. Install Spark

Step 1: Download the Spark 3.0.0 installation package. The download link is here:

Link: pan.baidu.com/s/1Qhlmw8Yj… Password: s2j7

Step 2: Decompress the Spark installation package to a specified directory. (You are advised to install big-data-related software in the same directory for easy management.)
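The article does not show the exact command; as a sketch, decompression might look like the following (the archive name and target directory are assumptions that mirror the paths used in the next step, so adjust them to your own setup):

mkdir -p /Users/Bigdata
tar -zxvf spark-3.0.0-bin-hadoop3.2.tgz -C /Users/Bigdata/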

Step 3: Run spark-shell to verify the installation

Go to the bin directory in the main installation directory

cd /Users/Bigdata/spark-3.0.0-bin-hadoop3.2/bin

Run spark-shell:
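From inside the bin directory this is simply the following (a sketch, assuming the directory layout above):

./spark-shell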

If the Spark startup banner and a scala> prompt appear (including a message like "Spark context available as 'sc'"), Spark is installed successfully.

3. Configure the IDEA environment

  • Install the Scala plug-in

Open the Plugins settings in IDEA, search for Scala in the plug-in marketplace, and install it. The download can be slow.

You can also choose an offline installation. The offline installation package is here:

Link: pan.baidu.com/s/12A5FY6xL… Password: PFGB

Second, local execution

1. Run WordCount

Go to the bin directory under the Spark installation directory, create an input directory, create an inputword text file inside it and write some words into it, then start spark-shell and run the word count, as sketched below.
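The original spark-shell screenshot is not reproduced here; a minimal sketch of the steps, assuming the data file is input/inputword and the words are separated by spaces, would be:

mkdir input
echo "hello spark hello scala" > input/inputword
./spark-shell

Then, at the scala> prompt:

sc.textFile("input/inputword").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)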

The (word, count) results are then printed to the console, counting the number of occurrences of each word.

2. Official example

Note: this example is submitted as a jar package via the spark-submit command.

Go to the bin directory of the spark file directory and run the following command

spark-submit --class org.apache.spark.examples.SparkPi --master local[2] ../examples/jars/spark-examples_2.12-3.0.0.jar 10

Result: the console prints an approximation of Pi (a line like "Pi is roughly 3.14..."). Here --master local[2] runs the job locally with 2 threads, and the trailing 10 is the number of slices passed to the SparkPi example.

3. Run the Spark program in IDEA

So far everything has been run from the Spark console terminal, but day-to-day development usually happens in a professional IDE, most commonly IDEA. Next we demonstrate how to develop a Spark program with IDEA, again using a simple WordCount program.

3.1 IDEA project creation and configuration

  • Create a regular Maven project and click Next

Then edit the project information

  • Create a module

To make the code easier to manage later, we split it into modules by category.

First delete the default src directory, then choose New -> Module. As above, select Maven and fill in the module information.

  • Add the Scala SDK

File -> Project Structure -> Global Libraries, click the "+" icon at the top and add the locally installed Scala SDK.

  • Add framework support

Right-click the project -> Add Framework Support -> Scala. If the Scala SDK has already been imported, it will show the Scala SDK you just configured.
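One more preparation step worth calling out: the module needs the Spark dependency in its pom.xml. This is not shown in the original text; a minimal sketch of the dependency block, with coordinates matching the Scala 2.12 / Spark 3.0.0 versions used above, would be:

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.0.0</version>
    </dependency>
</dependencies>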

3.2 Develop the Spark program

package com.dongci777.bigdata.spark.core.wc

import org.apache.spark.{SparkConf, SparkContext}

/**
 * @author: ZXB
 * @date: 2020/12/20 1:45 am
 */
object WordCount3 {
  def main(args: Array[String]): Unit = {
    // Run locally with the application name "WordCount"
    val conf = new SparkConf().setMaster("local").setAppName("WordCount")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    // Read the files under datas/wordcount, one line per element
    val lines = sc.textFile("datas/wordcount")
    // Split each line into space-separated words
    val words = lines.flatMap(_.split(" "))
    // Map each word to a (word, 1) pair and sum the counts per word
    val wordToOne = words.map(word => (word, 1))
    val wordCount = wordToOne.reduceByKey(_ + _)
    // Collect the results to the driver and print them
    val array = wordCount.collect()
    array.foreach(println)
  }
}

Before writing the code and running the program, let's configure the log4j.properties file so that we can see the results clearly and suppress unnecessary log output.

  • Log4j configuration properties

Create a new file named log4j.properties under the resources directory and edit it as follows:

log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent
# UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

Then run the program and check the output:
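The original output screenshot is not included; as an illustration, if the data under datas/wordcount were the single line "hello spark hello scala", the console would print tuples roughly like (the order may vary):

(hello,2)
(spark,1)
(scala,1)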

If the (word, count) results are printed, the Spark program executed successfully!