Preface
In one sentence, what you can learn from this article: how to package your first Scala program and run it on the Spark cluster created in the previous article.
In more detail:
Docker part
- Continue to practice common Docker operations, such as mapping ports, mounting directories, and passing variables;
- Continue the in-depth study of Dockerfiles, becoming familiar with the ARG, ENV, RUN, WORKDIR, CMD and other directives.
Scala part
- Learn basic Scala syntax and write the first Spark application in Scala;
- Package the application with SBT and a build configuration file.
Spark part
- Submit the written Scala application with spark-submit, specifying the main class with --class.
Configure the Scala runtime environment
Installation
Spark itself is written in Scala, so Scala is used to write the program here.
Continue writing the Dockerfile based on the openjdk image mentioned in the previous article:
#
# Scala and sbt Dockerfile
#
# https://github.com/spikerlabs/scala-sbt (based on https://github.com/hseeberger/scala-sbt)
#
# Pull base image
FROM openjdk:8-alpine
ARG SCALA_VERSION
ARG SBT_VERSION
ENV SCALA_VERSION ${SCALA_VERSION:-2.12.8}
ENV SBT_VERSION ${SBT_VERSION:-1.2.7}
RUN \
    echo "$SCALA_VERSION $SBT_VERSION" && \
    mkdir -p /usr/lib/jvm/java-1.8-openjdk/jre && \
    touch /usr/lib/jvm/java-1.8-openjdk/jre/release && \
    apk add --no-cache bash && \
    apk add --no-cache curl && \
    curl -fsL http://downloads.typesafe.com/scala/$SCALA_VERSION/scala-$SCALA_VERSION.tgz | tar xfz - -C /usr/local && \
    ln -s /usr/local/scala-$SCALA_VERSION/bin/* /usr/local/bin/ && \
    scala -version && \
    scalac -version
RUN \
curl -fsL https://github.com/sbt/sbt/releases/download/v$SBT_VERSION/sbt-$SBT_VERSION.tgz | tar xfz - -C /usr/local && \
$(mv /usr/local/sbt-launcher-packaging-$SBT_VERSION /usr/local/sbt || true) \
ln -s /usr/local/sbt/bin/* /usr/local/bin/ && \
sbt sbt-version || sbt sbtVersion || true
WORKDIR /project
CMD "/usr/local/bin/sbt"
Note the two ARG parameters at the beginning of the Dockerfile, SCALA_VERSION and SBT_VERSION; they can be specified by the user at build time.
Then build the image from the Dockerfile:
# Note the trailing "." -- it means the current directory
docker build -t vinci/scala-sbt:latest \
    --build-arg SCALA_VERSION=2.12.8 \
    --build-arg SBT_VERSION=1.2.7 \
    .
This may take a while, so please be patient.
Test
Create a new temporary interactive container to test:
docker run -it --rm vinci/scala-sbt:latest /bin/bash
Enter scala -version and sbt sbtVersion in turn.
If the following information is displayed in the container, the installation is successful:
bash-4.4# scala -version
Scala code runner version 2.12.8 -- Copyright 2002-2018, LAMP/EPFL and Lightbend, Inc.
bash-4.4# sbt sbtVersion
[warn] No sbt.version set in project/build.properties, base directory: /local
[info] Set current project to local (in build file:/local/)
[info] 1.2.7
Mount local files
To be able to access our local files, we need to mount a volume from our working directory to a location inside the running container.
We simply add the -v option to the docker run command, as follows:
mkdir -p /root/docker/projects/MyFirstScalaSpark
cd /root/docker/projects/MyFirstScalaSpark
docker run -it --rm -v `pwd`:/project vinci/scala-sbt:latest
Note:
- pwd refers to the current directory (on the Linux virtual machine: /root/docker/projects/MyFirstScalaSpark);
- /project is the directory it is mapped to inside the container;
- /bin/bash is no longer appended, so you are dropped directly into the SBT console. If you look closely at the Dockerfile above, the last line specifies the default command to execute (sbt) and the penultimate line specifies the working directory (/project).
After it starts successfully, the following message is returned:
[root@localhost project]# docker run -it --rm -v `pwd`:/project vinci/scala-sbt:latest
[warn] No sbt.version set in project/build.properties, base directory: /local
[info] Set current project to local (in build file:/local/)
[info] sbt server started at local:///root/.sbt/1.0/server/05a53a1ec23bec1479e9/sock
sbt:local>
The first program
Configure the environment
Now it’s time to start writing your first Spark application.
But notice the [warn] in the output of the previous section: it appears because no sbt version has been set for the project, which is a configuration-file issue (addressed below).
In the project directory we created, create build.sbt with the following contents:
name := "MyFirstScalaSpark"
version := "0.1.0"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"
This gives us a minimal project definition.
Note: we specify the Scala version as 2.11.12 because Spark 2.4.0 is built against Scala 2.11, while the Scala installed in the container is 2.12. In the SBT console, run the reload command to refresh the SBT project with the new build settings.
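The [warn] about sbt.version comes from the missing project/build.properties file, which sbt uses to pin the sbt version for a project. Creating it is optional; as a minimal sketch (assuming you want to pin the sbt 1.2.7 that was installed in the image), add a file at project/build.properties under the project directory with this single line:
sbt.version=1.2.7
After running reload, the warning should no longer appear.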
Write the code
Open an SSH connection to the CentOS host.
Create the source directory and the source file:
mkdir -p /root/docker/projects/MyFirstScalaSpark/src/main/scala/com/example
cd /root/docker/projects/MyFirstScalaSpark/src/main/scala/com/example
vim MyFirstScalaSpark.scala
The file contents are as follows:
package com.example

import org.apache.spark.sql.SparkSession

object MyFirstScalaSpark {
  def main(args: Array[String]) {
    // SPARK_HOME is expected to be set in the container's environment
    val SPARK_HOME = sys.env("SPARK_HOME")
    val logFile = s"${SPARK_HOME}/README.md"
    val spark = SparkSession.builder
      .appName("MyFirstScalaSpark")
      .getOrCreate()
    // Count the lines of Spark's README.md that contain "a" and "b"
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
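One design choice worth noting: the builder above never calls .master(...). That is intentional, because the master URL will be supplied by the --master option of spark-submit later in this article. Purely for comparison, a hypothetical local-mode variant (not part of this walkthrough; the object name and the hard-coded local[*] master are assumptions of this sketch) could set the master explicitly and run without a cluster:
package com.example

import org.apache.spark.sql.SparkSession

// Hypothetical local-mode variant: hard-codes the master URL instead of
// taking it from spark-submit, so it can run without the Spark cluster.
object MyFirstScalaSparkLocal {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("MyFirstScalaSparkLocal")
      .master("local[*]") // use all local CPU cores
      .getOrCreate()
    // Same word-count logic as above, against Spark's bundled README.md
    val logFile = sys.env("SPARK_HOME") + "/README.md"
    val logData = spark.read.textFile(logFile).cache()
    println(s"Lines with a: ${logData.filter(_.contains("a")).count()}")
    spark.stop()
  }
}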
Packaging
Enter the SBT console in the container (as before) and type:
package
After a while, sbt reports a successful build, indicating that the jar has been packaged.
Submit a task
The packaged jar is under the /root/docker/projects/MyFirstScalaSpark/target/scala-2.11 directory.
Start the Spark cluster (see the first article):
cd /root/docker/spark
docker-compose up --scale spark-worker=2
Start the Spark client container:
cd /root/docker/projects/MyFirstScalaSpark
docker run --rm -it -e SPARK_MASTER="spark://spark-master:7077" \
-v `pwd`:/project --network spark_spark-network \
vinci/spark:latest /bin/bash
Submit a task
Go into the Spark client container and enter the following command:
spark-submit --master $SPARK_MASTER \
    --class com.example.MyFirstScalaSpark \
    /project/target/scala-2.11/myfirstscalaspark_2.11-0.1.0.jar
Result output:
Lines with a: 62, Lines with b: 31
The operation succeeds.
This concludes the chapter.