Abstract: This article describes how to build a Spark cluster development environment based on Jupyter Notebook.

This article is shared from the Huawei Cloud community post "The Spark Cluster Development Environment Based on Jupyter Notebook" by Apr Pengpeng.

I. Concept introduction:

1. Sparkmagic: a tool for working in Jupyter Notebook that interacts with a remote Spark cluster through Livy, a Spark REST server. The Sparkmagic project includes a framework for interactively running Spark code in multiple languages, as well as several kernels that let the code in a Jupyter Notebook run in the Spark environment.

2. Livy: an open-source REST service for Spark. It can submit code snippets or serialized binary code to a Spark cluster for execution over REST. It provides the following basic functions: submitting Scala, Python, or R code snippets to a remote Spark cluster for execution; submitting Spark jobs written in Java, Scala, or Python to a remote Spark cluster for execution; and submitting batch applications to run in the cluster.
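For illustration, here is a minimal sketch of what this REST interaction looks like from the command line (the host localhost:8998, the session id 0, and the snippet itself are assumptions; adjust them to your deployment):

curl -X POST -H "Content-Type: application/json" -d '{"kind": "pyspark"}' http://localhost:8998/sessions
# wait until the session state reported by GET /sessions/0 is "idle", then submit a snippet
curl -X POST -H "Content-Type: application/json" -d '{"code": "sc.parallelize(range(100)).count()"}' http://localhost:8998/sessions/0/statements
# poll the statement for its result
curl http://localhost:8998/sessions/0/statements/0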

II. The basic framework is shown in the following figure:

III. Preparation:

Prepare a Spark cluster. You can build one yourself or use a Huawei Cloud service such as MRS, and install the Spark client on the cluster. On a separate node (which can be a Docker container or a virtual machine), install Jupyter Notebook and Livy. The Livy installation package can be downloaded from: livy.incubator.apache.org/download/
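Downloading and unpacking Livy might look like the following; the exact URL and version are assumptions, so pick the package that matches your environment from the download page above:

wget https://archive.apache.org/dist/incubator/livy/0.7.0-incubating/apache-livy-0.7.0-incubating-bin.zip
unzip apache-livy-0.7.0-incubating-bin.zip -d /opt/workspace/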

IV. Configure and start Livy:

Modify livy.conf. For reference, see: enterprise-docs.anaconda.com/en/latest/a…

Add the following configuration:

livy.spark.master = yarn
livy.spark.deploy-mode = cluster
livy.impersonation.enabled = false
livy.server.csrf-protection.enabled = false
livy.server.launch.kerberos.keytab = /opt/workspace/keytabs/user.keytab
livy.server.launch.kerberos.principal = miner
livy.superusers = miner

Modify livy-env.sh to configure environment variables such as SPARK_HOME and HADOOP_CONF_DIR:

export JAVA_HOME=/opt/Bigdata/client/JDK/jdk
export HADOOP_CONF_DIR=/opt/Bigdata/client/HDFS/hadoop/etc/hadoop
export SPARK_HOME=/opt/Bigdata/client/Spark2x/spark
export SPARK_CONF_DIR=/opt/Bigdata/client/Spark2x/spark/conf
export LIVY_LOG_DIR=/opt/workspace/apache-livy-0.7.0-incubating-bin/logs
export LIVY_PID_DIR=/opt/workspace/apache-livy-0.7.0-incubating-bin/pids
export LIVY_SERVER_JAVA_OPTS="-Djava.security.krb5.conf=/opt/Bigdata/client/KrbClient/kerberos/var/krb5kdc/krb5.conf -Dzookeeper.server.principal=zookeeper/hadoop.hadoop.com -Djava.security.auth.login.config=/opt/Bigdata/client/HDFS/hadoop/etc/hadoop/jaas.conf -Xmx128m"

Start Livy:

./bin/livy-server start
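Livy listens on port 8998 by default (configurable through livy.server.port). A quick way to verify the server is up is to list its sessions:

curl http://localhost:8998/sessions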

V. Install Jupyter Notebook and SparkMagic

Jupyter Notebook is an open-source and widely used project, so the installation process won't be covered here.

Sparkmagic is a kernel for Jupyter Notebook and can be installed with pip install sparkmagic. Make sure gcc, python-dev, and libkrb5-dev are available before the installation; if not, install them with apt-get install or yum install. The installation generates the file $HOME/.sparkmagic/config.json, which is the key configuration file for Sparkmagic and is compatible with the Spark configuration. The key configurations are shown in the following figure.
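On a Debian-based node, the installation steps described above might look like this (use yum install on RPM-based systems):

apt-get install -y gcc python-dev libkrb5-dev
pip install sparkmagic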

The URL is the IP address and port of the Livy service; both HTTP and HTTPS are supported.
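As a minimal sketch, the relevant part of $HOME/.sparkmagic/config.json looks roughly like this (the host, port, and auth values are illustrative assumptions):

{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://livy-host:8998",
    "auth": "None"
  }
}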

VI. Add the SparkMagic kernel

PYTHON3_KERNEL_DIR="$(jupyter kernelspec list | grep -w "python3" | awk '{print $2}')"
KERNELS_FOLDER="$(dirname "${PYTHON3_KERNEL_DIR}")"
SITE_PACKAGES="$(pip show sparkmagic | grep -w "Location" | awk '{print $2}')"
cp -r ${SITE_PACKAGES}/sparkmagic/kernels/pysparkkernel ${KERNELS_FOLDER}
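After copying, the new kernel should show up in the kernel list:

jupyter kernelspec list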

VII. Run Spark code in Jupyter Notebook:
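With the PySpark kernel added above selected in a notebook, the code in a cell is shipped to the remote cluster through Livy instead of running locally. A minimal cell might look like this (the snippet is an illustrative assumption; sc is the SparkContext provided by the remote session):

numbers = sc.parallelize(range(1000))
print(numbers.sum())

Sparkmagic also ships magics such as %%info, which shows the state of the current Livy sessions.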

VIII. Query current session logs with Livy:
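Besides the notebook, session logs can be fetched directly from Livy's REST API; the host and session id 0 below are illustrative assumptions:

curl http://localhost:8998/sessions/0/log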
