0 / instructions
PySpark ships with Spark. On Linux, once Spark is installed, run the pyspark command in the bin directory of the installation directory to enter the interactive PySpark environment.
1 / download
Download Spark from the official Apache Spark website: https://spark.apache.org/downloads.html, or from a mirror such as the Tsinghua University mirror: https://mirrors.tuna.tsinghua.edu.cn/
2/ Upload the file to the Linux server from the local PC
Run the rz command and select spark-3.1.1-bin-hadoop3.2.tgz to upload it
3 / uncompress
tar -zxvf spark-3.1.1-bin-hadoop3.2.tgz, which generates a spark-3.1.1-bin-hadoop3.2 directory
4/ Set environment variables
In the .bashrc file, add the following (adjust the paths to your own situation; the example below uses Spark 2.1.0, and the py4j version must match the file actually present under $SPARK_HOME/python/lib):
export SPARK_HOME=/home/hadoop/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:/home/hadoop/spark-2.1.0-bin-hadoop2.7/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH
5/ Make environment variables take effect immediately
source .bashrc
6/ Start the PySpark shell
Go to the installation directory spark-3.1.1-bin-hadoop3.2/bin/ and run ./pyspark
7/ There are two ways to program with PySpark
<1> SparkContext
SparkContext is the entry point to any Spark functionality. When we run a Spark application, a driver program starts, which contains the main function; a SparkContext is initialized there, and the actual operations of the program are executed on the worker nodes.
Here are the parameters of SparkContext:
- master - the URL of the cluster it connects to.
- appName - the name of your job.
- sparkHome - the Spark installation directory.
- pyFiles - .zip or .py files to send to the cluster and add to PYTHONPATH.
- environment - worker node environment variables.
- batchSize - the number of Python objects represented as a single Java object. Set 1 to disable batching, 0 to choose the batch size automatically based on object sizes, or -1 to use an unlimited batch size.
- serializer - the RDD serializer.
- conf - an L{SparkConf} object that sets all Spark properties.
- gateway - use an existing gateway and JVM, or initialize a new JVM.
- jsc - a JavaSparkContext instance.
- profiler_cls - a custom Profiler class used for performance profiling (the default is pyspark.profiler.BasicProfiler).
Of these parameters, master and appName are the most commonly used. The first two lines of any PySpark program look like this:
from pyspark import SparkContext
sc = SparkContext("local", "First App")