1/ Jupyter Notebook’s current computing types include
<1> Local mode (i.e., standalone mode)
Based on the platform instance's resources, the Python 2, Python 3, R, and other kernels perform computation locally (i.e., single-machine mode). By default, 10G and 20G packages are provided, which limits the amount of data that can be processed (processing no more than about one million rows is recommended). For some modeling users whose model training requires a large sample size, a 50G package is also provided. In other words, although this is single-machine mode, three packages are available: 10G, 20G, and 50G.
<2> If a larger amount of data needs to be processed, distributed computing is recommended (the job is submitted to the Spark distributed computing engine).
New approach (recommended): create a Notebook directly with the Python kernel and submit Spark jobs through the integrated myutil methods.
Old approach (not recommended): create a script file with the PySpark kernel and rely on the Livy proxy to submit Spark jobs.
<3> To summarize <1> and <2> above:
There are only two computing types: single-machine mode and Spark distributed computing engine mode. Single-machine mode is suitable for small amounts of data; distributed mode is suitable for large amounts of data.
2/ Local computing (standalone mode)
Single-machine mode is not covered here because it is not the focus of this article.
3/ Spark distributed computing engine: submitting Spark jobs
<1> Submit Spark jobs using the Python 3 kernel + the integrated myutil methods
The benefits of the new Spark session generation mode are as follows. Create a Notebook (Jupyter instance) with the Python 3 kernel; once the instance's network mode is switched to host, a Spark session can be created directly inside the Python kernel and can communicate with the distributed processing system. The new mode supports both the Python 2 and Python 3 kernels. In the original mode, the PySpark kernel communicates with the distributed system through the Livy proxy, and the Python version on Livy only supports 2.x. With the right parameter configuration, the Python kernel can obtain information such as the current user's development group and configure the job automatically once the user specifies the computing priority.
<2> Create a Spark session using the myutil integrated with the Python kernel
A spark_session is like a house: with the house you can do everything you want inside it, and without it you can do nothing (the house is the environment). In the Python 3 kernel, instead of creating sessions through the PySpark kernel, use the new get_sparksession() method provided by myutil.
import myutil
# arguments are not required because they have default values
spark_session = myutil.get_sparksession(app_name, priority, conf)
# The get_sparksession() function takes three arguments; none is required and all have default values.
# app_name : name of the Spark job being created
# priority : Spark job priority
# conf     : other Spark parameters
conf = {"spark.driver.cores": 5,         # default 1, maximum 40
        "spark.driver.memory": "10g",    # default 5g, maximum 80g
        "spark.executor.memory": "10g",  # default 5g, maximum 80g
        "spark.executor.instances": 8,   # by default, the number of executors is adjusted automatically according to the number of tasks
        "spark.executor.cores": 5}       # default 2, maximum 40
When the job is finished, close the session with spark_session.stop().
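Putting the pieces together, a minimal end-to-end sketch might look like the following. The app name and priority value are placeholders chosen for illustration; myutil is the platform's internal helper, so check the platform documentation for the priority labels and conf limits that actually apply.

import myutil

# Hypothetical app name and priority, shown only for illustration
conf = {"spark.executor.memory": "10g",
        "spark.executor.cores": 5}
spark_session = myutil.get_sparksession(app_name="demo_job",
                                        priority="normal",  # assumed priority label
                                        conf=conf)
print(spark_session.version)  # the returned object behaves like a regular SparkSession
spark_session.stop()          # release cluster resources when done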
<3> Read data from the Hive table
After the spark_session is created, you can use Spark SQL to query Hive data. Example code is as follows:
# Note: a Spark DataFrame is returned, not a pandas DataFrame
# Use the toPandas() method if local processing is required
# Generate a Spark DataFrame
import myutil
# priority and conf are optional; if they are omitted, the default values are used
spark_session = myutil.get_sparksession(app_name='xxx')
sql_command = "select * from ks_hr.jupyter_gao_test"
sdf = spark_session.sql(sql_command)
sdf.show()
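Before converting anything to pandas, it can help to inspect the result on the cluster side first. A small sketch, reusing the sdf from the example above:

# Inspect the Spark DataFrame on the cluster before bringing data to the driver
sdf.printSchema()   # column names and types
print(sdf.count())  # row count, computed on the cluster
sdf.show(5)         # preview the first 5 rows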
<4> Local processing of a Spark DataFrame
Querying a Hive table with the Spark sql() function returns a Spark DataFrame object, not a pandas object. Because it is a Spark DataFrame, you can convert it to a local pandas DataFrame and then carry out further operations such as plotting and analysis.
# Convert the result to a pandas DataFrame
import myutil
spark_session = myutil.get_sparksession(app_name, priority, conf)
sql_command = "select * from ks_hr.jupyter_gao_test"
sdf = spark_session.sql(sql_command)
sdf.show()
data_df = sdf.toPandas() # spark dataframe --> pandas dataframe
print( data_df.head() )
Why convert a Spark DataFrame to a pandas DataFrame? After the heavy data processing has been done by the Spark computing engine, the (much smaller) result can be converted to a pandas DataFrame and then plotted and analyzed locally.
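As an illustration of that workflow, here is a minimal sketch that aggregates on the cluster first and only converts the small result for plotting. The column names dt and cnt are hypothetical, and matplotlib is assumed to be available in the notebook environment.

import myutil
import matplotlib.pyplot as plt

spark_session = myutil.get_sparksession()

# Aggregate on the cluster so that only a small result is collected locally
# (dt is a placeholder column name)
sdf = spark_session.sql("""
    select dt, count(*) as cnt
    from ks_hr.jupyter_gao_test
    group by dt
    order by dt
""")

pdf = sdf.toPandas()                    # small aggregated result -> pandas DataFrame
pdf.plot(x="dt", y="cnt", kind="line")  # plot locally with pandas/matplotlib
plt.show()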
<5> Spark reads HDFS files
import myutil
spark_session = myutil.get_sparksession()
# HDFS file: viewfs://hadoop-lt-cluster/home/hdp/gaozhenmin/fortune500.csv
data = spark_session.read.format("csv").option("header", "true").load("/home/hdp/gaozhenmin/fortune500.csv")
data.show()
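The same session can also write results back to HDFS. A small sketch with a placeholder output path; whether you have write permission there depends on your platform account:

# Write the DataFrame back to HDFS as Parquet (the output path is a placeholder)
output_path = "/home/hdp/gaozhenmin/fortune500_parquet"
data.write.mode("overwrite").parquet(output_path)

# Read it back to verify
check = spark_session.read.parquet(output_path)
check.show(5)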
<6> Add a third-party library to the PySpark cluster environment
When PySpark is used through myutil, third-party libraries can be used directly; you only need to install them locally and package them as described below. Note: if a third-party library's installation directory contains compiled .so files (for example pyarrow), it cannot be uploaded in the following way; please submit your requirement through the floating-window button at the lower right corner of the Jupyter page.
1 Package the local library
# Note: follow the steps carefully; the library must be zipped at the directory level.
# This example packages a python2.7 library (for python3.7, change the pip command below to pip3).
# Environment: Python 2.7.5, numpy 1.16.1

# Find where the library is installed
pip show scipy
#Name: scipy
#Version: 1.2.1
#Summary: SciPy: Scientific Library for Python
#Home-page: https://www.scipy.org
#Author: None
#Author-email: None
#License: BSD
#Location: /home/gaozhenmin/.local/lib/python2.7/site-packages
#Requires: numpy

# Go to that directory (enter the corresponding library folder) and package the library files.
# Packaging command: zip -r /tmp/<package name>.zip <package name>/*
cd ~/.local/lib/python2.7/site-packages
zip -r /tmp/scipy.zip scipy/*

# Check the zip
#-rw-rw-r-- 1 gaozhenmin HDP 28168027 May 18 14:31 /tmp/scipy.zip
2 Set spark conf
conf = {"spark.submit.pyFiles":"/tmp/scipy.zip"}
spark_session = myutil.get_sparksession(conf=conf)
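To check that the packaged library is actually visible on the executors, a quick sanity check like the following can help. This is only a sketch and assumes the /tmp/scipy.zip produced by the steps above:

import myutil

conf = {"spark.submit.pyFiles": "/tmp/scipy.zip"}
spark_session = myutil.get_sparksession(conf=conf)

# Import scipy inside a task so the import happens on an executor, not on the driver
def scipy_version(_):
    import scipy
    return scipy.__version__

versions = spark_session.sparkContext.parallelize(range(2), 2).map(scipy_version).collect()
print(versions)  # e.g. ['1.2.1', '1.2.1'] if the upload worked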
<7> Submit a Spark ML job
Here is the example code:
import myutil
from pyspark.ml.classification import LogisticRegression # Import model
# Load training data
spark_session = myutil.get_sparksession(app_name, priority, conf)
training = spark_session.read.format("libsvm").load("xxx.txt")
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)
trainingSummary = lrModel.summary
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
fMeasure = trainingSummary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']).select('threshold').head()['threshold']
lr.setThreshold(bestThreshold)
# We can also use the multinomial family for binary classification
mlr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0.8, family="multinomial")
# Fit the model
mlrModel = mlr.fit(training)
trainingSummary = mlrModel.summary
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
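To actually use a fitted model, apply it to a DataFrame with transform(). A minimal sketch, reusing the training data from above purely for illustration:

# transform() adds rawPrediction, probability and prediction columns
predictions = lrModel.transform(training)
predictions.select("label", "probability", "prediction").show(5, truncate=False)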
<8> Use spark-submit to submit a Spark job
If you want to submit a Spark cluster job in some way other than PySpark SQL or PySpark ML, you can use spark-submit. For details, see the spark-submit documentation.
<9> Description of sparkmagic
Sparkmagic is a set of tools for working interactively with Spark through Livy, a Spark REST server, from Jupyter notebooks. The project includes magics for interactively running Spark code as well as several kernels that turn Jupyter into an integrated Spark environment. For now, starting a Spark session this way from the current Python kernel is not supported.