1/ Jupyter Notebook’s current computing types include
<1> Local mode (i.e., standalone mode)
Based on the platform instance's resources, the Python 2, Python 3, R, and other kernels perform computation locally (i.e., single-machine mode). By default, 10G and 20G packages are provided, which limits the amount of data that can be processed (processing no more than about one million rows is recommended). For some modeling users whose model training requires a large sample size, a 50G package is also provided. In other words, although this is single-machine mode, three packages are available: 10G, 20G, and 50G.
<2> If a larger amount of data needs to be processed, distributed computing is recommended (the job is submitted to the Spark distributed computing engine).
New approach (recommended): create a Notebook directly with the Python kernel and submit Spark jobs through the integrated myutil methods.
Old approach (not recommended): create a script file with the PySpark kernel and rely on the Livy proxy to submit Spark jobs.
<3> To summarize <1> and <2> above:
There are only two computing types: single-machine mode and Spark distributed computing engine mode. Single-machine mode is suitable for small amounts of data; distributed mode is suitable for large amounts of data.
2/ Local computing (standalone mode)
Single-machine mode is not covered here because it is not the focus of this article.
3/ Spark distributed computing engine: submitting Spark jobs
<1> Submit Spark jobs using the Python 3 kernel + the integrated myutil methods
The benefits of the new Spark session generation mode are as follows. Create a Notebook (Jupyter instance) with the Python 3 kernel; once the instance's network mode is switched to host, a Spark session can be created directly inside the Python kernel and can communicate with the distributed processing system. The new mode supports both the Python 2 and Python 3 kernels. In the original mode, the PySpark kernel communicates with the distributed system through the Livy proxy, and the Python version on Livy only supports 2.x. With the right parameter configuration, the Python kernel can obtain information such as the current user's development group and configure the job automatically once the user specifies the computing priority.
<2> Create a Spark session using the myutil integrated with the Python kernel
A spark_session is like a house: with the house you can do everything you want inside it, and without it you can do nothing (the house is the environment). In the Python 3 kernel, instead of creating sessions through the PySpark kernel, use the new get_sparksession() method provided by myutil.
import myutil
# arguments are not required because they have default values
spark_session = myutil.get_sparksession(app_name, priority, conf)
# The get_sparksession() function takes three arguments; none is required and all have default values.
# app_name : name of the Spark job being created
# priority : Spark job priority
# conf     : other Spark parameters
conf = {"spark.driver.cores": 5,         # default 1, maximum 40
        "spark.driver.memory": "10g",    # default 5g, maximum 80g
        "spark.executor.memory": "10g",  # default 5g, maximum 80g
        "spark.executor.instances": 8,   # by default, the number of executors is adjusted automatically according to the number of tasks
        "spark.executor.cores": 5}       # default 2, maximum 40
When the job is finished, close the session with spark_session.stop().
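Putting the pieces together, a minimal end-to-end sketch might look like the following. The app name and priority value are placeholders chosen for illustration; myutil is the platform's internal helper, so check the platform documentation for the priority labels and conf limits that actually apply.

import myutil

# Hypothetical app name and priority, shown only for illustration
conf = {"spark.executor.memory": "10g",
        "spark.executor.cores": 5}
spark_session = myutil.get_sparksession(app_name="demo_job",
                                        priority="normal",  # assumed priority label
                                        conf=conf)
print(spark_session.version)  # the returned object behaves like a regular SparkSession
spark_session.stop()          # release cluster resources when done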
<3> Read data from the Hive table
After the spark_session is created, you can use Spark SQL to query Hive data. Example code is as follows:
# Note: a Spark DataFrame is returned, not a pandas DataFrame
# Use the toPandas() method if local processing is required
# Generate a Spark DataFrame
import myutil
# priority and conf are optional; if they are omitted, the default values are used
spark_session = myutil.get_sparksession(app_name='xxx')
sql_command = "select * from ks_hr.jupyter_gao_test"
sdf = spark_session.sql(sql_command)
sdf.show()
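Before converting anything to pandas, it can help to inspect the result on the cluster side first. A small sketch, reusing the sdf from the example above:

# Inspect the Spark DataFrame on the cluster before bringing data to the driver
sdf.printSchema()   # column names and types
print(sdf.count())  # row count, computed on the cluster
sdf.show(5)         # preview the first 5 rows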
<4> Local processing of a Spark DataFrame
Querying a Hive table with the Spark sql() function returns a Spark DataFrame object, not a pandas object. Because it is a Spark DataFrame, you can convert it to a local pandas DataFrame and then carry out further operations such as plotting and analysis.
# Convert the result to a pandas DataFrame
import myutil
spark_session = myutil.get_sparksession(app_name, priority, conf)
sql_command = "select * from ks_hr.jupyter_gao_test"
sdf = spark_session.sql(sql_command)
sdf.show()
data_df = sdf.toPandas() # spark dataframe --> pandas dataframe
print( data_df.head() )
Why convert a Spark DataFrame to a pandas DataFrame? After the heavy data processing has been done by the Spark computing engine, the (much smaller) result can be converted to a pandas DataFrame and then plotted and analyzed locally.
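As an illustration of that workflow, here is a minimal sketch that aggregates on the cluster first and only converts the small result for plotting. The column names dt and cnt are hypothetical, and matplotlib is assumed to be available in the notebook environment.

import myutil
import matplotlib.pyplot as plt

spark_session = myutil.get_sparksession()

# Aggregate on the cluster so that only a small result is collected locally
# (dt is a placeholder column name)
sdf = spark_session.sql("""
    select dt, count(*) as cnt
    from ks_hr.jupyter_gao_test
    group by dt
    order by dt
""")

pdf = sdf.toPandas()                    # small aggregated result -> pandas DataFrame
pdf.plot(x="dt", y="cnt", kind="line")  # plot locally with pandas/matplotlib
plt.show()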
<5> Spark reads HDFS files
import myutil
spark_session = myutil.get_sparksession()
# HDFS file: viewfs://hadoop-lt-cluster/home/hdp/gaozhenmin/fortune500.csv
data = spark_session.read.format("csv").option("header", "true").load("/home/hdp/gaozhenmin/fortune500.csv")
data.show()
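The same session can also write results back to HDFS. A small sketch with a placeholder output path; whether you have write permission there depends on your platform account:

# Write the DataFrame back to HDFS as Parquet (the output path is a placeholder)
output_path = "/home/hdp/gaozhenmin/fortune500_parquet"
data.write.mode("overwrite").parquet(output_path)

# Read it back to verify
check = spark_session.read.parquet(output_path)
check.show(5)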
<6> Add a third-party library to the PySpark cluster environment
When PySpark is used through myutil, third-party libraries can be used directly; you only need to install them locally and package them as described below. Note: if a third-party library's installation directory contains compiled .so files (for example pyarrow), it cannot be uploaded in the following way; please submit your requirement through the floating-window button at the lower right corner of the Jupyter page.
1 Package the local library
# Note: follow the steps carefully; the library must be zipped at the directory level.
# This example packages a python2.7 library (for python3.7, change the pip command below to pip3).
# Environment: Python 2.7.5, numpy 1.16.1

# Find where the library is installed
pip show scipy
#Name: scipy
#Version: 1.2.1
#Summary: SciPy: Scientific Library for Python
#Home-page: https://www.scipy.org
#Author: None
#Author-email: None
#License: BSD
#Location: /home/gaozhenmin/.local/lib/python2.7/site-packages
#Requires: numpy

# Go to that directory (enter the corresponding library folder) and package the library files.
# Packaging command: zip -r /tmp/<package name>.zip <package name>/*
cd ~/.local/lib/python2.7/site-packages
zip -r /tmp/scipy.zip scipy/*

# Check the zip
#-rw-rw-r-- 1 gaozhenmin HDP 28168027 May 18 14:31 /tmp/scipy.zip
2 Set spark conf
conf = {"spark.submit.pyFiles":"/tmp/scipy.zip"}
spark_session = myutil.get_sparksession(conf=conf)
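To check that the packaged library is actually visible on the executors, a quick sanity check like the following can help. This is only a sketch and assumes the /tmp/scipy.zip produced by the steps above:

import myutil

conf = {"spark.submit.pyFiles": "/tmp/scipy.zip"}
spark_session = myutil.get_sparksession(conf=conf)

# Import scipy inside a task so the import happens on an executor, not on the driver
def scipy_version(_):
    import scipy
    return scipy.__version__

versions = spark_session.sparkContext.parallelize(range(2), 2).map(scipy_version).collect()
print(versions)  # e.g. ['1.2.1', '1.2.1'] if the upload worked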
<7> Submit a Spark ML job
Here is the example code:
import myutil
from pyspark.ml.classification import LogisticRegression # Import model
# Load training data
spark_session = myutil.get_sparksession(app_name, priority, conf)
training = spark_session.read.format("libsvm").load("xxx.txt")
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)
trainingSummary = lrModel.summary
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
fMeasure = trainingSummary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']).select('threshold').head()['threshold']
lr.setThreshold(bestThreshold)
# We can also use the multinomial family for binary classification
mlr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0.8, family="multinomial")
# Fit the model
mlrModel = mlr.fit(training)
trainingSummary = mlrModel.summary
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
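To actually use a fitted model, apply it to a DataFrame with transform(). A minimal sketch, reusing the training data from above purely for illustration:

# transform() adds rawPrediction, probability and prediction columns
predictions = lrModel.transform(training)
predictions.select("label", "probability", "prediction").show(5, truncate=False)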
<8> Use spark-submit to submit a Spark job
If you want to submit a Spark cluster job in some way other than PySpark SQL or PySpark ML, you can use spark-submit. For details, see the spark-submit documentation.
<9> Description of sparkmagic
Sparkmagic is a set of tools for working interactively with Spark through Livy, a Spark REST server, from Jupyter notebooks. The project includes magics for interactively running Spark code as well as several kernels that turn Jupyter into an integrated Spark environment. For now, starting a Spark session this way from the current Python kernel is not supported.