1. Connecting to MySQL

Pass the JDBC driver with `--driver-class-path mysql-connector-java-5.1.21.jar`. In the database, run `SET GLOBAL binlog_format=mixed;`.
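As a sketch of reading a MySQL table once the driver is on the classpath (the host, database, table, and credentials below are placeholders, not from the original note):

```scala
// Read a MySQL table over JDBC; only the driver jar/version
// comes from the note above, everything else is illustrative.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/mydb")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "my_table")
  .option("user", "user")
  .option("password", "password")
  .load()
```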

2. Using Hive UDFs in Spark

Ship the UDF jar with `--jars` as well.
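A hedged sketch of the launch command (the jar path, function name, and UDF class are illustrative, not from the original):

```shell
# Put the Hive UDF jar on both driver and executor classpaths,
# then register and call it; names below are placeholders.
spark-sql --jars /path/to/my-hive-udfs.jar \
  -e "CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper';
      SELECT my_upper(name) FROM t"
```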

3. Using Spark from Jupyter

www.jb51.net/article/163…

my.oschina.net/albert2011/…

Start it with `jupyter-notebook --ip hostname`.
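One common setup (the environment variables are standard PySpark settings, though not stated in the original) is to make `pyspark` launch its driver inside a Jupyter notebook server:

```shell
# Point PySpark's driver Python at Jupyter so the shell
# starts as a notebook server; hostname is a placeholder.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip hostname'
pyspark
```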

4. Spark and the Hive ORC format

spark.sql.hive.convertMetastoreOrc=true

If Spark is used to write data into a Hive ORC table, null pointer exceptions or array-out-of-bounds errors may occur. The cause is that Spark parses the ORC metadata itself rather than going through Hive's metadata parsing.
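As a configuration sketch (the application name is illustrative), the flag can be set when building the session or at runtime:

```scala
// spark.sql.hive.convertMetastoreOrc controls whether Spark uses
// its native ORC reader/writer for metastore ORC tables (true)
// or falls back to Hive's SerDe path (false).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-demo") // illustrative name
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  .getOrCreate()

// Or toggle it at runtime:
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
```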

5. Using the row_number window function

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.functions._


1. spark.sql(sql).withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(col("f_modify_time").desc)))

2. spark.sql(sql).withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(-col("f_modify_time"))))

3. val df = spark.sql(sql)

   df.withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(-df("f_modify_time"))))

4. spark.sql(sql).withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(-'f_modify_time)))

Note: the negation (`-col(...)`) variants proved unstable in testing; they sometimes work and sometimes do not. Prefer `.desc`.
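A self-contained sketch of the stable `.desc` variant, keeping only the latest row per key (the sample data, session settings, and column names mirror the snippets above but are otherwise illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder()
  .appName("rn-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("t1", "2020-01-01"), ("t1", "2020-01-02"), ("t2", "2020-01-01")
).toDF("f_trans_id", "f_modify_time")

// Number rows per f_trans_id, newest first, then keep rank 1
val latest = df
  .withColumn("rn", row_number().over(
    Window.partitionBy('f_trans_id).orderBy(col("f_modify_time").desc)))
  .filter('rn === 1)
```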

6. Broadcasting a table

`sc.broadcast` broadcasts data and is generally used for RDDs. To broadcast a table, use the following method:

import org.apache.spark.sql.functions.broadcast

broadcast(tableData).createOrReplaceTempView("viewName")  // the view name here is illustrative
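A sketch of the effect (the table contents and names are illustrative): wrapping the small side in `broadcast` hints Spark to plan a broadcast hash join instead of shuffling both sides.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("bcast-demo").master("local[*]").getOrCreate()
import spark.implicits._

val big   = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")
val small = Seq((1, "x"), (2, "y")).toDF("id", "w")

// Hint that `small` should be copied to every executor,
// so the big side is joined without a shuffle
val joined = big.join(broadcast(small), Seq("id"))
joined.explain() // plan should show a BroadcastHashJoin
```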