1. Connecting to MySQL
Pass the JDBC driver with `--driver-class-path mysql-connector-java-5.1.21.jar`. In the database, run `SET GLOBAL binlog_format=mixed;`.
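As a minimal sketch of how this fits together (the host, database, table, and credentials are placeholders, and an existing SparkSession `spark` is assumed):

```scala
// Launch with the connector on the classpath, e.g.:
//   spark-submit --driver-class-path mysql-connector-java-5.1.21.jar \
//                --jars mysql-connector-java-5.1.21.jar app.jar
// (--driver-class-path covers the driver; --jars ships the jar to executors too.)
val df = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/testdb") // placeholder connection URL
  .option("dbtable", "t_user")                      // placeholder table name
  .option("user", "root")
  .option("password", "secret")
  .load()
```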
2. Using Hive UDFs in Spark
Pass the UDF jar with `--jars` as well.
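A common pattern is to ship the jar with `--jars` and then register the UDF by class name; the jar, function, and class names below are hypothetical, and a SparkSession `spark` with Hive support is assumed:

```scala
// Launch: spark-submit --jars my-hive-udfs.jar app.jar
// Register the Hive UDF by its implementing class, then use it in SQL.
spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper'")
spark.sql("SELECT my_upper(name) FROM people").show()
```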
3. Using Spark from Jupyter
www.jb51.net/article/163…
My.oschina.net/albert2011/…
Start the notebook with `jupyter-notebook --ip <hostname>`.
4. Spark and the Hive ORC format
spark.sql.hive.convertMetastoreOrc=true
If null-pointer or index-out-of-bounds errors occur when Spark writes to a Hive ORC table, the cause is that Spark parses the ORC metadata itself instead of using Hive's metadata parsing.
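A sketch of toggling this behavior at runtime (assumes an existing SparkSession `spark`): when `spark.sql.hive.convertMetastoreOrc` is `true`, Spark uses its own ORC reader/writer for metastore ORC tables; setting it to `false` falls back to the Hive SerDe, which is one way to sidestep the errors described above.

```scala
// Fall back to Hive's SerDe for metastore ORC tables
// instead of Spark's native ORC path.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
```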
5. Using the row_number window function
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.functions._
1. spark.sql(sql).withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(col("f_modify_time").desc)))
2. spark.sql(sql).withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(-col("f_modify_time"))))
3. val df = spark.sql(sql)
   df.withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(-df("f_modify_time"))))
4. spark.sql(sql).withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(-'f_modify_time)))
Note: in testing, the `-` (negation) way of expressing descending order proved unstable; sometimes it works and sometimes it does not.
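Putting the pieces above together, here is a sketch of the typical use case: keeping the latest row per key. It assumes an existing SparkSession `spark`, and the sample data and value column are made up:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
import spark.implicits._

// Made-up sample data: two versions of t1, one of t2.
val df = Seq(
  ("t1", "2020-01-01", "a"),
  ("t1", "2020-01-02", "b"),
  ("t2", "2020-01-01", "c")
).toDF("f_trans_id", "f_modify_time", "f_value")

// Number rows per f_trans_id, newest f_modify_time first,
// then keep only the newest row of each group.
val latest = df
  .withColumn("rn", row_number().over(
    Window.partitionBy('f_trans_id).orderBy(col("f_modify_time").desc)))
  .filter(col("rn") === 1)
  .drop("rn")
```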
6. Broadcasting a table
sc.broadcast broadcasts data and is generally used for RDDs; to broadcast a table, use the following method:
import org.apache.spark.sql.functions.broadcast
broadcast(tableData).createOrReplaceTempView(...)
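A sketch of both ways to use the hint (assumes an existing SparkSession `spark`; the table and column names are made up):

```scala
import org.apache.spark.sql.functions.broadcast

val small = spark.table("dim_user")    // small dimension table (placeholder)
val big   = spark.table("fact_orders") // large fact table (placeholder)

// Either hint the join directly, forcing a map-side (broadcast) join...
val joined = big.join(broadcast(small), Seq("user_id"))

// ...or register the broadcast-hinted table for use from SQL:
broadcast(small).createOrReplaceTempView("dim_user_bc")
```

The hint only marks the DataFrame as small enough to ship to every executor; Spark still decides the physical plan, so it is most useful when the optimizer's size estimate is wrong.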