1. Connecting to MySQL

Pass the JDBC driver with `--driver-class-path mysql-connector-java-5.1.21.jar`. In the database, run `SET GLOBAL binlog_format=mixed;`.
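As a sketch of reading a MySQL table once the driver is on the classpath (the host, database, table, and credentials below are placeholders, not from the original note):

```scala
// Read a MySQL table over JDBC; only the driver jar/version
// comes from the note above, everything else is illustrative.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/mydb")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "my_table")
  .option("user", "user")
  .option("password", "password")
  .load()
```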

2. Using Hive UDFs in Spark

Ship the UDF jar with `--jars` as well.
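A hedged sketch of the launch command (the jar path, function name, and UDF class are illustrative, not from the original):

```shell
# Put the Hive UDF jar on both driver and executor classpaths,
# then register and call it; names below are placeholders.
spark-sql --jars /path/to/my-hive-udfs.jar \
  -e "CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper';
      SELECT my_upper(name) FROM t"
```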

3. Using Spark from Jupyter

www.jb51.net/article/163…

my.oschina.net/albert2011/…

Start it with `jupyter-notebook --ip hostname`.
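One common setup (the environment variables are standard PySpark settings, though not stated in the original) is to make `pyspark` launch its driver inside a Jupyter notebook server:

```shell
# Point PySpark's driver Python at Jupyter so the shell
# starts as a notebook server; hostname is a placeholder.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip hostname'
pyspark
```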

4. Spark and the Hive ORC format

spark.sql.hive.convertMetastoreOrc=true

If Spark is used to write data into a Hive ORC table, null pointer exceptions or array-out-of-bounds errors may occur. The cause is that Spark parses the ORC metadata itself rather than going through Hive's metadata parsing.
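As a configuration sketch (the application name is illustrative), the flag can be set when building the session or at runtime:

```scala
// spark.sql.hive.convertMetastoreOrc controls whether Spark uses
// its native ORC reader/writer for metastore ORC tables (true)
// or falls back to Hive's SerDe path (false).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-demo") // illustrative name
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  .getOrCreate()

// Or toggle it at runtime:
spark.sql("SET spark.sql.hive.convertMetastoreOrc=true")
```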

5. Using the row_number window function

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.functions._


1. spark.sql(sql).withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(col("f_modify_time").desc)))

2. spark.sql(sql).withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(-col("f_modify_time"))))

3. val df = spark.sql(sql)

   df.withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(-df("f_modify_time"))))

4. spark.sql(sql).withColumn("rn", row_number().over(Window.partitionBy('f_trans_id).orderBy(-'f_modify_time)))

Note: the negation (`-col(...)`) variants proved unstable in testing; they sometimes work and sometimes do not. Prefer `.desc`.
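A self-contained sketch of the stable `.desc` variant, keeping only the latest row per key (the sample data, session settings, and column names mirror the snippets above but are otherwise illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder()
  .appName("rn-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  ("t1", "2020-01-01"), ("t1", "2020-01-02"), ("t2", "2020-01-01")
).toDF("f_trans_id", "f_modify_time")

// Number rows per f_trans_id, newest first, then keep rank 1
val latest = df
  .withColumn("rn", row_number().over(
    Window.partitionBy('f_trans_id).orderBy(col("f_modify_time").desc)))
  .filter('rn === 1)
```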

6. Broadcasting a table

`sc.broadcast` broadcasts data and is generally used for RDDs. To broadcast a table, use the following method:

import org.apache.spark.sql.functions.broadcast

broadcast(tableData).createOrReplaceTempView("viewName")  // the view name here is illustrative
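A sketch of the effect (the table contents and names are illustrative): wrapping the small side in `broadcast` hints Spark to plan a broadcast hash join instead of shuffling both sides.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("bcast-demo").master("local[*]").getOrCreate()
import spark.implicits._

val big   = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")
val small = Seq((1, "x"), (2, "y")).toDF("id", "w")

// Hint that `small` should be copied to every executor,
// so the big side is joined without a shuffle
val joined = big.join(broadcast(small), Seq("id"))
joined.explain() // plan should show a BroadcastHashJoin
```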