1. Converting between DataFrame and DataSet

1. DataFrame to DataSet

1) Create a DataFrame

scala> val df = spark.read.json("/opt/module/spark-local/people.json")

df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

2) Create a case class

scala> case class Person(name: String, age: Long)
defined class Person

3) Convert DataFrame to DataSet

scala> df.as[Person]

res5: org.apache.spark.sql.Dataset[Person] = [age: bigint, name: string]

This conversion uses the as method once a type has been given to each column, which is very convenient when the data is a DataFrame and each field needs to be processed with typed code. Note that you must first run import spark.implicits._; otherwise toDF, toDS, and as cannot be used.
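To see what as[Person] buys you without needing a running Spark session, here is a plain-Scala sketch: untyped records (values typed as Any, like a DataFrame's rows) are converted into typed case-class instances. The rows collection and the conversion lambda are illustrative stand-ins, not Spark APIs; in Spark this conversion is performed by the Encoder brought into scope by import spark.implicits._.

```scala
// Plain-Scala sketch of the DataFrame -> Dataset idea: untyped rows
// become typed Person instances. Stand-ins only, not Spark classes.
case class Person(name: String, age: Long)

// Stand-in for a DataFrame's rows: field values are only available as Any.
val rows: Seq[Map[String, Any]] = Seq(Map("name" -> "zhangwuji", "age" -> 32L))

// Stand-in for df.as[Person]: each untyped row becomes a typed Person.
val people: Seq[Person] =
  rows.map(r => Person(r("name").asInstanceOf[String], r("age").asInstanceOf[Long]))

// Fields are now statically typed: people.head.age is a Long.
println(people.head.age)
```

Once the data is typed, field access is checked by the compiler instead of failing at runtime, which is the main benefit of working with a Dataset over a DataFrame.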

2. DataSet to DataFrame

1) Create a case class

scala> case class Person(name: String, age: Long)
defined class Person

2) Create a DataSet

scala> val ds = Seq(Person("zhangwuji", 32)).toDS()

ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]

3) Convert DataSet to DataFrame

scala> val df = ds.toDF
df: org.apache.spark.sql.DataFrame = [name: string, age: bigint]

4) Show the result

scala> df.show
+---------+---+
|     name|age|
+---------+---+
|zhangwuji| 32|
+---------+---+

2. Relationship between RDD, DataFrame, and DataSet

In SparkSQL, Spark provides us with two new abstractions: DataFrame and DataSet. How do they differ from RDD? First, look at the versions in which each appeared:

RDD (Spark 1.0) -> DataFrame (Spark 1.3) -> DataSet (Spark 1.6)

Given the same data, all three data structures produce the same result when computed. What differs is their execution efficiency and style of execution. In later versions of Spark, the DataSet may gradually replace RDD and DataFrame as the only API interface.

1. What the three have in common

  • 1) RDD, DataFrame, and DataSet are all distributed, resilient datasets on the Spark platform, convenient for processing very large data;
  • 2) All three are lazy: creating a transformation such as map does not execute anything immediately; the traversal begins only when an action such as foreach is encountered;
  • 3) All three share many common functions, such as filter and sorting;
  • 4) Many operations on DataFrame and DataSet require import spark.implicits._ (preferably imported immediately after the SparkSession object is created);
  • 5) All three automatically cache computations according to Spark's memory situation, so even with a large amount of data there is no need to worry about memory overflow;
  • 6) All three have the concept of partitions;
  • 7) Both DataFrame and DataSet can use pattern matching to obtain the value and type of each field.
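The lazy-execution point (2) can be sketched in plain Scala, with a LazyList standing in for an RDD/DataFrame/DataSet: building the map/filter pipeline runs nothing, and only a terminal "action" forces the traversal. This is an illustration of the evaluation model only, not Spark code.

```scala
// Plain-Scala sketch of Spark-style laziness; LazyList stands in
// for an RDD/DataFrame/DataSet.
var executed = 0

// Building transformations runs nothing yet.
val pipeline = LazyList(1, 2, 3, 4)
  .map { x => executed += 1; x * 2 } // "transformation"
  .filter(_ > 4)                     // "transformation"

val executedBeforeAction = executed  // still 0: nothing has run

// Materializing the result plays the role of an "action" and
// triggers the whole computation.
val result = pipeline.toList

println(executedBeforeAction) // 0
println(result)               // List(6, 8)
```

Spark behaves analogously: map and filter only record the lineage, and an action such as collect, show, or foreach triggers the actual job.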

2. Differences among the three

  1. RDD
  • RDD is generally used together with Spark MLlib
  • RDD does not support SparkSQL operations
  2. DataFrame
  • Unlike RDD and Dataset, every row of a DataFrame has the fixed type Row; the value of a column cannot be accessed directly and can only be obtained by parsing the Row
  • DataFrame and DataSet are generally not used together with Spark MLlib
  • Both DataFrame and DataSet support SparkSQL operations such as select and groupBy, and both can be registered as temporary tables/views for SQL statement operations
  • DataFrame and DataSet support some convenient saving methods, such as saving to CSV with a header, so that the name of each column is clear at a glance
  3. DataSet
  • DataSet and DataFrame have exactly the same member functions; the only difference is the type of each row. A DataFrame is in fact a special case of a DataSet:

type DataFrame = Dataset[Row]

  • A DataFrame can also be called a Dataset[Row]: the type of every row is Row, and which fields a Row contains, and of what types, cannot be known without parsing it. In a Dataset, by contrast, the type of each row is not fixed in advance; once a custom case class is defined, the information in each row can be accessed freely through its fields
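The Row-versus-case-class distinction can be sketched in plain Scala (these are stand-ins, not the real Spark Row class): a Row hides every field behind Any, so values must be cast out at runtime, while a Dataset's case class keeps the field types visible to the compiler.

```scala
// Plain-Scala sketch of untyped (Row-style) vs typed (Dataset-style) access.
case class Person(name: String, age: Long)

// Stand-in for a DataFrame Row: fields are only available as Any,
// so each value must be cast out, and a wrong cast fails at runtime.
val row: Seq[Any] = Seq("zhangwuji", 32L)
val ageFromRow = row(1).asInstanceOf[Long]

// Dataset-style typed access: the compiler knows the field types,
// so p.age is a Long with no casting needed.
val p = Person("zhangwuji", 32L)
val ageFromDataset = p.age

println(ageFromRow == ageFromDataset) // true
```

This is why the document recommends the Dataset when per-field processing is needed: mistakes in field names or types are caught at compile time instead of at runtime.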

3. Converting among the three