Learning Materials:

  • RDD, DataFrame, DataSet relation and conversion (JAVA API)
  • Common methods of DataFrame (code details)
  • Common methods of DataSet (code details)

DataFrame fundamentals

As we know, the RDD is an important concept from the early days of Spark: an immutable collection of data, made up of partitions spread across different nodes. A DataFrame, like an RDD, is an immutable, distributed collection of data. The difference is that the data is organized into named columns, just like the tables in a relational database. It is a structured, high-level abstraction that provides a domain-specific language (DSL) API for manipulating this distributed data.

After Spark 2.0, the DataFrame and DataSet APIs were unified: a DataFrame is equivalent to a Dataset[Row], so now you only need to deal with the Dataset-related APIs.

The limitations of DataFrame

  • No compile-time type checking: type safety cannot be verified at compile time, which restricts users when the structure of the data is unknown. For example, the following code compiles without errors, but throws an exception at runtime:
case class Person(name: String, age: Int)
val dataframe = sqlContext.read.json("people.json")
dataframe.filter("salary > 10000").show()
// throws Exception: cannot resolve 'salary' given input columns: age, name
  • Cannot preserve the structure of a class object: once an object with a class structure is converted to a DataFrame, it cannot be converted back. The example below illustrates this:
case class Person(name: String, age: Int)
val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
val personDF = sqlContext.createDataFrame(personRDD)
personDF.rdd // returns RDD[Row], not RDD[Person]

DataSet

The Dataset API is an extension of the DataFrame that adds type-safe checking and a programming interface for class-structured objects. A Dataset is a strongly typed, immutable collection that is mapped to a relational schema. At the heart of the Dataset API is a concept called the Encoder, which is responsible for converting between JVM objects and a tabular representation. The tabular representation is stored in Spark's built-in Tungsten binary format, which allows operating on serialized data and improves memory utilization. Since Spark 1.6, Encoders can be generated automatically for a wide range of types: primitive types (such as String, Integer, and Long), Scala case classes, and Java beans.
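As a quick illustration, here is a minimal Spark 2.x Java sketch of Encoder usage (the Person bean, with name and age properties and a matching constructor, is a hypothetical example assumed to be defined elsewhere):

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("encoder-demo").master("local[*]").getOrCreate();

// Built-in Encoder for a primitive type.
Dataset<String> names = spark.createDataset(Arrays.asList("A", "B"), Encoders.STRING());

// Encoder generated automatically from the getters/setters of a Java bean.
Dataset<Person> people = spark.createDataset(
        Arrays.asList(new Person("A", 10), new Person("B", 20)),
        Encoders.bean(Person.class));
people.show();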

In other words, DataFrame = Dataset[Row]: a DataFrame is simply an alias for a Dataset whose element type is Row.

In a Dataset, the type of each row is not fixed: once a case class (or Java bean) is defined, the information in each row can be accessed freely and in a type-safe way.
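For example, a sketch of the relationship (again assuming a hypothetical Person bean and the people.json file from earlier):

// A DataFrame is Dataset<Row>: fields are only resolved at runtime.
Dataset<Row> df = spark.read().json("people.json");
df.select("name").show();

// Attaching an Encoder turns it into a typed, compile-time-checked Dataset.
Dataset<Person> ds = df.as(Encoders.bean(Person.class));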

DataFrame and DataSet creation

Before Spark 2.0, different entry classes were used for different abstractions: SparkContext for RDDs (it also holds the Spark cluster configuration and interacts with the resource manager), SQLContext for DataFrames, and HiveContext for Hive access.

Starting with Spark 2.0, SparkSession is the unified entry class; it encapsulates SparkContext, SQLContext, and HiveContext.

  • To create an RDD, you need to get the SparkContext entry class from the SparkSession entry class
  • To create DataFrame and DataSet, use the SparkSession entry class.
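A minimal sketch of this (Spark 2.x Java; the app name is illustrative):

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("entry-demo")
        .master("local[*]")
        .getOrCreate();

// For RDDs: obtain the SparkContext from the session.
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

// For DataFrames and DataSets: use the session directly,
// e.g. spark.read().json(...) or spark.createDataset(...).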

Several ways to create a DataFrame

Spark SQL provides several ways to create a DataFrame:

  • 1. Create a DataFrame by reading a JSON file (see the sketch after this list)
  • 2. Create a DataFrame from an RDD of JSON strings (see the sketch after this list)
  • 3. Create a DataFrame from an RDD in a non-JSON format (important)
    • Convert a non-JSON RDD to a DataFrame by reflection
    • Dynamically create a schema to convert a non-JSON RDD to a DataFrame (recommended)
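A minimal sketch of the first two ways, in the Spark 1.x-style Java API used by the snippets below (file names and sample records are illustrative):

// Way 1: read a JSON file directly.
DataFrame df1 = sqlContext.read().json("people.json");
df1.show();

// Way 2: read an RDD whose elements are JSON strings.
JavaRDD<String> jsonRDD = jsc.parallelize(Arrays.asList(
        "{\"name\":\"A\",\"age\":10}",
        "{\"name\":\"B\",\"age\":20}"));
DataFrame df2 = sqlContext.read().json(jsonRDD);
df2.show();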

Convert non-JSON RDD to DataFrame by reflection

/**
 * When Person.class is passed in, sqlContext creates the DataFrame by
 * reflection on the bean and generates the schema from its fields.
 */
DataFrame df = sqlContext.createDataFrame(personRDD, Person.class);
df.show();
df.registerTempTable("person");
sqlContext.sql("select name from person where id = 2").show();
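The snippet assumes a personRDD of Java beans. A minimal sketch of that setup (the Person bean and the people.txt file of comma-separated id,name,age records are illustrative):

// The bean must be public and Serializable so its schema can be
// inferred by reflection from the getters/setters.
public static class Person implements Serializable {
    private String id;
    private String name;
    private int age;
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}

JavaRDD<Person> personRDD = jsc.textFile("people.txt").map(line -> {
    String[] parts = line.split(",");
    Person p = new Person();
    p.setId(parts[0]);
    p.setName(parts[1]);
    p.setAge(Integer.parseInt(parts[2]));
    return p;
});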

Create a Schema dynamically to convert a non-JSON RDD to a DataFrame

/**
 * Dynamically build the metadata (schema) of the DataFrame. The fields can
 * come from strings, and can also come from an external database.
 */
List<StructField> asList = Arrays.asList(
   // The field order here must match the data in the Rows.
   DataTypes.createStructField("id", DataTypes.StringType, true),
   DataTypes.createStructField("name", DataTypes.StringType, true),
   DataTypes.createStructField("age", DataTypes.IntegerType, true));
StructType schema = DataTypes.createStructType(asList);
DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
df.show();