1. New Features of Spark 2.x

  1. DataFrame = Dataset[Row]

  2. SparkSession: unifies SQLContext and HiveContext into a single entry point for Spark applications

  3. Support for caching and off-heap memory management during program execution

  4. Accumulators are enhanced with a more convenient API, Web UI support, and better performance

  5. Whole-stage code generation: Spark 2.0 ships with a second-generation Tungsten engine. Where possible it fuses the operators that dominate a query's run time into a single generated function (very complex plans may be split into several functions), eliminating virtual function calls and keeping intermediate data in CPU registers. Fusing adjacent operators into one block of code removes the redundant hand-off of data between operators (often a major performance cost), and a large code segment that does not cross function boundaries is more likely to be JIT-compiled by the JVM. The ultimate goal is for the generated code to run roughly as fast as handwritten code. explain() shows the physical plan; operators prefixed with * are covered by whole-stage generated code, e.g.:

    select id + 1 from students where id <> 0;

    The compute part of the plan above contains two operators: a Projection operator for id + 1 and a Filter operator for id <> 0. Whole-stage code generation fuses the two into a single generated block of code (see the sketch after this list).

  6. Vectorization: used where whole-stage code generation cannot be applied (for example, around complex operators or calls into third-party component code). The core idea is that instead of processing a single row at a time, rows are grouped into batches stored in a columnar format, and each operator runs a simple loop over each batch to iterate through the data. Each call to next() then returns a batch of tuples, which amortizes the overhead of the virtual function call, and the tight loops let the compiler and CPU work more efficiently. Vectorization still has to keep temporary data in memory rather than in CPU registers, so it is only used when whole-stage code generation is not available, for example in the decoding path of the Parquet reader.

  7. Structured Streaming
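
A minimal Scala sketch of items 2 and 5 above, assuming a local Spark 2.x build: a SparkSession is created as the unified entry point, a tiny students table is registered (the table name and data are made up for this example), and explain() prints the physical plan in which operators marked with * were produced by whole-stage code generation.

import org.apache.spark.sql.SparkSession

// Unified entry point: SparkSession replaces SQLContext/HiveContext (feature 2)
val spark = SparkSession.builder()
  .appName("whole-stage-codegen-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A tiny, made-up "students" table, just so the sample query has something to run on
Seq((0, "a"), (1, "b"), (2, "c")).toDF("id", "name").createOrReplaceTempView("students")

val result = spark.sql("select id + 1 from students where id <> 0")

// Operators prefixed with * in the physical plan are fused by whole-stage code generation
result.explain()
result.show()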

1.1 Spark 1.x vs Spark 2.x

Spark 2.x focuses on three areas: the second-generation Tungsten engine, the DataFrame/Dataset API, and Structured Streaming.

Spark 1.x: Spark Core (RDD), Spark SQL (SQL + DataFrame + Dataset), Spark Streaming, Spark MLlib, Spark GraphX

Spark 2.x: Spark Core (RDD), Spark SQL (ANSI SQL + subqueries + DataFrame/Dataset), Spark Streaming, Structured Streaming, Spark MLlib (DataFrame/Dataset-based), Spark GraphX, second-generation Tungsten engine (whole-stage code generation + vectorization)

1.2 RDD Application Scenarios

  1. Low-level operations: manual management of RDD partitions, low-level tuning, and troubleshooting (a small sketch follows this list)
  2. Processing unstructured data, such as multimedia or free-form text
  3. Cases where the performance optimizations offered by the second-generation Tungsten engine, such as whole-stage code generation, are not needed
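
A small, hedged sketch of the low-level control described in item 1: building an RDD with an explicit partition count and processing each partition as a unit. The data and partition count are arbitrary values chosen for the example.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-low-level").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Explicitly control the number of partitions when the RDD is created
val rdd = sc.parallelize(1 to 100, numSlices = 8)

// Work on whole partitions, keeping track of which partition each result came from
val perPartitionSums = rdd.mapPartitionsWithIndex { (index, iter) =>
  Iterator((index, iter.sum))
}

perPartitionSums.collect().foreach(println)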

1.3 Dataset vs RDD

RDD serialization relies on Java serialization or Kryo, while Dataset serialization relies on a special Encoder that serializes objects efficiently for high-performance processing and network transfer. A Dataset also supports Java serialization, but an Encoder uses dynamically generated code and a dedicated binary format, which lets Spark perform many common operations, such as filter, sort, and hash, directly on the binary data without first deserializing the objects.
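
A brief sketch of the Encoder mechanism, assuming Scala and Spark 2.x; the Person case class and sample rows are invented for the example. Importing spark.implicits._ brings the Encoder for the case class into scope, and column-based operations can then run against the encoded binary form.

import org.apache.spark.sql.{Encoders, SparkSession}

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("encoder-demo").master("local[*]").getOrCreate()
import spark.implicits._

// The implicit Encoder[Person] maps the case class to Spark's internal binary format
val people = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()

// Column-based filter and sort operate on the encoded data without building Person objects
people.filter($"age" > 20).orderBy($"age").show()

// The Encoder also carries the schema derived from the case class
println(Encoders.product[Person].schema)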

1.4 Dataframe/Dataset Application scenarios

  1. Richer computational semantics, higher-level abstractions, and domain-specific APIs are needed.
  2. The computation logic calls for high-level expressions: filter, map, aggregation, average, sum, SQL queries, and lambda expressions over semi-structured or structured data (see the sketch after this list).
  3. A high degree of compile-time and runtime type safety is required.
  4. You want to improve performance with Spark SQL's Catalyst optimizer (Spark SQL's query optimizer, similar to the execution-plan optimizer of a database engine) and the second-generation Tungsten engine in Spark 2.x.
  5. A single, unified API is wanted for batch, streaming, and machine-learning workloads.
  6. R and Python users can only use the DataFrame API (the typed Dataset API is available only in Scala and Java).
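
A hedged sketch of the mixed relational and lambda semantics from item 2: column expressions with an aggregation, plus a typed map over the same data. The employees rows and column names are made up for the example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("df-ds-api-demo").master("local[*]").getOrCreate()
import spark.implicits._

val employees = Seq(("dev", "Ann", 3000.0), ("dev", "Bob", 4000.0), ("ops", "Cid", 3500.0))
  .toDF("dept", "name", "salary")

// Relational, SQL-like semantics: filter plus aggregation with an average
employees.filter($"salary" > 3200).groupBy("dept").agg(avg("salary").as("avg_salary")).show()

// Lambda semantics on a typed Dataset of tuples
employees.as[(String, String, Double)].map { case (dept, name, _) => s"$dept/$name" }.show()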

1.5 DataFrame vs Dataset

Dataframe = Dataset[Row]

  • Untyped API (DataFrame): a Row is an untyped object, much like a row in a database table: only the columns and their names are known, which is why the DataFrame is described as weakly typed. Because its API model stays close to SQL, a DataFrame benefits fully from the Catalyst optimizer, whereas a raw RDD, whose API is completely free-form and which has no schema attached when the job is analyzed, gets none of that optimization. A DataFrame has no complete static type checking and therefore no compile-time type safety.

  • Typed API (Dataset): Dataset[T] is a typed API whose elements are always user-defined objects, so a Dataset is strongly typed: field names and field types are known, the lambda-style API is supported, and Catalyst optimization still applies. Because it also carries a schema, the data can be serialized according to its type into native (off-heap) memory rather than the Java heap, without materializing Java objects, which is faster and saves memory (a small sketch follows).
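
A minimal sketch of DataFrame = Dataset[Row]: the same data read through an untyped Row and through a typed case class. Person and the sample values are assumptions made for this example.

import org.apache.spark.sql.{Row, SparkSession}

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("df-vs-ds").master("local[*]").getOrCreate()
import spark.implicits._

// A DataFrame is just Dataset[Row]
val df = Seq(Person("Andy", 32), Person("Justin", 19)).toDF()

// Untyped access: fields are looked up by name at runtime, with no compile-time checking
df.collect().foreach((row: Row) => println(row.getAs[String]("name")))

// Typed access: the compiler knows the fields of Person
val ds = df.as[Person]
ds.filter(_.age > 20).map(_.name).show()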

2. Spark SQL

DataFrame untyped operations

// In spark-shell, `spark` is a pre-built SparkSession; the $-column syntax needs this import
import spark.implicits._

val df = spark.read.json("people.json")
df.show()
df.printSchema()

// Untyped (DataFrame) transformations
df.select("name").show()
df.select($"name", $"age" + 1).show()
df.filter($"age" > 21).show()
df.groupBy("age").count().show()

// Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()

3. Structured Streaming

3.1 Comparison of Structured Streaming and other Streaming computing applications

| Attribute | Structured Streaming | Spark Streaming | Apache Storm | Apache Flink | Kafka Streams | Google Dataflow |
| --- | --- | --- | --- | --- | --- | --- |
| Streaming API | Incremental execution of batch computations | Based on the batch engine | Unrelated to batch processing | Unrelated to batch processing | Unrelated to batch processing | Based on the batch engine |
| Completeness guarantee based on data prefix integrity | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Consistency semantics | exactly-once | exactly-once | exactly-once | exactly-once | at-least-once | exactly-once |
| Transactional storage support | ✓ | Partial | Partial | Partial | ✗ | ✗ |
| Interactive queries | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Join with static data | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |

Output modes

  • Complete mode: the entire updated result table is written to external storage after every trigger.
  • Append mode: only the rows appended to the result table since the last trigger are written to external storage. Use append mode only when you are sure that existing rows in the result table will never change.
  • Update mode: only the rows that have been updated in the result table since the last trigger (both additions and modifications) are written to external storage (a small sketch follows this list).
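
A hedged sketch of a minimal Structured Streaming job showing where the output mode is chosen; the socket source, host/port, and the console sink are assumptions made for the example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

val spark = SparkSession.builder().appName("structured-streaming-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Read a stream of text lines from a socket (host/port chosen arbitrarily for illustration)
val lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

// A running word count: an aggregation, so complete or update mode applies
val wordCounts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = wordCounts.writeStream
  .outputMode(OutputMode.Complete())   // or OutputMode.Update(); append would need a watermark here
  .format("console")
  .start()

query.awaitTermination()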
