Apache Spark 3.0 builds on Spark 2.x and brings a number of new ideas and features.
Spark is an open source unified engine for big data processing, data science, machine learning, and data analytics workloads. It has grown into one of the most active open source projects since its initial release in 2010, and it provides APIs for Java, Scala, Python, R, and other languages.
Spark SQL is the most active component in this release: 46% of the resolved issues are for Spark SQL, and those enhancements benefit the higher-level libraries, including Structured Streaming and MLlib, as well as the high-level APIs, including SQL and DataFrames. After extensive optimization, Spark 3.0 is roughly two times faster than Spark 2.4 (as measured on the TPC-DS benchmark).
Python is by far the most widely used language on Spark; PySpark has more than 5 million monthly downloads on PyPI. Spark 3.0 brings a number of improvements to PySpark's functionality and usability, including a redesign of the pandas UDF API based on Python type hints, new pandas UDF types, and more Pythonic error handling.
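As a hedged sketch of the redesigned API (the column names and data here are illustrative), a Spark 3.0 pandas UDF can be declared with ordinary Python type hints rather than a separate UDF-type argument:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 3.5)], ("id", "v"))

# Spark 3.0 infers the UDF variant from the Python type hints
@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

df.select(plus_one("v")).show()
```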
Here are some of the highlights of Spark 3.0: adaptive query execution, dynamic partition pruning, ANSI SQL compliance, major improvements to the pandas UDF API, a new UI for Structured Streaming, up to a 40-fold speedup when calling R user-defined functions, an accelerator-aware scheduler, and SQL reference documentation.
These changes can be grouped into the following modules:
- Core, Spark SQL, and Structured Streaming
- MLlib
- SparkR
- GraphX
- Deprecation of Python 2 and of R versions below 3.4
- Fixes for known issues
Core, Spark SQL, and Structured Streaming

Prominent features
- Accelerator-aware scheduler (see the configuration sketch after this list)
- Adaptive query execution
- Dynamic partition pruning
- Redesigned pandas UDF API with type hints
- Structured Streaming UI
- Catalog plugin API support
- Java 11 support
- Hadoop 3 support
- Better ANSI SQL compatibility
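As a rough sketch of accelerator-aware scheduling (the discovery-script path is an assumption, and this only runs on a cluster that actually exposes GPUs):

```python
from pyspark import TaskContext
from pyspark.sql import SparkSession

# Request one GPU per executor and per task; the discovery script path is
# illustrative and must point at a script that reports the local GPUs
spark = (SparkSession.builder
         .config("spark.executor.resource.gpu.amount", "1")
         .config("spark.task.resource.gpu.amount", "1")
         .config("spark.executor.resource.gpu.discoveryScript",
                 "/opt/spark/getGpus.sh")
         .getOrCreate())

# Inside a task, the GPU addresses assigned to it are visible on the TaskContext
def assigned_gpus(_):
    return TaskContext.get().resources()["gpu"].addresses

print(spark.sparkContext.parallelize([0], 1).map(assigned_gpus).collect())
```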
Performance improvements
- Adaptive query execution (see the configuration sketch after this list)
- Dynamic partition pruning
- Optimization of nine rules
- Minimized table cache synchronization overhead
- Aggregation code split into small functions
- Batching added to the INSERT and ALTER TABLE ADD PARTITION commands
- Aggregators can now be registered as UDAFs
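Both headline optimizations are driven by SQL configuration flags. A minimal sketch of switching them on from PySpark (defaults may differ between builds):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive query execution: re-optimizes the query plan at runtime
# using statistics collected from completed shuffle stages
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Dynamic partition pruning: prunes fact-table partitions at runtime
# based on the filter applied to the dimension side of a join
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
```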
SQL compatibility enhancements
- Use of the Proleptic Gregorian calendar
- Spark's own datetime pattern definitions
- An ANSI store assignment policy introduced for table inserts
- ANSI store assignment rules followed by default in table inserts
- A new SQLConf, spark.sql.ansi.enabled, for turning on ANSI mode (demonstrated after this list)
- Support for the ANSI SQL FILTER clause on aggregate expressions
- Support for the ANSI SQL OVERLAY function
- Support for ANSI nested bracketed comments
- Exceptions thrown on integer overflow
- Overflow checks for interval arithmetic operations
- An exception thrown when an invalid string is cast to a numeric type
- Overflow behavior of interval multiplication and division made consistent with other operations
- ANSI type aliases added for char and decimal
- ANSI-compliant reserved keywords defined in the SQL parser
- Reserved keywords disallowed as identifiers when ANSI mode is enabled
- Support for ANSI SQL LIKE ... ESCAPE syntax
- Support for ANSI SQL Boolean-predicate syntax
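A small sketch of what the spark.sql.ansi.enabled switch changes, using an invalid string-to-integer cast:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default behavior: an invalid cast silently produces NULL
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT) AS v").show()    # v is NULL

# ANSI mode: the same cast fails at runtime instead
spark.conf.set("spark.sql.ansi.enabled", "true")
# spark.sql("SELECT CAST('abc' AS INT) AS v").show()  # raises an exception
```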
PySpark enhancements
- Redesigned pandas UDFs with type hints
- Pandas UDFs can take an iterator of pd.DataFrames
- StructType supported as the argument and return type of a scalar pandas UDF
- DataFrame cogroup supported via pandas UDFs
- mapInPandas added to allow an iterator of DataFrames (see the sketch after this list)
- Some SQL functions now also accept column names in addition to Column objects
- More Pythonic SQL exceptions in PySpark
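A minimal mapInPandas sketch (the data and the predicate are illustrative): the user function consumes an iterator of pandas DataFrames and yields transformed DataFrames conforming to the declared schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 21), (2, 30), (3, 17)], ("id", "age"))

# The function receives an iterator of pandas DataFrames (one per batch)
# and yields pandas DataFrames back
def keep_adults(batches):
    for pdf in batches:
        yield pdf[pdf.age >= 18]

df.mapInPandas(keep_adults, schema=df.schema).show()
```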
Extensibility enhancements
- Catalog plugin (see the sketch after this list)
- Data source V2 API refactoring
- Hive 3.0 and 3.1 metastore support
- The Spark plugin interface extended to the driver
- Spark's metrics system extensible with user-defined metrics
- Developer APIs for extended columnar processing support
- Built-in sources migrated to DSV2: parquet, ORC, CSV, JSON, Kafka, Text, Avro
- Function injection allowed through SparkExtensions
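A hedged sketch of plugging in a catalog; the catalog name mycat and the class com.example.MyCatalogPlugin are hypothetical placeholders for a real CatalogPlugin implementation that would have to be on the classpath:

```python
from pyspark.sql import SparkSession

# "mycat" and com.example.MyCatalogPlugin are hypothetical; the class would
# implement org.apache.spark.sql.connector.catalog.CatalogPlugin
spark = (SparkSession.builder
         .config("spark.sql.catalog.mycat", "com.example.MyCatalogPlugin")
         .getOrCreate())

# Objects in a plugged-in catalog are addressed with multi-part identifiers
spark.sql("SHOW NAMESPACES IN mycat").show()
```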
Connector enhancements
- Support for spark.sql.statistics.fallBackToHdfs in data source tables
- Apache ORC upgraded to 1.5.9
- Filter support for the CSV data source
- Inserts into Hive tables optimized by using native data sources
- Kafka upgraded to 2.4.1
- A new built-in binary file data source (see the sketch after this list), plus a new no-op batch data source and no-op streaming sink
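A short sketch of the new binary file source (the directory path and glob pattern are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each matching file becomes one row with path, modificationTime,
# length and content columns
df = (spark.read.format("binaryFile")
      .option("pathGlobFilter", "*.png")
      .load("/data/images"))
df.select("path", "length").show()
```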
Native Spark applications on Kubernetes
- More responsive dynamic allocation on K8s (see the sketch after this list), and Kerberos support for Spark on K8s
- Client dependencies supported via Hadoop-compatible file systems
- A configurable authentication secret source added to the K8s backend
- K8s subpath mounting supported
- Python 3 made the default in the PySpark bindings for K8s
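A minimal sketch of the piece that makes dynamic allocation practical on K8s: shuffle tracking, which removes the dependency on an external shuffle service:

```python
from pyspark.sql import SparkSession

# Shuffle tracking (new in 3.0) lets dynamic allocation work without an
# external shuffle service, which is what makes it usable on Kubernetes
spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
         .getOrCreate())
```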
MLlib
- Multiple-column support added to Binarizer, StringIndexer, StopWordsRemover, and PySpark's QuantileDiscretizer (see the sketch after this list)
- Support for tree-based feature transformation
- Two new evaluators added: MultilabelClassificationEvaluator and RankingEvaluator
- An R API added for PowerIterationClustering
- A Spark ML listener added for tracking the status of ML pipelines
- Fit with a validation set added to gradient-boosted trees in Python
- RobustScaler transformer added
- Factorization Machines classifier and regressor added
- Gaussian Naive Bayes and Complement Naive Bayes added

In addition, in Spark 3.0 a multiclass logistic regression in PySpark now returns LogisticRegressionSummary rather than its subclass BinaryLogisticRegressionSummary, and the pyspark.ml.param.shared.Has* mixins no longer provide any set*(self, value) setter methods; use the respective self.set(self.*, value) instead.
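A small sketch of the multi-column support, using StringIndexer with illustrative data:

```python
from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("red", "S"), ("blue", "M"), ("red", "L")], ("color", "size"))

# Spark 3.0: one StringIndexer can index several columns in a single pass
indexer = StringIndexer(inputCols=["color", "size"],
                        outputCols=["color_idx", "size_idx"])
indexer.fit(df).transform(df).show()
```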
SparkR
SparkR interoperability is optimized by vectorizing gapply(), dapply(), createDataFrame, and collect(), improving their performance. SparkR also gains eager execution in the R shell and IDE, and an R API for Power Iteration Clustering.
Deprecated components
- Python 2 support deprecated
- Support for R versions below 3.4 deprecated
- UserDefinedAggregateFunction deprecated
All told, Spark 3.0 is a major release, with many new features, fixes for known issues, and significant performance improvements.
Since Python officially announced the end of maintenance for Python 2, the ecosystem has followed suit by dropping Python 2 support; anyone using Python in a project should consider moving straight to Python 3.