Preface
As 2019 draws to a close, it is time to look back on the past year and ahead to the next. Recently, my friends and I have been sorting out Spark's feature points as a basis for further study.
Main text
Wouldn't you want to bookmark a picture this big?
Spark Function Points (text version)
Spark Streaming
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data. Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets, and complex algorithms can be expressed with high-level operations like map, reduce, join, and window. Finally, processed data can be pushed out to file systems, databases, and live dashboards. In fact, Spark's machine learning and graph processing algorithms can also be applied to data streams.
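As a minimal illustration of the pieces named above (not part of the original mind map), here is a sketch of the classic network word count; it assumes a text stream arriving on localhost:9999 and a local master:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    // Batch interval of 1 second: each interval of received data becomes one RDD
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines  = ssc.socketTextStream("localhost", 9999)       // input DStream
    val words  = lines.flatMap(_.split(" "))                   // transformation
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _) // transformation
    counts.print()                                             // output operation

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // wait for the computation to stop
  }
}
```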
DStreams
Discretized Streams
A discretized stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input stream received from a source or the processed stream produced by transforming an input stream. Internally, a DStream is represented by a sequence of RDDs, Spark's abstraction of an immutable distributed dataset.
Each RDD in a DStream contains the data from one specific interval.
Input DStreams and Receivers
Receiver
Each input DStream is associated with a Receiver object, which receives the data from a source and stores it in Spark's memory for processing (see the custom receiver sketch after this list).
- reliable
- unreliable
Basic sources (Receivers provided by the StreamingContext API)
- File systems
- Socket connections
Advanced sources (custom Receivers)
- Kafka
- Flume
- Kinesis
Custom data sources
Queue of RDDs as a Stream
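The custom receiver sketch referenced above: a receiver only has to implement onStart/onStop and hand records to Spark via store(). This is a minimal, unreliable (no acknowledgement) example that assumes a plain text socket; the class name is illustrative.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a background thread that connects to the source and pushes data into Spark
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // The receiving thread stops itself once isStopped() returns true
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)            // hand the record over to Spark's memory
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

// Usage: val lines = ssc.receiverStream(new SocketLineReceiver("localhost", 9999))
```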
Output Operations on DStreams
HDFS
Databases
- HBase
- MySQL
Dashboards
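Writing to an external system such as HBase or MySQL is usually done through foreachRDD. A minimal sketch of the recommended pattern, continuing the word-count example above (ConnectionPool and its send method are hypothetical helpers, not Spark APIs): create connections per partition on the executors, never per record and never on the driver.

```scala
counts.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // One connection per partition, created on the executor
    val connection = ConnectionPool.getConnection()   // hypothetical pooling helper
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)       // return to the pool for reuse
  }
}
```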
DataFrame and SQL Operations
Transformations on DStreams
RDD-to-RDD transformations
Join operations
- Stream-stream joins
- Stream-dataset joins
Window operations
- reduceByWindow
- reduceByKeyAndWindow
- countByValueAndWindow
- ...
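A sketch of a windowed transformation, continuing the word-count example: count word occurrences over the last 30 seconds, recomputed every 10 seconds (both durations must be multiples of the batch interval).

```scala
import org.apache.spark.streaming.Seconds

val windowedCounts = words.map(word => (word, 1)).reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // reduce function applied within the window
  Seconds(30),                // window length
  Seconds(10))                // slide interval
windowedCounts.print()
```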
Graph processing algorithms
MLlib
Streaming ML algorithms
- Streaming linear regression
- Streaming k-means
- ...
Training a model offline, applying it online to the data stream
Training a model on the data stream
- The continuously trained model is applied online to the stream
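A sketch of MLlib's streaming linear regression: the model is updated continuously on a training stream and applied to a test stream. The paths and the feature dimension (3) are placeholders.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

val trainingData = ssc.textFileStream("/path/to/training").map(LabeledPoint.parse)
val testData     = ssc.textFileStream("/path/to/test").map(LabeledPoint.parse)

val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(3))   // assumes 3 features per record

model.trainOn(trainingData)  // keep updating the model as new batches arrive
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
```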
Fault-tolerance mechanisms
Caching / Persistence
- Persist (cache) to memory
- Setting the persistence level
Fault-tolerance semantics
Three kinds of guarantees
- At most once
- Exactly once
- At least once
Checkpointing
Metadata checkpointing
Saves the information that defines the streaming computation to fault-tolerant storage such as HDFS, so the application can recover from a failure of the node running the driver.
Data checkpointing
This is necessary for stateful transformations that combine data across multiple batches. In such transformations, each generated RDD depends on RDDs of previous batches, so the dependency chain keeps growing over time. To avoid an unbounded increase in recovery time (proportional to the length of the chain), the intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (such as HDFS) to cut the dependency chain.
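A sketch of how checkpointing is typically wired up so the driver can recover: the streaming context is built inside a factory function, and StreamingContext.getOrCreate rebuilds it from the checkpoint directory after a failure (the HDFS path is a placeholder).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-streaming-app"   // fault-tolerant storage

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  val ssc  = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)   // enable metadata (and data) checkpointing
  // ... define the input DStreams and transformations here ...
  ssc
}

// On a fresh start this calls createContext(); after a driver failure it
// reconstructs the context from the checkpoint data instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```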
Accumulators, Broadcast Variables, and Checkpoints
Accumulators and broadcast variables cannot be recovered from checkpoints in Spark Streaming. If checkpointing is enabled and accumulators or broadcast variables are used as well, lazily instantiated singleton instances of them must be created so that they can be re-instantiated after the driver restarts on failure.
- Create lazily instantiated singleton instances (see the sketch below)
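A sketch of the lazily instantiated singleton pattern for a broadcast variable (the object name and the broadcast contents are illustrative); an accumulator can be wrapped the same way.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object ExcludedWords {
  @volatile private var instance: Broadcast[Seq[String]] = null

  def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = sc.broadcast(Seq("a", "an", "the"))
        }
      }
    }
    instance
  }
}

// Inside an output operation, fetch it through the RDD's SparkContext so it is
// re-created after the driver restarts from a checkpoint:
// wordCounts.foreachRDD { rdd =>
//   val excluded = ExcludedWords.getInstance(rdd.sparkContext)
//   ...
// }
```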
Core
RDDs
RDD operators
- Transformations
- Actions
Parallelized collections
External datasets
- Local files
- HDFS
- Cassandra
- HBase
- S3
- Other Hadoop InputFormats
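A sketch of the two ways to create RDDs and of the transformation/action split, assuming an existing SparkSession named spark (paths are placeholders):

```scala
val sc = spark.sparkContext

// Parallelized collection: distribute a local collection across 4 partitions
val nums    = sc.parallelize(1 to 100, 4)
val squares = nums.map(x => x * x)        // transformation (lazy)
println(squares.reduce(_ + _))            // action (triggers execution)

// External dataset: any Hadoop-supported storage (local file, HDFS, S3, ...)
val lines = sc.textFile("hdfs:///data/input.txt")
println(lines.count())
```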
Persistence
Cache
Storage levels
- MEMORY_ONLY
- MEMORY_AND_DISK
- MEMORY_ONLY_SER (Java and Scala)
- MEMORY_AND_DISK_SER (Java and Scala)
- DISK_ONLY
- MEMORY_ONLY_2, etc.
- OFF_HEAP (off-heap memory)
Fault tolerance
- Lineage
LRU
- Removing cached data
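A sketch of explicit persistence with a chosen storage level, reuse across actions, and manual removal instead of waiting for LRU eviction (continuing from the RDD sketch above):

```scala
import org.apache.spark.storage.StorageLevel

val words = lines.flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spill to disk
// words.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)

words.count()      // first action materializes and caches the RDD
words.count()      // later actions reuse the cached partitions
words.unpersist()  // drop the cached data explicitly
```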
Closures
- Local vs. cluster modes
Shuffle operations
- Sort-based shuffle
Shared variables
Broadcast variables
Accumulators
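A sketch of the two kinds of shared variables: a read-only broadcast value shipped once to each executor, and an accumulator that tasks can only add to while the driver reads the total.

```scala
// Broadcast variable
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val mapped = sc.parallelize(Seq("a", "b", "a")).map(k => lookup.value.getOrElse(k, 0))
println(mapped.collect().mkString(","))

// Accumulator
val errorCount = sc.longAccumulator("errors")
sc.parallelize(1 to 10).foreach(i => if (i % 3 == 0) errorCount.add(1))
println(errorCount.value)   // read on the driver only
```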
Spark SQL
SparkSession
- Spark application configuration
- Creating DataSet/DataFrame
- SQL API
- SparkContext
- Catalog metadata
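A sketch of a SparkSession covering the points above: application configuration, creating a DataFrame, the SQL API, the underlying SparkContext, and the catalog metadata.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLExample")
  .master("local[*]")                              // application configuration
  .config("spark.sql.shuffle.partitions", "200")
  .getOrCreate()

import spark.implicits._                           // enables toDF/toDS and $"col" syntax

val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")  // create a DataFrame
df.createOrReplaceTempView("people")               // register for the SQL API
spark.sql("SELECT name FROM people WHERE id = 1").show()

println(spark.sparkContext.applicationId)          // the underlying SparkContext
spark.catalog.listTables().show()                  // catalog metadata
```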
Dataset/DataFrame
Creating Datasets/DataFrames
- From an RDD
- From external data sources (e.g. HDFS)
Data Sources
LOAD/SAVE operations
Columnar storage file formats
- Parquet files
- ORC files
Unstructured data files
- CSV files
- Text files
Semi-structured data files
- JSON files
Unstructured databases
- HBase
Structured databases
MySQL
- JDBC/ODBC
- Spark SQL CLI
Hive
- SparkSession.enableHiveSupport
- Spark SQL CLI
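A sketch of the generic load/save API across the source types listed above (all paths and connection details are placeholders):

```scala
// Columnar and text-based files
val parquetDF = spark.read.parquet("hdfs:///warehouse/events.parquet")
val jsonDF    = spark.read.format("json").load("hdfs:///raw/events.json")
val csvDF     = spark.read.option("header", "true").csv("hdfs:///raw/events.csv")

parquetDF.write.mode("overwrite").orc("hdfs:///warehouse/events_orc")

// Structured database over JDBC (MySQL shown; URL and credentials are placeholders)
val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/test")
  .option("dbtable", "users")
  .option("user", "root")
  .option("password", "secret")
  .load()
```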
Dataset/DataFrame/RDD conversions
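A sketch of converting between the three abstractions, assuming the SparkSession from above:

```scala
import spark.implicits._

case class Person(id: Long, name: String)

val rdd = spark.sparkContext.parallelize(Seq(Person(1, "alice"), Person(2, "bob")))
val df  = rdd.toDF()                                   // RDD -> DataFrame (schema by reflection)
val ds  = df.as[Person]                                // DataFrame -> Dataset[Person]
val back: org.apache.spark.rdd.RDD[Person] = ds.rdd    // Dataset -> RDD
```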
Dataset/DataFrame API
SQL statements
Catalog metadata
- Hive metastore
- In-memory
SQL language categories
- DDL
- DML
- DQL
- DCL
Built-in functions
- Single-row functions
- Aggregate functions
DSL statements (domain-specific language)
- .filter
- .select
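A sketch mixing DSL operators with built-in single-row and aggregate functions on the DataFrame created earlier:

```scala
import org.apache.spark.sql.functions._

df.filter($"id" > 0)
  .select($"name", upper($"name").as("upper_name"))   // single-row built-in function
  .groupBy($"name")
  .agg(count("*").as("cnt"), max($"id").as("max_id")) // aggregate functions
  .show()
```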
Schema API
- Showing the schema
- Defining a schema
UDF
UDAF
- Untyped UDAF
- Typed UDAF
UDTF
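A sketch of a simple UDF registered both for the DSL and for SQL (typed/untyped UDAFs and UDTFs take more code and are omitted here):

```scala
import org.apache.spark.sql.functions.udf

// DSL-side UDF
val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)
df.select(toUpper($"name").as("upper_name")).show()

// SQL-side UDF
spark.udf.register("to_upper", (s: String) => if (s == null) null else s.toUpperCase)
spark.sql("SELECT to_upper(name) FROM people").show()
```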
Spark SQL on Hive
- Hive is already deployed
- Hive is not deployed
Performance Tuning
- Caching Data In Memory
- Other Configuration Options
- Broadcast Hint for SQL Queries
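A sketch of two of the tuning points above: caching a table in memory and hinting a broadcast join so the small side avoids a shuffle (largeDF/smallDF are placeholders):

```scala
import org.apache.spark.sql.functions.broadcast

spark.catalog.cacheTable("people")      // Caching Data In Memory
// spark.catalog.uncacheTable("people") // release it when no longer needed

// Broadcast hint: ship smallDF to every executor instead of shuffling both sides
val joined = largeDF.join(broadcast(smallDF), Seq("key"))
```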
Spark SQL technical points
Execution process
Logical Plan
SQL Parse (UnresolvedLogicalPlan)
- DDLParser
- SqlParser
- SparkSQLParser
Catalyst Analyzer
- ResolvedLogicalPlan
Catalyst Optimizer
- OptimizedLogicalPlan
SparkPlanner
- Parent class: QueryPlanner
Strategies
- LeftSemiJoin
- HashJoin
- PartialAggregation
- BroadcastNestedLoopJoin
- TakeOrdered
Physical Plan
prepareForExecution
- Executable PhysicalPlan
execute
- SchemaRDD
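The phases above can be inspected directly: explain(extended = true) prints the parsed (unresolved) logical plan, the analyzed logical plan, the optimized logical plan, and the selected physical plan.

```scala
val q = spark.sql("SELECT name, count(*) AS cnt FROM people GROUP BY name")
q.explain(extended = true)   // shows each Catalyst phase for this query
```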
Performance optimization
Reduce or avoid shuffles
- Unified partitioning strategy
- Broadcast mechanism
Data skew
- Custom Partitioner
Caching data in memory
Core configuration properties
- spark.driver.cores
- spark.driver.memory
- spark.executor.cores
- spark.executor.memory
- spark.sql.shuffle.partitions
Delta Lake
Data lake
MLlib
Spark AI
GraphX
Spark Graph
Structured Streaming
Streaming Dataset/Streaming DataFrame
Creating streaming Datasets/DataFrames
File source
- Generating a stream from files
- Automatic partition discovery from the file/folder layout
Kafka source
- Creating a streaming query
- Creating a batch query
Socket source
Rate source
SQL query
Window Operations on Event Time
- Window generation rules
- How windows work
Watermark
- Update mode
- Append mode
- Complete mode (requires an aggregation)
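A self-contained sketch of an event-time window with a watermark, using the built-in rate source so it runs without external systems (the window sizes, watermark delay, and checkpoint path are placeholders):

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

val events = spark.readStream
  .format("rate")                  // test source producing (timestamp, value) rows
  .option("rowsPerSecond", "10")
  .load()

val counts = events
  .withWatermark("timestamp", "10 minutes")                // how late data may arrive
  .groupBy(window($"timestamp", "5 minutes", "1 minute"))  // sliding event-time window
  .count()

val query = counts.writeStream
  .outputMode("update")            // Update mode: only changed window counts are emitted
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/rate-window-demo")
  .start()
query.awaitTermination()
```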
Streaming Deduplication
Join Operations
- Stream-static joins
- Stream-stream joins
Triggers
- Trigger.ProcessingTime(time interval)
- Trigger.Once()
- Trigger.Continuous(checkpoint interval)
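A sketch of setting a trigger on the streaming aggregation from the previous sketch (the other two trigger types are shown as comments):

```scala
import org.apache.spark.sql.streaming.Trigger

val triggeredQuery = counts.writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))   // micro-batch every 10 seconds
  // .trigger(Trigger.Once())                      // one micro-batch, then stop
  // .trigger(Trigger.Continuous("1 second"))      // continuous mode, 1 s checkpoints
  .format("console")
  .start()
```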
Managing and monitoring streaming queries
- Managing Streaming Queries
- Monitoring Streaming Queries
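A sketch of the query-management handles: each started query exposes an id, status, and progress, and all active queries are reachable through spark.streams.

```scala
val query = counts.writeStream.format("console").start()

println(query.id)            // unique identifier of this query
println(query.status)        // whether the query is active, waiting for data, ...
println(query.lastProgress)  // most recent progress: input rate, batch duration, ...

spark.streams.active.foreach(q => println(s"${q.id} -> ${q.name}"))  // all running queries
query.stop()                 // stop the query gracefully
```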
Checkpoint
Continuous Processing
Sources
- Kafka source
- Rate source
Sinks
- Kafka sink
- Memory sink
- Console sink
Monitoring
Monitoring via the StreamingListener interface
- Receiver status
- Processing time
Job Web UI
Streaming tab
When a StreamingContext is used, the Spark Web UI shows an additional Streaming tab.
Running receivers
- Receiver status
- Number of records received
- Receiver errors
Completed batches
- Batch processing time
- Queueing delay
Port 4040, etc.
The UI is served on port 4040 while the application is running; if 4040 is already in use, the port number is incremented (4041, 4042, 4043, ...).
- Stages and tasks
- RDD sizes and memory usage
- Environment information
- Running executors
spark.eventLog.enabled
- true
Spark’s History Server
Port
- 18080 (default)
Kerberos
Access control
- ACLs
REST API
ip:18080/api/v1
- History service
- app-id
ip:4040/api/v1
- Running tasks
Metrics system
Dropwizard Metrics library
- metrics.dropwizard.io/4.4.1
Sinks
- HTTP
- JMX
- CSV
- Ganglia (licensing restrictions: LGPL license, requires a custom build)
Advanced analysis
OS profiling tools
- dstat
- iostat
- iotop
JVM tools
- jstack
- jmap
- jstat
- jconsole
Deployment modes
Local
YARN
Mesos
Standalone
Kubernetes
Programming interface (API)
Scala
Java
Python
R
SQL
SQL here does not refer to a specific programming language. Spark SQL is the Spark module for structured data processing; Spark 2.x introduced an ANSI SQL parser to provide parsing of standardized SQL.
Built-in Funcs
- spark.apache.org/docs/latest…
Shell
P.S.: The GraphX and MLlib components are not expanded in the figure.
Note: the original mind-map image is too large to stay legible when uploaded, so the text version is listed instead.