Apache Spark (spark.apache.org/) is a big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 at AMPLab at the University of California, Berkeley, was open sourced in 2010, and was donated to the Apache Software Foundation in 2013.
Spark is an alternative to MapReduce and is compatible with HDFS and Hive. It can be integrated into the Hadoop ecosystem to make up for MapReduce's shortcomings.
1. Characteristics
Reference: spark.apache.org/
- Spark can run Hadoop cluster applications up to 100 times faster in memory, and up to 10 times faster even on disk.
- Spark lets developers quickly write programs in Java, Scala, or Python. It comes with a collection of more than 80 high-level operators. It can also be used to query data interactively in the shell.
- In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Developers can use each capability on its own or combine them in a single data pipeline.
- It runs anywhere: on Hadoop, on Kubernetes (K8s), standalone, or in the cloud, and it can access many kinds of data sources.
2. The built-in libraries
2.1 Spark SQL
Spark SQL is the module used by Apache Spark to process structured data.
features
- Seamlessly mix SQL queries with Spark applications.
results = spark.sql("SELECT * FROM people")
# In PySpark 2.x and later, map() lives on the underlying RDD, not on the DataFrame
names = results.rdd.map(lambda p: p.name)
- Connect to any data source in the same way.
# createOrReplaceTempView() replaces the deprecated registerTempTable()
spark.read.json("s3n://...").createOrReplaceTempView("json")
results = spark.sql("""SELECT * FROM people JOIN json ...""")
- Run SQL or HiveQL queries on existing data warehouses.
- Connect via JDBC or ODBC.
- Spark SQL includes a cost-based optimizer, columnar storage, and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine, which provides full mid-query fault tolerance. There is no need to switch to a different engine for historical data.
2.2 Spark Streaming
Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
features
- Easy to use. Build applications through APIs.
- Fault tolerance out of the box. Spark Streaming recovers both lost work and operator state without any extra code.
- Combine streaming with batch and interactive queries.
Spark Streaming can read data from HDFS, Flume, Kafka, Twitter and ZeroMQ. In production, Spark Streaming uses ZooKeeper and HDFS for high availability.
2.3 Spark MLlib
MLlib is Apache Spark's scalable machine learning library.
features
- Easy to use. Available for Java, Scala, Python, and R.
- High-quality algorithms, up to 100 times faster than MapReduce.
- Runs anywhere: on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access different data sources.
MLlib contains a number of algorithms and utilities.
ML algorithms include:
Classification: logistic regression, naive Bayes, ...
Regression: generalized linear regression, survival regression, ...
Decision trees, random forests, and gradient-boosted trees
Recommendation: alternating least squares (ALS)
Clustering: K-means, Gaussian mixtures (GMMs), ...
Topic modeling: latent Dirichlet allocation (LDA)
Frequent itemsets, association rules, and sequential pattern mining
The ML workflow tools include:
Feature transformations: standardization, normalization, hashing, ...
ML Pipeline construction
Model evaluation and hyperparameter tuning
ML persistence: saving and loading models and Pipelines
Other tools include:
Distributed linear algebra: SVD, PCA, ...
Statistics: summary statistics, hypothesis testing, ...
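To make the feature transformations in the workflow list more concrete, the sketch below shows in plain Python what standardization, normalization, and feature hashing compute. It does not use MLlib's API: the function names are hypothetical, the population standard deviation is used for simplicity (a library may follow a different convention), and `zlib.crc32` is only a stand-in for whatever hash function a real implementation uses.

```python
import math
import zlib


def standardize(values):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        # A constant column carries no spread; map it to all zeros
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def normalize(vector):
    """Scale a single feature vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)


def hash_features(tokens, num_buckets=8):
    """Map tokens to a fixed-length count vector using a stable hash."""
    vec = [0] * num_buckets
    for tok in tokens:
        bucket = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[bucket] += 1  # colliding tokens share a bucket
    return vec
```

Standardization works per column across many rows, normalization works per row, and hashing turns a variable-length list of tokens into a vector whose length is fixed in advance, at the cost of occasional collisions.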
2.4 Spark GraphX
GraphX is Apache Spark's API for graphs and graph-parallel computation.
features
- Flexibility: Work seamlessly with both graphs and collections. Efficiently transform and join graphs with RDDs, and write custom iterative graph algorithms using the Pregel API.
- Speed: Performance comparable to the fastest specialized graph processing systems.
- Algorithms: In addition to its highly flexible API, GraphX comes with a variety of graph algorithms, many of which were contributed by users. Well-known algorithms include PageRank, connected components, label propagation, SVD++, strongly connected components, and triangle count.
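GraphX itself is used from Scala, so the following is not GraphX code. It is a plain-Python sketch of the iterative idea behind one of the algorithms it ships, PageRank, under simple assumptions: a small in-memory edge list, a damping factor of 0.85, and a fixed number of iterations. The function name `pagerank` and its parameters are chosen for this example only.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                # Each node passes an equal share of its rank along every out-link
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank
```

Each iteration only needs a node's current rank and its out-links, which is the kind of per-vertex, message-passing computation that a Pregel-style API is designed to distribute across a cluster.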