The core concept of Spark is the RDD, and one of the RDD’s key features is immutability, which avoids complex concurrency problems in a distributed environment. For data analysis this abstraction works well: it handles distribution elegantly, keeps the semantics of the various operators simple, and delivers high-performance distributed data processing.

In machine learning, however, the RDD’s weaknesses soon become apparent. The core of machine learning is iteration and parameter updating. Thanks to its in-memory computing model, the RDD handles iteration well; but its immutability is a poor fit for repeatedly updating parameters. This intrinsic mismatch is a major reason why Spark’s MLlib library has evolved slowly, with little substantial innovation and mediocre performance since 2015.

For this reason, Angel gave Spark priority when designing its ecosystem. By the time V1.0.0 was released, Spark on Angel was already available. Building on Angel, it adds parameter server (PS) functionality to Spark, introducing an element of mutability into Spark’s world of immutable constants, like adding wings to a tiger.

Taking L-BFGS as an example, we will analyze the problems in implementing machine learning algorithms on Spark, and how Spark on Angel removes the bottlenecks that Spark’s machine learning tasks run into, making machine learning on Spark more powerful.

1. L-BFGS algorithm description

The parameter-update process of the L-BFGS model is as follows:
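The formula image from the original post is not reproduced here; for reference, the standard textbook L-BFGS update is

w_{k+1} = w_k - \alpha_k H_k \nabla f(w_k)

where the product H_k \nabla f(w_k) is never formed explicitly: the two-loop recursion reconstructs it from the last m curvature pairs s_i = w_{i+1} - w_i and y_i = \nabla f(w_{i+1}) - \nabla f(w_i), with the step size \alpha_k chosen by line search.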


2. Spark implementation of L-BFGS

2.1 Implementation Framework
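The original framework diagram is not reproduced here. In outline, and consistent with the analysis in sections 2.2 and 3.2 below: the driver holds the model vector w and runs the two-loop recursion locally; the executors compute gradients over their partitions of the training data; per-iteration gradients are aggregated back to the driver (typically via treeAggregate), and the updated w is broadcast to the executors for the next iteration.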

2.2 Performance Analysis

The Spark-based L-BFGS implementation has obvious advantages:

HDFS I/O – Spark can quickly read and write training data on HDFS.

Fine-grained load balancing – Gradient calculation is fully parallelized, and Spark’s powerful parallel scheduling mechanism ensures that tasks execute quickly.

Fault tolerance – When a compute node or task fails, Spark recomputes the lost data from the RDD’s DAG lineage. For an iterative algorithm, however, an RDD action should be performed in each iteration to truncate the RDD’s DAG, avoiding the logical disorder that unbounded recomputation would cause; see the sketch below.
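A minimal, self-contained sketch of that pattern (the loop body, checkpoint cadence, and path are illustrative, not taken from the original code):

import org.apache.spark.{SparkConf, SparkContext}

object LineageTruncation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lineage-truncation"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // illustrative path

    var data = sc.parallelize(1 to 1000000).map(_.toDouble)
    for (iter <- 1 to 10) {
      data = data.map(_ * 0.99).persist()
      if (iter % 5 == 0) data.checkpoint() // cut the DAG so a failure replays at most 5 iterations
      val loss = data.sum()                // action: materializes this iteration's RDD
      println(s"iter=$iter loss=$loss")
    }
    sc.stop()
  }
}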

3. Spark on Angel implementation of L-BFGS

3.1 Implementation Framework

Spark on Angel uses Angel’s PS-Service to introduce the PS role into Spark, reducing the whole algorithm’s dependence on the driver. The two-loop recursion is moved onto the PS, and the driver is only responsible for task scheduling, which greatly reduces the demands on driver performance.

Angel PS consists of a group of distributed nodes. Each vector and matrix is split into multiple partitions stored on different nodes, and operations between vectors and matrices are supported.
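A rough sketch of what this looks like from the Spark side, assuming Angel’s spark-on-angel entry points (PSContext.getOrCreate, PSVector.dense, and PSVector.duplicate; exact names, signatures, and import paths vary across Angel versions):

import com.tencent.angel.spark.context.PSContext
import com.tencent.angel.spark.models.vector.PSVector

PSContext.getOrCreate(sc)      // start or attach the PS alongside this Spark application
val w = PSVector.dense(dim)    // a dim-dimensional vector partitioned across the PS nodes
val g = PSVector.duplicate(w)  // another vector with the same dimension and partitioning
// Arithmetic between w and g (dot, axpy, add, ...) executes on the PS nodes, not the driver.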

3.2 Performance Analysis

The driver is only responsible for scheduling tasks, while the complex two-loop recursion runs on the PS. Gradient aggregation and model synchronization happen between the executors and the PS, so every operation becomes distributed. For network transmission, a high-dimensional PSVector is cut into small data blocks that are sent to their target nodes; this many-to-many transfer between nodes greatly speeds up gradient aggregation and model synchronization. In this way, Spark on Angel completely avoids Spark’s single-point driver bottleneck and the cost of shipping high-dimensional vectors over the network.

4. Spark on Angel

Spark on Angel is a “plug-in” designed by Angel to remedy Spark’s shortcomings in machine learning model training. It is an independent framework that makes no “intrusive” modifications to Spark. Its characteristics can be summarized as “light”, “easy”, “strong” and “fast”.

4.1 Light – “plug-in” framework

Spark on Angel makes no intrusive changes to Spark’s RDDs. It is a framework that depends on Spark and Angel, yet its own logic is decoupled from both. As a result, Spark users can adopt Spark on Angel simply by making three changes to the Spark submission script; for details, see the Spark on Angel Quick Start document on GitHub.

As the submission script below shows, a Spark on Angel task is still, in essence, a Spark task, and the whole task executes the same way a Spark task does.

source ${ANGEL_HOME}/bin/spark-on-angel-env.sh

$SPARK_HOME/bin/spark-submit \
    --master yarn-cluster \
    --conf spark.ps.jars=$SONA_ANGEL_JARS \
    --conf spark.ps.instances=20 \
    --conf spark.ps.cores=4 \
    --conf spark.ps.memory=10g \
    --jars $SONA_SPARK_JARS \
    ...

Spark on Angel can be such a lightweight framework because Angel’s encapsulation of the PS-Service lets Spark’s driver and executors interact with Angel PS through PSAgents and PSClient.

4.2 Strong – Powerful and supports the Breeze library

The Breeze library is a machine-learning-oriented numerical library implemented in Scala. Most of Spark MLlib’s numerical optimization algorithms are implemented by calling Breeze. Both the Spark and the Spark on Angel implementations call breeze.optimize.LBFGS, as shown below; Spark’s implementation is LBFGS[DenseVector], while Spark on Angel’s is LBFGS[BreezePSVector].

BreezePSVector refers to a vector on Angel PS that implements Breeze’s NumericOps methods, such as the common dot, scale, axpy, and add operations. Therefore, the high-dimensional vector arithmetic in the LBFGS[BreezePSVector] two-loop recursion is arithmetic between BreezePSVectors, and all of it runs distributed on Angel PS.

In Spark’s L-BFGS implementation, the optimizer runs over DenseVector. In Spark on Angel’s implementation, the Vector generic in the interface call simply changes from DenseVector to BreezePSVector.
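The original code screenshots are not reproduced here; below is a hedged sketch of that generic swap (costFun, psCostFun, dim, and initWeightsPS are assumed to be defined elsewhere, as in section 4.3):

import breeze.linalg.DenseVector
import breeze.optimize.LBFGS

// Spark: optimize over a local Breeze DenseVector held on the driver.
val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 10)
val wStar = lbfgs.minimize(costFun, DenseVector.zeros[Double](dim))

// Spark on Angel: only the generic type changes; the vector lives on the PS.
// (Angel supplies the implicit vector space that breeze.optimize.LBFGS needs.)
val psLbfgs = new LBFGS[BreezePSVector](maxIter = 100, m = 10)
val psWStar = psLbfgs.minimize(psCostFun, initWeightsPS)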

4.3 Easy – Simple programming interface

Another reason Spark is so popular in big data is that it is easy to program and understand, and Spark on Angel inherits this. A Spark on Angel task is essentially a Spark task, and the overall code logic is the same as Spark’s; when you need to operate on a PSVector, you simply call the corresponding interface.

The following code shows the implementation of L-BFGS on Spark and on Spark on Angel. The overall structure of the two is the same; the main differences are the aggregation of the gradient vector and the pull/push of the model. Therefore, transforming a Spark algorithm into a Spark on Angel task requires changing only a small amount of code.

L-BFGS requires the user to implement DiffFunction. The input parameter of DiffFunction’s calculate interface is w; the method traverses the training data and returns the loss and gradient.

For the full code, see SparseLogistic on GitHub.

The DiffFunction implementation in Spark:
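The original screenshot is not reproduced here; below is a minimal sketch of the Spark-side DiffFunction for logistic regression (trainData, dim, and numSamples are illustrative names, and the loss/gradient math follows the standard logistic loss rather than the exact SparseLogistic source):

import breeze.linalg.DenseVector
import breeze.optimize.DiffFunction
import org.apache.spark.rdd.RDD

def sparkCostFun(trainData: RDD[(Double, DenseVector[Double])],
                 dim: Int, numSamples: Long): DiffFunction[DenseVector[Double]] =
  new DiffFunction[DenseVector[Double]] {
    def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
      val bcW = trainData.sparkContext.broadcast(w)
      // Aggregate (loss, gradient) across all partitions back to the driver.
      val (lossSum, cumGradient) =
        trainData.treeAggregate((0.0, DenseVector.zeros[Double](dim)))(
          seqOp = { case ((loss, grad), (label, x)) =>
            val margin = -1.0 * (bcW.value dot x)
            val multiplier = (1.0 / (1.0 + math.exp(margin))) - label
            grad += x * multiplier
            (loss + math.log1p(math.exp(margin)) - (1.0 - label) * margin, grad)
          },
          combOp = { case ((l1, g1), (l2, g2)) => (l1 + l2, g1 += g2) })
      (lossSum / numSamples, cumGradient /= numSamples.toDouble)
    }
  }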

In Spark on Angel’s DiffFunction implementation, the input parameter of the calculate interface is w; the method traverses the training data and returns loss and cumGradient, where both w and cumGradient are BreezePSVectors. To compute the gradient, each executor first pulls w from the remote PS to get a local copy; the locally computed gradient value is then pushed to the cumGradient vector on the remote PS via the PSVector’s incrementAndFlush.
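A matching sketch of the Spark on Angel version; incrementAndFlush is named above, while component, pull, duplicate, toBreeze, and pointGradient are assumed names for Angel’s spark-on-angel API and a hypothetical per-point helper, not the exact SparseLogistic source:

val psCostFun = new DiffFunction[BreezePSVector] {
  def calculate(w: BreezePSVector): (Double, BreezePSVector) = {
    val cumGradient = PSVector.duplicate(w.component) // accumulator lives on the PS (assumed API)
    val lossSum = trainData.mapPartitions { points =>
      val localW = w.component.pull()                 // pull a local copy of w to this executor
      val localGrad = new Array[Double](dim)          // dense local gradient buffer
      var loss = 0.0
      points.foreach { case (label, x) =>
        loss += pointGradient(localW, label, x, localGrad) // hypothetical per-point helper
      }
      cumGradient.incrementAndFlush(localGrad)        // push the local gradient to the remote PS
      Iterator.single(loss)
    }.sum()
    (lossSum / numSamples, cumGradient.toBreeze)      // loss plus the PS-resident gradient
  }
}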

4.4 Fast – Strong performance

We implemented LR with the SGD, L-BFGS, and OWLQN optimization methods respectively, and compared them experimentally on Spark and Spark on Angel. The experimental code can be found on GitHub as SparseLRWithX.scala (X being the optimization method).

Data set: a data set from one of Tencent’s internal businesses, with 230 million samples and 50 million dimensions.

Experimental Settings:

Note 1: The resource allocation for the three groups of comparison experiments is as follows. We tried to ensure that every task ran with sufficient resources, so the configured resources exceed actual needs.

Note 2: When running the Spark tasks, the spark.driver.maxResultSize parameter needs to be increased; Spark on Angel does not require this parameter.

As the data above shows, Spark on Angel achieves more than a 50% speedup over Spark for LR model training, and the more complex the model, the larger the speedup.

5. Conclusion

The emergence of Spark on Angel overcomes Spark’s bottlenecks in machine learning efficiently and at low cost. We will continue to optimize Spark on Angel and improve its performance, and we welcome you to join us on GitHub.

GitHub: Angel project. If you like it, give us a star on GitHub.