1. Overview

Deep learning offers far greater model expressiveness than traditional logistic regression, but it also demands a hundredfold increase in computing power. Compared with fields such as image, speech, and video, search, advertising, and recommendation scenarios have their own characteristics: the sample size and feature space are usually enormous. Hundreds of billions of samples and features are not uncommon, and large numbers of sparse features appear in the raw input. This requires us to design and optimize the deep learning framework around the computational characteristics of these scenarios.

The work described in this article was done in cooperation with the Alimama Basic Platform team and the PAI team. Based on TensorFlow, we carried out deep optimization and enhancement for search, advertising, and recommendation scenarios. The internal project is named TensorFlowRS, and its main achievements are as follows:

(1) It solves the poor horizontal scalability of native TF. In our tests, the vast majority of search-advertising models improved training performance by more than 10x, and some models improved their peak performance by as much as 100x.

(2) It supports complete online-learning semantics and exports model changes in real time. Sparse features no longer need contiguous IDs and can be trained directly on their raw representation, greatly simplifying feature engineering.

(3) A gradient-compensation optimizer for asynchronous training effectively reduces the accuracy loss caused by large-scale asynchronous concurrency.

(4) It integrates advanced training modes such as Graph Embedding, Memory Network, and Cross-Media training.

(5) The model visualization system DeepInsight provides multi-dimensional visual analysis of deep-model training.

 

2. TensorFlowRS Distributed Architecture

While using TensorFlow, we found that TF as a distributed training system has two main problems:

1. Poor horizontal scalability: in performance tests of most models, we found that as data parallelism increased, the sample-processing QPS of a single worker dropped sharply. When the number of workers grew beyond a certain scale, the overall QPS of the system stopped increasing or even declined.

2. Lack of a complete distributed failover mechanism: TF builds its cluster from a static topology configuration and does not support dynamic networking, so if a PS or worker goes down and restarts with a different IP or port (for example after a machine crash), training cannot continue. In addition, TF's checkpoint contains only the parameters stored on the servers, not the workers' state; it is not a globally consistent checkpoint and cannot provide basic failover semantics such as Exactly-Once.

 

TensorFlowRS's solutions to these problems include:

  • Improve horizontal scalability by connecting to an independent parameter server

    After detailed profiling of TF, we found that TF's native PS is hard to scale horizontally for various design and implementation reasons (gRPC, locking, graph engine). So we decided to drop the burden of TF-PS and re-implement a high-performance parameter server: PS-Plus. In addition, we provide a complete TF-on-PS-Plus solution that lets users switch freely between the native PS and PS-Plus and is fully compatible with TensorFlow's Graph semantics and all of its APIs. Users can place and serve parameters on PS-Plus without changing a single line of deep-network code, gaining high-performance parameter exchange and good horizontal scalability (a rough illustration follows this list).

  • The Failover mechanism is redesigned to support dynamic networking and Exactly-Once Failover

    TensorFlowRS introduces worker state: each worker's progress is stored in the checkpoint, so after a worker restarts, training continues from where it left off. In addition, TensorFlowRS generates the cluster configuration through ZooKeeper and supports failover under dynamic networking. The new failover mechanism ensures that when any role fails, the system completes failover within minutes with almost no data lost or recomputed.
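For reference, the sketch below shows what ordinary distributed TF 1.x graph code looks like when variables are placed on parameter servers through a device setter. The cluster addresses are placeholders and this is standard TensorFlow, not TensorFlowRS-specific code; the point is that user code at this level stays unchanged when parameters are served by PS-Plus instead of the native PS.

```python
import tensorflow as tf  # TF 1.x-style APIs

# Placeholder cluster definition; in TensorFlowRS the cluster configuration
# is generated dynamically (e.g. via ZooKeeper) rather than hard-coded.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222", "ps1:2222"],
    "worker": ["worker0:2222", "worker1:2222"],
})

# Variables are automatically assigned to ps tasks; compute ops stay on the worker.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    embedding = tf.get_variable("embedding", shape=[1000000, 8])
    dense_w = tf.get_variable("dense_w", shape=[8, 1])
```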

The overall architecture of TensorFlowRS is shown in the figure below:

3. PS-Plus

Compared with traditional ParameterServer, PS-Plus has the following features:

(1) High performance: through intelligent parameter allocation, zero copy, Seastar, and other techniques, PS-Plus further improves both the serving capacity of a single server and the horizontal scalability of the whole system. In our measurements, a single server on a 64-core machine can easily saturate 55+ cores, and in dense scenarios its I/O can saturate dual 25G NICs. The system as a whole scales nearly linearly in the range of 1 to 4,000 workers.

(2) High flexibility: PS-Plus has a complete UDF interface. Users can develop customized UDF plug-ins with the SDK and invoke them through simple C++ and Python interfaces.

(3) Complete online-learning support: PS-Plus supports the key features needed for online learning, such as non-ID feature training, dynamic feature addition and deletion, and incremental real-time model export.

The following points are selected for a more detailed introduction:

1. Intelligent parameter allocation

Variable placement determines how a parameter is sharded and placed across servers. The placement strategy has a significant impact on overall PS performance under high concurrency. In traditional parameter-server placement schemes, a few common placement algorithms (such as even sharding plus round-robin) are implemented in advance, or users partition parameters manually when creating them; global parameter sizes and server load are often not considered comprehensively.

PS-Plus implements a heuristic parameter-allocation strategy based on simulated annealing (a minimal sketch of the idea follows the list below), and a placement strategy based on runtime load with dynamic rebalancing is under consideration. The placement design of PS-Plus has the following advantages:

  • It takes the shape information of all global parameters into account and produces a near-optimal placement under CPU, memory, and network-bandwidth constraints, avoiding the imbalance and hot spots caused by manual allocation.

  • The entire allocation process is completed automatically by the system, so users get near-optimal performance without any configuration and do not need to know the details of the underlying PS implementation.

  • Partitioning is done automatically by the framework, so upper-layer algorithm code (e.g., TF code) does not need extra mechanisms such as PartitionedVariable, which keeps usage simple.
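As a rough illustration of the idea (not the actual PS-Plus algorithm, whose cost model also covers CPU, memory, and bandwidth), the following sketch uses simulated annealing to assign variables to servers so that per-server parameter load is balanced:

```python
import math
import random

def anneal_placement(var_sizes, num_servers, steps=20000, t0=1.0, t_min=1e-3):
    """Assign each variable to a server, minimizing load imbalance.
    var_sizes maps variable name -> parameter size (a simplified cost model)."""
    names = list(var_sizes)
    placement = {name: random.randrange(num_servers) for name in names}

    def cost(p):
        load = [0.0] * num_servers
        for name, server in p.items():
            load[server] += var_sizes[name]
        return max(load) - min(load)  # gap between hottest and coldest server

    cur_cost = cost(placement)
    t = t0
    for _ in range(steps):
        name = random.choice(names)
        old = placement[name]
        placement[name] = random.randrange(num_servers)
        new_cost = cost(placement)
        # Always accept improvements; accept worse moves with temperature-dependent probability.
        if new_cost > cur_cost and random.random() >= math.exp((cur_cost - new_cost) / t):
            placement[name] = old          # reject the worsening move
        else:
            cur_cost = new_cost            # accept the move
        t = max(t_min, t * 0.9995)         # cool down
    return placement

# Example (sizes are illustrative):
# placement = anneal_placement({"embedding": 7.5e9, "dense_w": 2048}, num_servers=20)
```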

2. Non-ID feature support

Mainstream deep learning frameworks store training parameters in contiguous memory and address a specific weight by its offset (ID value). To avoid wasting memory, features must therefore be encoded contiguously starting from 0, a process called feature ID mapping. Feature ID mapping is very complex; when the numbers of samples and features are very large, it consumes a great deal of time and machine resources and adds significant complexity to sample construction.

PS-Plus internally implements a customized hash map that is specially optimized for parameter-exchange scenarios, delivering very high performance while supporting dynamic addition and deletion of features. With this hash map, PS-Plus supports non-ID features directly, greatly simplifying sample construction.
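A minimal sketch of the idea, assuming a Python dict keyed by the raw (or hashed) feature value stands in for the server-side hash map; the class name and initialization scheme are illustrative, not the PS-Plus implementation:

```python
import numpy as np

class HashEmbeddingTable:
    """Embedding table keyed by the raw feature value instead of a contiguous ID."""

    def __init__(self, dim=8, init_scale=0.01):
        self.dim = dim
        self.init_scale = init_scale
        self.table = {}  # raw feature key (e.g. hash of "user_id=123") -> weight vector

    def lookup(self, raw_keys):
        rows = []
        for key in raw_keys:
            if key not in self.table:
                # New features are added dynamically; no global ID-mapping step is needed.
                self.table[key] = np.random.randn(self.dim) * self.init_scale
            rows.append(self.table[key])
        return np.stack(rows)

    def apply_gradients(self, raw_keys, grads, lr=0.01):
        for key, grad in zip(raw_keys, grads):
            self.table[key] -= lr * grad
```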

3. Communication layer optimization

For a parameter-server architecture, latency is an important factor in overall performance, especially when model complexity is not high: the computation part of the model is usually on the order of 10~100 ms, so overall communication latency becomes the key factor.

In a traditional pipelined thread model, interrupts and thread context switches under high concurrency cause significant overhead and large numbers of cache-line misses. In addition, frequent lock contention is one of the biggest sources of latency, and optimizations such as spin locks and read-write locks cannot effectively eliminate it. We concluded that polling plus run-to-completion was the right choice and designed our communication-layer architecture around it. In the new communication layer we use Seastar as the underlying framework. Every connection between a server and a worker is strictly bound to a fixed thread, and each thread is bound to a CPU core; requests and responses are processed directly by the current thread in run-to-completion fashion. The overall architecture is shown in the figure below:

 

On top of Seastar, we added many features and performance optimizations; a few are briefly introduced below.

  • External-thread interaction queue. Borrowing the cross-core interaction mechanism in Seastar, we provide an M:N lock-free producer-consumer queue for external threads to interact with threads inside Seastar. Compared with a traditional locked queue, performance is greatly improved.

  • Sequential scheduling of write requests. A write request handed from an external thread to Seastar's write interface via polling can leave the write buffer out of order. By reworking the queue mechanism, write order is guaranteed automatically, with essentially no loss of concurrent write performance across multiple connections.

  • Flexible codec layer. We provide a set of abstract codec-layer interfaces that are convenient to use, avoiding dependence on third-party serialization libraries such as Protobuf and sidestepping some of Protobuf's performance problems.

 

4. Performance Tests

We tested the performance of TensorFlowRS on two classic models, Dense and WDE (Wide-Deep-Embedding):

1. Model Description:

  • Dense

    Batch size: 100
    Input dimension: 1130
    Hidden units: 256, 128, 64, 32, 1

 

  • WDE

    Batch size: 100
    Deep: input dimension 310; hidden units 256, 128, 64, 32, 1
    Wide: input dimension 0.2B; output dimension 1
    Embedding: input dimension 0.5B / 7.5B; output dimension 8

 

2. Test Results:

 

3. Comparison of horizontal scalability between native TF and TFRS on the WDE model

 

5. Online learning

Online learning, represented by FTRL, has seen large-scale industrial adoption in recent years. It is a deep combination of engineering and algorithms that endows a model with the ability to capture changes in online traffic in real time, and it is of great value in scenarios that demand high timeliness.

Deep models have the same strong need for online learning as LR models, but mainstream deep learning frameworks currently lack support for it. Through its integration with PS-Plus, TensorFlowRS provides a complete end-to-end online-learning solution, enabling TF to support online training with hundreds of billions of non-ID features.

TFRS has been specifically designed and optimized for online learning scenarios, including:

1. Non-ID feature support

Mapping features to IDs in real time in an online-learning scenario is quite complicated: it requires a global ID generator with extremely high performance, which adds great complexity to sample generation. TensorFlowRS uses PS-Plus to support non-ID features directly, greatly simplifying real-time sample construction.

2. Dynamic feature additions and deletions

In online training, the training job runs for a long time as a service, and new features keep being added to the model during training. To ensure that long-running training does not eventually run out of memory because of new features, PS-Plus supports dynamic feature addition and also provides default feature-deletion policies: low-frequency or low-weight features can be deleted, and custom deletion policies can be implemented through UDFs according to business requirements.
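A sketch of what such deletion policies might look like, assuming the feature table and access counters are simple dicts; the function names and thresholds are illustrative and are not the PS-Plus UDF API:

```python
import numpy as np

def evict_low_frequency(table, freq, min_count):
    # Drop features whose access count since the last sweep is below min_count,
    # then reset the counters for the next window.
    for key in [k for k, c in freq.items() if c < min_count]:
        table.pop(key, None)
    freq.clear()

def evict_low_weight(table, min_norm):
    # Drop features whose learned weight vector has a norm below min_norm.
    for key in [k for k, w in table.items() if np.linalg.norm(w) < min_norm]:
        table.pop(key)
```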

3. Model incremental real-time export

There are two common ways to update an online-learning model: full and incremental. When the model has a huge number of parameters, full updates put great pressure on the bandwidth of the online serving system, while reducing the update frequency hurts model timeliness. PS-Plus can write the incremental part of the model to a message queue in real time at any frequency, greatly reducing network I/O while achieving truly real-time model updates.
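A sketch of the incremental-export idea: track which keys have changed since the last export and push only those to a message queue. The class name, the update hook, and the queue client (anything with a put method) are assumptions for illustration:

```python
class IncrementalExporter:
    def __init__(self, queue):
        self.queue = queue   # any message-queue client exposing put()
        self.dirty = set()   # keys updated since the last export

    def on_update(self, key):
        # Called whenever the server applies a gradient to `key`.
        self.dirty.add(key)

    def export(self, table):
        # Ship only the changed parameters; the online system applies the delta in place.
        delta = {k: table[k] for k in self.dirty if k in table}
        if delta:
            self.queue.put(delta)
        self.dirty.clear()
```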

4. AUC Decay

With online learning we want to detect anomalies in the model as early as possible during training, rather than waiting until the model has been pushed online. We therefore need ways to evaluate metrics such as AUC while training. TF's default streaming AUC implementation cannot reflect the current state of the model once a large amount of history has accumulated, so its feedback lags badly. We therefore introduced a new AUC computation mechanism, AUC Decay. AUC Decay is essentially a special moving average: it down-weights historical samples and historical model states in the current AUC computation through a time-based decay, so that the metric responds to model changes more quickly.
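A minimal sketch of an AUC-with-decay metric, assuming decayed score histograms for positives and negatives; the bucket count and decay rate are illustrative choices, not the TFRS implementation:

```python
import numpy as np

class AUCDecay:
    def __init__(self, num_buckets=10000, decay=0.999):
        self.decay = decay
        self.pos = np.zeros(num_buckets)  # decayed histogram of positive scores
        self.neg = np.zeros(num_buckets)  # decayed histogram of negative scores

    def update(self, labels, scores):
        # Decay old statistics so recent batches dominate the metric.
        self.pos *= self.decay
        self.neg *= self.decay
        scores = np.asarray(scores, dtype=float)          # scores assumed in [0, 1]
        idx = np.clip((scores * len(self.pos)).astype(int), 0, len(self.pos) - 1)
        for i, y in zip(idx, labels):
            if y > 0:
                self.pos[i] += 1.0
            else:
                self.neg[i] += 1.0

    def result(self):
        # AUC = P(score_pos > score_neg) + 0.5 * P(tie), computed from the histograms.
        total_pos, total_neg = self.pos.sum(), self.neg.sum()
        if total_pos == 0 or total_neg == 0:
            return 0.5
        cum_neg, auc = 0.0, 0.0
        for p, n in zip(self.pos, self.neg):
            auc += p * (cum_neg + 0.5 * n)
            cum_neg += n
        return auc / (total_pos * total_neg)
```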

 

6. Optimizing Convergence in Large-Scale Training Scenarios

1. Problem description

Training on big data calls for distributed parallel training. Synchronous parallel training is constrained by long-tail workers, so the degree of parallelism it can use is limited; asynchronous parallelism is the mainstream approach to fast training. However, asynchronous parallel training breaks the sequential nature of ordinary SGD: the gradient that is computed is no longer strictly consistent with the model it is applied to, which introduces the gradient-delay problem.

 

Specifically, in a parameter-server-based training framework, the system is divided into two roles: worker and PS. The PS is responsible for storing and updating shards of the model; a worker pulls the latest model from the PS, reads data, trains on it, and finally sends the learned gradient back to the PS so that the PS can update the model. Asynchronous concurrent training breaks the sequential nature of ordinary SGD and introduces gradient delay. As the figure shows, a worker pulls the model $w_t$ and computes the gradient $g(w_t)$; however, by the time this gradient is finally passed back to the PS, it is applied to the model $w_{t+r}$ on the PS. This is because, while the gradient was being computed, another r workers submitted gradient updates to the PS, so the model on the PS had already moved forward r steps. The gradient computed with model $w_t$ is thus applied to model $w_{t+r}$. Although its general direction may not deviate much, compared with the gradient $g(w_{t+r})$ that $w_{t+r}$ actually needs, the gradient $g(w_t)$ is slightly biased because it is somewhat stale. This is the origin of gradient delay in asynchronous training.

2. Gradient compensation

The DC-ASGD optimizer, proposed by Microsoft at ICML 2017, uses a Taylor expansion to approximate gradient compensation. Our tests showed good results at around 50-way concurrency. With hundreds of concurrent workers, however, the Taylor expansion exceeds its radius of convergence, so the approximation error grows and the benefit degrades.

For training with hundreds of concurrent workers, we introduce a factor correlated with g to boost mainstream SGD-based optimizers. We use the correlation between the model drift Δw (how far the model on the PS has moved while the gradient was being computed) and g to measure the severity of gradient delay. On each dimension, if Δw is positively correlated with -g, most workers are updating in the same direction; the model w has already gone far in that direction, so further movement should be cautious. We therefore keep the direction of g unchanged but reduce its absolute value. Conversely, if Δw is negatively correlated with -g, most workers are updating in the opposite direction; g is then a strong turning signal, indicating that the update direction of model w is about to change. We should heed this signal, so we keep the direction of g unchanged but increase its absolute value.

The introduction of correlation factors is based on the following analysis premises:

(1) In asynchronous training there is an implicit gradient momentum acceleration; see "Asynchrony Begets Momentum, with an Application to Deep Learning". The larger the concurrency, the larger the implicit momentum, which pushes the gradient too far in one direction.

(2) When w is not too stale, the correlation factor acts as a turning signal, suggesting that the model has moved too far under the momentum accumulated by many workers.

(3) As a trade-off, when w is too stale the accuracy of this signal decreases, and the coefficient λ should be reduced accordingly.

Because the correlation between Δw and g is generic, it can be combined with any mainstream SGD-based optimizer, meeting the concurrent-training needs of different optimizers in different scenarios.
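One possible reading of this boost, sketched below: scale each gradient dimension by (1 - λ) when the model drift agrees with -g, and by (1 + λ) when it disagrees. This is an illustrative interpretation under the stated assumptions, not the exact TFRS formula; `lam` plays the role of the coefficient λ mentioned above.

```python
import numpy as np

def boost_gradient(g, w_current, w_snapshot, lam=0.5):
    # Model drift accumulated on the PS while g was being computed.
    delta_w = w_current - w_snapshot
    # Dimensions where the drift points the same way as -g: workers agree, so shrink g.
    # Dimensions where it points the other way: turning signal, so enlarge g.
    agree = np.sign(delta_w) == np.sign(-g)
    scale = np.where(agree, 1.0 - lam, 1.0 + lam)
    return g * scale  # the direction of g is preserved; only its magnitude changes

# Sketch of use with plain SGD (pull/push helpers are hypothetical):
# w_snapshot = pull_model_from_ps()
# g = compute_gradient(w_snapshot, batch)
# w_current = pull_model_from_ps()   # the PS may have advanced r steps meanwhile
# push_update_to_ps(lr * boost_gradient(g, w_current, w_snapshot))
```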

 

3. Experimental results

We used the correlation factor to boost the SGD, Momentum, and AdaGrad algorithms, and ran experiments both in the production environment and on public datasets. The results are as follows:

  • WDE model

    parallelism | boosted-SGD AUC | boosted-Momentum AUC | boosted-AdaGrad AUC
    100         | +0.012%         | +0.01%               | +0.012%
    200         | +0.028%         | +0.045%              | +0.051%
    400         | +0.043%         | +0.064%              | +0.058%

 

  • Cifar10 AlexNet model

    parallelism | boosted-SGD accuracy | boosted-Momentum accuracy | boosted-AdaGrad accuracy
    30          | +0.43%               | +0.2%                     | +0.25%
    60          | +0.56%               | +0.25%                    | +0.46%

 

7. Advanced Training Modes

TFRS integrates a variety of advanced training modes, such as Graph Embedding, Memory Network, and Cross-Media training. Here we give only a brief introduction; a future article will elaborate in detail.

A graph is a data structure with strong expressive power, but it cannot be fed directly into a neural network. TFRS supports samples in graph form and supports multiple random-walk algorithms for dynamically generating positive and negative samples. Graph Embedding has already been applied in several projects, such as vectorized recall for Zhitongche search ads: by random walks over a heterogeneous directed graph of User, Query, and Item nodes, it generates sparse features that a deep neural network can process, and finally learns high-dimensional vector representations of User, Query, and Item for vectorized recall in online advertising. It is worth mentioning that, in addition to Graph Embedding, we also support learning the graph structure itself, for example adjusting edge weights in the graph via feedback during training.
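A small sketch of random-walk-based sample generation on a graph, assuming a plain adjacency-list dict; the window size and uniform negative sampling are illustrative choices rather than details from the source:

```python
import random

def walk_samples(adj, start, all_nodes, walk_len=10, window=2, num_neg=2):
    # Perform one random walk, then treat nodes co-occurring within `window`
    # steps as positive pairs and uniformly sampled nodes as negatives.
    walk = [start]
    for _ in range(walk_len - 1):
        nbrs = adj.get(walk[-1])
        if not nbrs:
            break
        walk.append(random.choice(nbrs))

    samples = []
    for i, u in enumerate(walk):
        for v in walk[max(0, i - window): i + window + 1]:
            if v != u:
                samples.append((u, v, 1))                        # positive pair
        for _ in range(num_neg):
            samples.append((u, random.choice(all_nodes), 0))     # negative pair
    return samples
```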

Memory Networks were first proposed by Facebook in 2015 for question answering. Before this model, machine-learning models lacked a component that could read and write external knowledge, which is a serious limitation for many tasks. For example, when given a set of facts or a story and then asked questions about it, models such as RNNs can in principle handle the task, but their memory (hidden state and weights) is usually too small and cannot accurately remember past facts. In the Alimama search advertising scenario, we use memory networks to model user behavior.

Compared with the common approach of generating memories during the sample-construction stage, TFRS introduces a dynamic memory storage module during training that supports both long-term and short-term memory, greatly improving training efficiency for sequential behavior data.

 

8. Model Visualization and Analysis System: DeepInsight

DeepInsight is a visual quality-assessment system for deep learning. It provides full visualization and analysis of model data during training, addressing problems such as model evaluation, analysis, and debugging, and improving the interpretability of deep models.

Here is an overfitting example that illustrates DeepInsight's role in model quality analysis and problem localization:

 

The figure above shows feature-weight distributions generated by DeepInsight. In the right-hand plot, the overfitted model's edge-weight distribution is very uneven: a large number of edges have very large weights, concentrated in a band-like region, and these are exactly the edges connected to one group of input features. This indicates that the model is overfitting to the information in that feature group. Since adding regularization terms and dropout did not resolve the overfitting, we eventually localized the problem to the input of that feature group.
