Starting from the native TensorFlow 1.x architecture and interfaces, Meituan's internal deeply customized version of TensorFlow has been optimized along multiple dimensions, including large-scale sparse parameter support, training modes, distributed communication, pipelining, and operator optimization and fusion. In recommendation system scenarios, distributed scalability has improved by more than 10x, and performance per unit of computing power has also improved significantly; the system is widely used in Meituan's internal businesses. This article introduces the related optimization and practical work.

1 Background

TensorFlow (TF) is an open-source deep learning framework launched by Google and is widely used in Meituan's recommendation system scenarios. However, the official version's support for industrial-scale scenarios is not ideal, and in the process of putting it into large-scale production, Meituan encountered the following challenges:

  • All parameters are expressed as Variables. For sparse parameters at the scale of tens of billions, this allocates a huge amount of memory up front and wastes resources.
  • Distributed scaling is only practical up to a few hundred workers; scalability to thousands of workers is poor.
  • Online Learning cannot be supported, because dynamic addition and deletion of large-scale sparse parameters and incremental export are not supported.
  • When running on large clusters, slow nodes and outages may occur; because the framework layer cannot handle them, the task fails.

These issues are less about TensorFlow's design and more about the underlying implementation. Considering compatibility with Meituan's large number of existing business practices and with the community, we performed deep customization on top of the native TensorFlow 1.x architecture and interfaces, covering large-scale sparse parameter support, training modes, distributed communication optimization, pipeline optimization, and operator optimization and fusion, so as to solve the core pain points of this scenario.

First, in terms of capacity, the new system achieves near-linear scaling for models with 100 billion parameters and distributed training with thousands of workers; a year's worth of sample data can be trained in one day; and Online Learning is supported. At the same time, the new system's architecture and interfaces are more user friendly, and it is used by Meituan's internal business units, including Meituan Takeout, Meituan Preferred, Meituan Search, the advertising platform, Dianping Feeds, and others. This article focuses on the work on large-scale distributed training optimization, and we hope it is helpful or inspiring.

2 Large-scale Training Optimization Challenges

2.1 Challenges brought by business iteration

With the development of Meituan's business, the scale and complexity of recommendation system models are also increasing rapidly, specifically:

  • Training data: training samples increased from 10 billion to 100 billion, an increase of nearly 10x.
  • Sparse parameters: the number of features increased from several hundred to several thousand, also nearly 10x; the total number of parameters grew from several hundred million to tens of billions, an increase of 10~20x.
  • Model complexity: models are increasingly complex, and single-step computation time has increased by more than 10x.

For high-traffic businesses, a single training experiment grew from a few hours to several days, and keeping one experiment within a day is a basic requirement in this scenario.

2.2 System Load Analysis

2.2.1 Problem Analysis Toolchain

TensorFlow is a very large open source project with millions of lines of code. The native system's monitoring metrics are too coarse and do not support global monitoring, making it difficult to locate complex performance bottlenecks. Based on CAT[2], Meituan's open-source monitoring system, we built fine-grained monitoring for the full TensorFlow training link (as shown in Figure 1 below), which can accurately locate performance bottlenecks.

At the same time, performance optimization involves a lot of performance testing and result analysis, which is also very labor-intensive work. We abstracted an automated experiment framework (as shown in Figure 2 below) that can run multi-round experiments automatically, collect the various monitoring indicators, and then generate reports.

2.2.2 Load Analysis from the Business Perspective

In recommendation system scenarios, we use the TensorFlow Parameter Server[3] (PS for short) asynchronous training mode to support business-level distributed training. What kind of load changes do the business changes above bring to this architecture? See Figure 3 below:

In summary, the load mainly consists of communication pressure, PS concurrency pressure, and Worker computation pressure. For a distributed system, load problems are usually addressed by horizontal scaling. Although this appears to solve the problem, experimental results show that when the number of PS instances grows beyond a certain point, single-step training time increases instead, as shown in Figure 4 below:

The core reason for this result is that a Worker's single training step must complete communication with all PS instances synchronously, and every additional PS adds N communication links, which greatly increases link latency (as shown in Figure 5 below). A single training job executes millions or even tens of millions of steps, so the accumulated link latency outweighs the benefit of PS concurrency.

For this system, the core difficulty of optimization lies in how to optimize distributed computing with limited PS instances.

3 Optimization Practice

3.1 Introduction to large-scale sparse parameters

In a recommendation system model, most parameters are sparse parameters. The key operation on sparse parameters is Embedding, which is usually the heaviest part of the load and the focus of subsequent optimization. Because we redefined how sparse parameters are stored, and the subsequent optimizations build on this, we first introduce this part of the work.

To build the Embedding module in native TensorFlow, users need to create a Variable large enough to hold all sparse parameters and then learn the Embedding on top of that Variable. However, Embedding training with Variable has many disadvantages:

  • The size of the Variable must be set in advance. For scenarios with tens of billions of sparse parameters, this setting causes a huge waste of space.
  • Training speed is slow, and customized optimizations for sparse models cannot be applied.

We first solved the basic capability problem: a HashTable replaces the Variable, with the sparse feature ID as the Key and the Embedding vector as the Value. Compared with native Variable-based Embedding, this approach has the following advantages:

  1. The size of a HashTable can be automatically scaled during training, avoiding redundant storage space and reducing usage costs.
  2. A series of customized optimizations have been implemented for the HashTable scheme. Compared with Variable, training speed is greatly improved, models with hundreds of billions of parameters can be trained, and scalability is good.
  3. Thanks to the dynamic scaling of sparse parameters, we support Online Learning on this basis.
  4. The API design is compatible with the community version and is almost identical to the native Variable in usage, so the migration cost is extremely low.

A simplified version of the implementation based on the PS architecture is shown in Figure 6 below:

The core process can be roughly divided into the following steps:

  1. Sparse feature IDs (we usually complete unified encoding in advance) enter the Embedding module and, with the help of TensorFlow's built-in send-recv mechanism, are pulled to the PS side. Operators such as Lookup on the PS side then query the underlying HashTable and assemble the Embedding vectors.
  2. The Embedding vectors are pulled back to the Worker for subsequent training, and the gradients of these parameters are computed through back propagation; these gradients are then pushed back to the optimizer on the PS side.
  3. The optimizer on the PS side first calls the Find operator to obtain, from the HashTable, the original sparse parameter vectors corresponding to the gradients together with the associated optimizer state; it then runs the optimization algorithm to compute the updated Embedding vectors and optimizer state, and finally writes them back into the HashTable via the Insert operator. A minimal sketch of this Find/Insert flow is given after this list.
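To make the Find/Insert flow concrete, here is a minimal sketch in which a plain Python dict stands in for the PS-side HashTable; for brevity the update is plain SGD rather than the actual optimizer, and names such as `SparseEmbeddingTable`, `find`, and `insert` are illustrative rather than the real operator implementations.

```python
import numpy as np

class SparseEmbeddingTable:
    """Toy stand-in for the PS-side HashTable: sparse feature ID -> embedding row."""

    def __init__(self, dim):
        self.dim = dim
        self.table = {}                      # key: feature ID, value: np.ndarray of shape (dim,)

    def find(self, ids):
        """Lookup with create-on-miss, so the table grows dynamically during training."""
        rows = []
        for i in ids:
            if i not in self.table:          # new feature IDs get an initialized row
                self.table[i] = np.random.normal(0.0, 0.01, self.dim)
            rows.append(self.table[i])
        return np.stack(rows)

    def insert(self, ids, values):
        """Write updated rows back, as the Insert operator does after optimization."""
        for i, v in zip(ids, values):
            self.table[i] = v

def apply_gradients(table, ids, grads, lr=0.01):
    """PS-side update for one step: Find current rows, apply SGD, Insert them back."""
    params = table.find(ids)
    table.insert(ids, params - lr * grads)

emb = SparseEmbeddingTable(dim=8)
ids = [1001, 42, 1001]                       # duplicate IDs are common in sparse features
vectors = emb.find(ids)                      # forward lookup on the PS
apply_gradients(emb, ids, np.ones_like(vectors))   # backward update pushed to the PS
```

Because the table is an ordinary hash map, its capacity does not have to be declared up front, which is exactly the property that enables dynamic growth, deletion, and incremental export for Online Learning.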

3.2 Distributed Load Balancing Optimization

This part of the optimization is a classic direction in distributed computing. The PS architecture is a typical "bucket model": to complete one training step, the Worker needs to interact with all PS instances, so load balance across PS instances is very important. In practice, however, we found that the time consumed by different PS instances is not balanced. The causes include the load imbalance introduced by the simple round-robin parameter placement of the TensorFlow PS architecture, as well as imbalance caused by heterogeneous machines.

For recommendation models, our main optimization strategy is to automatically and evenly shard all sparse parameters and large dense parameters across the PS instances, which solves most of these problems. In practice we also found a problem that was difficult to troubleshoot: the implementation of the native Adam optimizer caused unbalanced PS load. More on this below.

In the Adam optimizer, the parameter update requires two β values. In the native TensorFlow implementation, these two β values are shared by all Variables (or HashTables) optimized by the same optimizer instance, and they are placed on the same PS as the first Variable (by lexicographic order of names). This creates a problem: each optimizer has only one β_1 and one β_2, located on a single PS. During parameter optimization, that PS therefore receives far more requests than the others and becomes a performance bottleneck.

By examining Adam's update rule, however, we can see that β_1 and β_2 are constants, and the computation that depends on them (the part highlighted in blue) is relatively independent and can be completed on each PS separately. Based on this observation, the optimization is straightforward: we create redundant copies of the β parameters on every PS and compute t and the step-size factor α locally, eliminating the PS hotspot caused by the uneven load.
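For reference, the quantities that depend on β_1 and β_2 are scalars: the bias-corrected step size α_t = lr·√(1−β_2^t)/(1−β_1^t) needs only the two constants and the step counter t. The sketch below (illustrative names, not the TensorFlow optimizer code) shows why keeping a redundant copy of the betas on each PS lets α_t be computed locally with no cross-PS traffic.

```python
import math

def local_adam_step_size(lr, beta1, beta2, t):
    """alpha_t = lr * sqrt(1 - beta2^t) / (1 - beta1^t).

    beta1 and beta2 are constants and t is a scalar step counter, so every PS can
    hold its own copy and compute alpha_t independently; no single PS has to serve
    the shared beta accumulators to all the others, which removes the hotspot.
    """
    return lr * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)

# Each PS applies the same formula to the parameter shards it owns.
alpha_t = local_adam_step_size(lr=1e-3, beta1=0.9, beta2=0.999, t=100)
```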

This improvement is general and effective. In one business model within Meituan, removing the β hotspot improved performance by about 9%. In addition, by eliminating the global dependence on β, this optimization also improves the scalability of the PS architecture, yielding a better speedup than before when scaling out the number of Workers.

3.3 Communication Optimization

As can be seen from the analysis in Section 2.2, the communication pressure of the system is also very high, and we mainly optimized communication based on RDMA. First, a brief introduction to RDMA: compared with the traditional communication path through the socket TCP/IP protocol stack, RDMA provides zero copy and kernel bypass, which reduces both network latency and CPU usage, making it well suited to the communication patterns of deep learning training.

RDMA has three protocol implementations: InfiniBand, RoCE (v1, v2), and iWARP. In Meituan's deep learning scenarios, the RDMA communication protocol used is RoCE v2. At present, in deep learning training, especially dense-model training scenarios (NLP, CV, etc.), RDMA has become standard for large-scale distributed training. However, for large-scale sparse model training, open-source systems have very limited RDMA support: TensorFlow's Verbs[4] communication module has not been updated for a long time, and its performance is not ideal. We made many improvements on top of it.

On the public 1TB Click Logs[5] dataset with the DLRM[6] model and more than 100 Workers, the optimized version improved performance by 20%~40%. In several Meituan business models, the communication layer based on our modified TensorFlow Seastar[7] also delivered a 10%~60% speedup. We have also contributed our work back to the community.

3.3.1 Memory Registration Optimization

RDMA supports three data transmission modes: SEND/RECV, WRITE, and READ. In WRITE and READ, the data sender directly reads or writes remote memory, and the receiver is not aware of the operation; they are suitable for bulk data transmission. Within TensorFlow, RDMA-based data transfer uses the one-sided WRITE mode.

Before RDMA can transfer data, a memory region must be created in advance and registered with the NIC as a Memory Region (MR), so that the NIC can operate on this memory directly. Creating new memory and registering it with the device takes time. Figure 9 shows the time required to register memory regions of different sizes with the NIC; the registration time grows rapidly as the registered memory size increases.

In the community TensorFlow RDMA implementation, Tensors are still created with the unified BFC Allocator, and every created Tensor is registered as an MR. As mentioned above, MR registration has a performance cost, and high-frequency, large MR registrations cause significant performance degradation. In fact, among all Tensors in training, only those involved in cross-node communication need MR registration; the rest do not. The optimization is therefore straightforward: we identify and manage the communication Tensors and register MRs only for Tensors that are transferred across nodes.
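The sketch below illustrates the idea only (it is not the actual TensorFlow allocator code): an allocator wrapper registers memory with the NIC only when the caller marks the allocation as a cross-node communication buffer. `RdmaNic.register_mr` is a hypothetical placeholder for the costly verbs registration call.

```python
class RdmaNic:
    """Hypothetical stand-in for the NIC registration API (the real call is ibv_reg_mr and is costly)."""

    def register_mr(self, buf):
        # Pretend to pin the buffer and return an MR handle.
        return ("mr", id(buf))

class SelectiveMrAllocator:
    """Register an MR only for buffers that will actually cross the network."""

    def __init__(self, nic):
        self.nic = nic
        self.mr_by_buf = {}

    def allocate(self, nbytes, for_communication=False):
        buf = bytearray(nbytes)              # ordinary allocation for local-only Tensors
        if for_communication:                # only cross-node Tensors pay the MR cost
            self.mr_by_buf[id(buf)] = self.nic.register_mr(buf)
        return buf

alloc = SelectiveMrAllocator(RdmaNic())
local_tensor_buf = alloc.allocate(4096)                          # no MR registration
send_tensor_buf = alloc.allocate(4096, for_communication=True)   # MR registered once
```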

3.3.2 RDMA static allocator

The RDMA static allocator is an extension of the MR registration optimization above. Memory Registration optimization removes MR registration for Tensors that do not need to be transferred, reducing the number of registrations. However, large-scale training in sparse scenarios often involves hundreds or thousands of Workers training in parallel, which brings new problems:

  • In the PS architecture, PS and Worker act as client and server to each other. Taking the PS side as an example, when the number of Workers grows to the thousands, the frequency of MR registration on the PS side becomes very high, increasing the cost of memory allocation and registration.
  • Because the shape of the output Tensor of the same operator may change between steps in sparse scenarios, the created MRs have poor reusability, leading to high memory fragmentation and repeated MR registration overhead.

To solve the above problems, we introduced an MR static allocator strategy.

The core design ideas are as follows:

  1. Although the Shape of the same operator's output Tensor may change between steps in sparse scenarios, the range of variation is controllable. Through monitoring and analysis, we can find a relatively stable memory size that satisfies the storage needs of the Tensor across many steps.
  2. Based on this information, we changed the original strategy of registering an MR per Tensor (per request): we allocate a large block of memory once, register it with the NIC, and then allocate from this block with our own allocation strategy. This greatly reduces the frequency of MR registration; only one MR registration is required for the entire training process.
  3. We introduced a simple exchange protocol that packs the Tensor's Shape and Data together and writes them to the Client. The Client parses the Tensor size from the protocol and then reads the Data, avoiding the repeated Tensor Shape negotiation present in the native implementation.

In the implementation, we introduce an Allocation Analysis module. At the beginning of training, it analyzes the allocation history to derive the actual size of the MR to pre-allocate and the reserved space for each Tensor. We then pause training and start the construction of the Allocator, including creating the MR and synchronizing the information between the two sides of the communication. The Key of the resulting Map is the unique identifier of a Tensor, and the Info structure contains the local address pointer, the offset, the size, the ibv_send_wr structure, and so on. Training then resumes, and subsequent Tensor transfers are sent and received through the static MR, without the repeated negotiation caused by Shape changes.
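Here is a minimal sketch of the idea, under the assumption that a short profiling phase yields a stable per-Tensor byte budget: one large pre-registered region is carved into fixed per-Tensor slots keyed by a Tensor token, and each message carries (Shape, Data) so the receiver needs no extra negotiation. Names such as `StaticMrAllocator` and the `pack`/`unpack` helpers are illustrative, not the actual implementation.

```python
import struct
import numpy as np

class StaticMrAllocator:
    """Carve one pre-registered region into fixed slots, one per communication Tensor."""

    def __init__(self, slot_bytes_by_token):
        # slot_bytes_by_token comes from the Allocation Analysis phase
        # (observed max bytes per Tensor, plus headroom for Shape changes).
        self.offsets, total = {}, 0
        for token, nbytes in slot_bytes_by_token.items():
            self.offsets[token] = (total, nbytes)
            total += nbytes
        self.region = bytearray(total)   # registered with the NIC exactly once

    def slot(self, token):
        off, nbytes = self.offsets[token]
        return memoryview(self.region)[off:off + nbytes]

def pack(tensor):
    """Prefix the payload with rank and dims so the receiver can parse the size itself."""
    header = struct.pack("<I", tensor.ndim) + struct.pack(f"<{tensor.ndim}I", *tensor.shape)
    return header + tensor.astype(np.float32).tobytes()

def unpack(buf):
    ndim = struct.unpack_from("<I", buf, 0)[0]
    shape = struct.unpack_from(f"<{ndim}I", buf, 4)
    data = np.frombuffer(buf, dtype=np.float32, offset=4 + 4 * ndim,
                         count=int(np.prod(shape)))
    return data.reshape(shape)

# Usage: the sender writes pack(t) into the static slot; the receiver unpacks in place.
alloc = StaticMrAllocator({"emb_out": 1 << 16})
payload = pack(np.ones((3, 8), dtype=np.float32))
alloc.slot("emb_out")[:len(payload)] = payload
received = unpack(bytes(alloc.slot("emb_out")))
```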

3.3.3 Multi RequestBuffer and CQ Load Balancing

The TensorFlow community RDMA implementation transfers not only Tensor data but also control messages, whose sending and receiving also use the ibv_post_send and ibv_post_recv primitives. The native control-flow implementation has several bottlenecks that limit control-flow throughput in large-scale training and therefore affect data transfer efficiency. Specifically:

  • Multiple requests share a single RequestBuffer, so control-flow messages are effectively sent serially: the next request can only be written after the Ack from the peer arrives, limiting the rate at which requests can be sent and processed.
  • On the Client side, the RDMA Completion Queue must be polled to learn about request arrivals and state changes. The native implementation has only one Completion Queue polled by a single thread, which limits response efficiency in large-scale distributed training.

We adopted Multi RequestBuffer and CQ load balancing to solve the above problems, removing the throughput bottlenecks on both request sending and completion handling.
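The following is a conceptual sketch of the two fixes, with plain Python queues and threads standing in for RDMA request buffers and Completion Queues (none of the class or function names below come from the actual implementation): multiple request buffers let several control messages be in flight at once, and several poller threads each drain their own completion queue.

```python
import queue
import threading
import time

class MultiRequestBuffer:
    """Round-robin over several buffers so control messages need not wait on one Ack."""

    def __init__(self, num_buffers=4):
        self.buffers = [queue.Queue(maxsize=1) for _ in range(num_buffers)]
        self.next = 0

    def post(self, msg):
        buf = self.buffers[self.next]                     # a full buffer only blocks its own slot,
        self.next = (self.next + 1) % len(self.buffers)   # not every outstanding request
        buf.put(msg)

def cq_poller(cq, handler, stop):
    """One polling thread per completion queue, instead of one thread for a single CQ."""
    while not stop.is_set():
        try:
            handler(cq.get(timeout=0.1))
        except queue.Empty:
            pass

stop = threading.Event()
cqs = [queue.Queue() for _ in range(4)]                   # several CQs spread the completions
pollers = [threading.Thread(target=cq_poller, args=(cq, print, stop)) for cq in cqs]
for p in pollers:
    p.start()

req_buf = MultiRequestBuffer()
for i in range(4):
    req_buf.post(f"request {i}")                          # several requests in flight at once
cqs[0].put("completion: request 0 acked")                 # handled by that CQ's own poller
time.sleep(0.3)
stop.set()
for p in pollers:
    p.join()
```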

3.3.4 Send-Driven & Rendezvous-Bypass

Readers familiar with TensorFlow's PS architecture know that after the whole computation graph is split into a Worker part and a PS part, an asynchronous data exchange mechanism based on Rendezvous is established to exchange data between the two subgraphs, as shown in Figure 12 below:

Based on the graph-partitioning logic, a Recv operator indicates that this side of the computation needs a Tensor, while the producer of that Tensor sits behind the paired Send operator on the other device.

In the concrete implementation, TensorFlow realizes a Recv-driven data exchange mode. As shown in the figure above, the two computation graphs on DeviceA and DeviceB execute asynchronously and concurrently. When the Recv on DeviceB executes, it issues an RPC request to DeviceA. After receiving the request, DeviceA routes it to the Rendezvous: if the required data has already been produced and registered by the Send operator, the data is fetched locally and returned to DeviceB; if the data has not yet been produced, the Recv request from DeviceB is registered in the Rendezvous, and when DeviceA later produces the data, its Send operator finds the registered Recv and triggers a callback that returns the data to DeviceB.

We can see that the Rendezvous mechanism elegantly solves the problem of data exchange between producers and consumers running at different paces. However, the Recv-driven mode also introduces two potential problems:

  • According to our observations on real business models, the proportion of Recv operators waiting for Send operators in the Rendezvous is roughly the same as the proportion of Send operators waiting for Recv operators. In the latter case, the data could in principle be sent to the peer the moment the Send is ready; however, because of how the mechanism is implemented, the data is only pulled back after the Recv operator arrives, which lengthens the communication process.
  • As a data exchange hotspot, Rendezvous has non-trivial internal logic overhead.

To solve the problems above, we implemented another data exchange mode on top of RDMA, called the Send-driven mode. In contrast to the Recv-driven mode, as the name implies, the Send operator writes the data directly to the Recv side, which receives the data and registers it in its local Rendezvous; the Recv operator then obtains the data directly from the local Rendezvous. The specific process is shown in Figure 13 below:

As can be seen from the figure, compared with the Recv-driven mode, the communication process of the Send-driven mode is greatly simplified. In addition, because data is sent as soon as it is ready, the Rendezvous on the sender side is bypassed, and data acquisition at the consumer end is accelerated when the producer runs ahead of the consumer.
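Below is a toy sketch of the Send-driven idea, using an in-memory table as the consumer's local Rendezvous (class and function names are illustrative): the producer pushes the value into the consumer's local Rendezvous as soon as it is ready, so the consumer's Recv becomes a purely local lookup instead of a remote pull.

```python
import threading

class LocalRendezvous:
    """Consumer-side table keyed by (step, tensor_name); Recv blocks until data arrives."""

    def __init__(self):
        self.data = {}
        self.cond = threading.Condition()

    def register(self, key, value):
        # Called by the remote Send in send-driven mode: the write lands here directly.
        with self.cond:
            self.data[key] = value
            self.cond.notify_all()

    def recv(self, key):
        # Recv operator: local lookup only, no RPC back to the producer.
        with self.cond:
            while key not in self.data:
                self.cond.wait()
            return self.data.pop(key)

consumer_rdv = LocalRendezvous()

def producer_send(key, value):
    # Send-driven: push the payload to the consumer the moment it is produced,
    # instead of waiting for the consumer's Recv request to pull it.
    consumer_rdv.register(key, value)

threading.Thread(target=producer_send, args=(("step0", "emb"), [1.0, 2.0])).start()
print(consumer_rdv.recv(("step0", "emb")))   # consumer obtains the data locally
```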

3.4 Latency Optimization

This part of the optimization is also a classic direction in distributed computing: the whole processing link can be continuously mined for steps to streamline, merge, and overlap. In addition, compared with other systems, machine learning systems can use approximate algorithms for part of this work, achieving larger performance gains. Below are some of our optimization practices in both areas.

3.4.1 Sparse domain parameter aggregation

After HashTables are used to store sparse parameters, some accompanying parameters also need to be reimplemented with HashTables, so multiple HashTables and a large number of related operators appear in the computation graph. In practice, we found that the number of Lookup/Insert operators should be reduced as much as possible, both to reduce PS load and to reduce RPC QPS. Therefore, for the common usage patterns of sparse models, we carried out the corresponding aggregation work.

Taking the Adam optimizer as an example, it needs to create two slots, m and v, to store the momentum information used in optimization, with the same Shape as the Embedding. In the native optimizer, these two Variables are created separately and are read and written during the backward gradient update. Similarly, with the HashTable scheme, we would need to create two separate HashTables to train the m and v parameters. As a result, in the forward and backward passes we would need to perform a Lookup and an Insert for the Embedding, m, and v respectively, three Lookups and three Inserts in total.

The optimization here is to aggregate the Embedding, m, v, and the low-frequency filtering counter (see the Counting HashTable in Figure 14 below) together as the Value of one HashTable. In this way, the operations on sparse parameters can be aggregated and executed together, which greatly reduces the operation frequency on sparse parameters and relieves PS pressure.
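The sketch below illustrates the aggregated layout, assuming an Adam-style update: each HashTable value is a single contiguous row holding the embedding, the m and v slots, and a frequency counter, so one Find and one Insert per step replace the three of each that separate tables would need. The class name and slot layout are illustrative, not the production format.

```python
import numpy as np

class AggregatedEmbeddingTable:
    """Value layout per key: [embedding (dim) | m (dim) | v (dim) | counter (1)]."""

    def __init__(self, dim):
        self.dim = dim
        self.table = {}

    def find(self, ids):
        # One lookup returns the embedding, both optimizer slots, and the counter together.
        return np.stack([self.table.setdefault(i, np.zeros(3 * self.dim + 1))
                         for i in ids])

    def insert(self, ids, rows):
        for i, row in zip(ids, rows):
            self.table[i] = row

def adam_step(table, ids, grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    rows = table.find(ids)                               # a single Find instead of three
    d = table.dim
    emb, m, v, cnt = rows[:, :d], rows[:, d:2*d], rows[:, 2*d:3*d], rows[:, 3*d:]
    m = b1 * m + (1 - b1) * grads
    v = b2 * v + (1 - b2) * grads ** 2
    alpha = lr * np.sqrt(1 - b2 ** t) / (1 - b1 ** t)
    emb = emb - alpha * m / (np.sqrt(v) + eps)
    table.insert(ids, np.concatenate([emb, m, v, cnt + 1], axis=1))   # a single Insert
```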

This feature is a general optimization. With aggregation enabled, training speed improves significantly, and the improvement remains positive as the model and Worker scale change. In a real Meituan business model, the aggregated version is about 45% faster than the non-aggregated one.

3.4.2 Embedding pipeline optimization

In industrial production, an assembly line is a production method in which each production unit focuses on only one segment of the work, improving efficiency and output. In computing, pipelining is better known as a parallelization technique that overlaps the execution of tasks. For example, in a typical RISC processor, a program consists of a large number of instructions, and executing one instruction can be roughly divided into instruction fetch, decode, execute, memory access, and write-back stages, which use different hardware units such as the instruction cache, data cache, registers, and ALU. Within each instruction cycle, the hardware units of these five stages work in parallel, making full use of the hardware and improving the instruction throughput of the whole processor. The processor instruction pipeline is a complex, systematic low-level technique, but its idea is also widely used in distributed deep learning frameworks, for example:

  • If distributed training is abstracted simply into computation and communication, the vast majority of mainstream deep learning frameworks support overlapping communication and computation when executing the computation graph DAG.
  • If model training is divided simply into forward and backward passes, effective parallelization within a single step is impossible due to their strong dependence. The communication scheduling introduced in BytePS[8] breaks the barrier between iterations: after part of the previous round's parameters have been updated, the forward computation of the next round can start early, improving forward-backward overlap from a global perspective.
  • To solve the problem that, during GPU training in CTR scenarios, the parameters reside in host memory while the computation runs on the GPU, Baidu AIBox[9] cleverly schedules different hardware devices and builds a two-stage pipeline: a parameter preparation stage dominated by CPU/host memory/NIC and a network computation stage dominated by GPU/NVLink. Overlapping the two stages yields higher training throughput.

We can see that in the design of deep learning framework, by analyzing scenes, parallel stages can be explored from different perspectives to improve the overall training throughput.

For large-scale sparse model training, the core model flow is: first perform the Embedding of the sparse parameters, then run the dense sub-network. The sparse Embedding is executed on the remote PS and mainly consumes network resources, while the dense sub-network is executed on the local Worker and mainly consumes compute resources. These two parts take up most of the time of the whole process, accounting for roughly 40+% and 50+% respectively on one actual Meituan business model.

So can we execute the sparse-parameter Embedding in advance, overlapping communication with computation and hiding this part of the time? From the system perspective it is certainly feasible, but algorithmically it introduces parameter staleness, which may affect model accuracy. However, in real production scenarios, large-scale asynchronous training itself already introduces a lag of tens to hundreds of steps. Our tests show that fetching sparse parameters one or two steps ahead does not affect model accuracy.

In the concrete implementation, we split the whole computation graph into two subgraphs, the Embedding Graph (EG) and the Main Graph (MG), which are executed asynchronously and independently to achieve overlap (the whole splitting process can be transparent to users). EG mainly covers extracting the Embedding Keys from the samples, querying and assembling the Embedding vectors, and updating the Embedding vectors. MG mainly covers the computation of the dense sub-network, gradient computation, and the update of the dense parameters.

The interaction between the two subgraphs is: EG passes the Embedding vectors to MG (from MG's perspective, reading values from a dense Variable), and MG passes the gradients of the Embedding parameters back to EG. Both interactions are expressed as TensorFlow computation graphs. We use two threads and two Sessions to execute the two graphs concurrently, overlapping the two stages to achieve higher training throughput.
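Here is a minimal sketch of the pipeline idea, with plain Python threads standing in for the two TF Sessions and a dict-of-queues ExchangeManager keyed by a global batch ID (all names are illustrative, not the production interfaces): the EG thread fetches embeddings one step ahead while the MG thread consumes the previous step's result, so the two stages overlap and the staleness stays bounded.

```python
import queue
import threading

class ExchangeManager:
    """Cross-graph exchange medium keyed by the global batch ID."""

    def __init__(self):
        self.embeddings = {}   # batch_id -> queue holding the looked-up embeddings
        self.gradients = {}    # batch_id -> queue holding the embedding gradients

    def _chan(self, table, batch_id):
        return table.setdefault(batch_id, queue.Queue(maxsize=1))

    def put_embedding(self, batch_id, emb):
        self._chan(self.embeddings, batch_id).put(emb)

    def get_embedding(self, batch_id):
        return self._chan(self.embeddings, batch_id).get()

    def put_gradient(self, batch_id, grad):
        self._chan(self.gradients, batch_id).put(grad)

    def get_gradient(self, batch_id):
        return self._chan(self.gradients, batch_id).get()

def embedding_graph(ex, num_steps):
    """EG thread: lookup runs one step ahead of MG, then applies the previous gradient."""
    for batch_id in range(num_steps):
        ex.put_embedding(batch_id, f"emb[{batch_id}]")   # sparse lookup on the remote PS
        if batch_id > 0:
            _ = ex.get_gradient(batch_id - 1)            # apply the (slightly stale) gradient
    _ = ex.get_gradient(num_steps - 1)

def main_graph(ex, num_steps):
    """MG thread: dense forward/backward, consuming EG's output for the same batch ID."""
    for batch_id in range(num_steps):
        emb = ex.get_embedding(batch_id)                 # read as if from a dense Variable
        ex.put_gradient(batch_id, f"grad[{batch_id}]")   # hand the gradient back to EG

ex = ExchangeManager()
threads = [threading.Thread(target=embedding_graph, args=(ex, 4)),
           threading.Thread(target=main_graph, args=(ex, 4))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```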

The figure above shows the architecture of the Embedding pipeline. Intuitively, it consists of the sample distribution module on the left, the cross-Session data exchange module on the top, and the Embedding Graph and Main Graph produced by automatic graph splitting. Blue circles represent newly added operators, orange arrows the main EG data flow, blue arrows the main MG data flow, and red arrows the sample data flow.

  1. An abstraction layer called Pipeline Dataset is introduced transparently to users. It exists to satisfy the need for the EG and MG graphs to run at different paces and supports custom configuration. In addition, to keep the data within the pipeline matched, a global Batch ID is generated and registered here. The Pipeline Dataset exposes two kinds of iterators, one for EG and one for MG, and the layers underneath the Pipeline Dataset are shared with the native TensorFlow Dataset.
  2. The ExchangeManager at the top is a static, cross-Session data exchange medium that exposes data registration and data pulling capabilities. This module is abstracted out because EG and MG originally belonged to one computation graph and were split into two for the pipeline, so a cross-Session data exchange mechanism with exact matching is needed. Internally it uses the global Batch ID as the Key, manages data such as sample data, Embedding vectors, Embedding gradients, and the Index produced by Unique, and is responsible for the lifecycle management of these data.
  3. The Embedding Graph in the middle runs in an independent thread with an independent TF Session. After obtaining the sample data through operator a, it extracts the feature IDs and performs the HashTable-based sparse parameter query, and the query results are placed into the ExchangeManager by operator c. EG also contains operator f for the backward update, which obtains the Embedding gradients and their accompanying forward parameters from the ExchangeManager and executes the gradient-update logic.
  4. The Main Graph at the bottom is responsible for the computation of the actual dense sub-network. We inherit and implement a trainable EmbeddingVariable; its construction (operator d) looks up the matching Embedding vectors from the ExchangeManager, encapsulates them into the EmbeddingVariable, and feeds them to the dense sub-network. In addition, in the backward path registered for the EmbeddingVariable, we add operator e, which puts the Embedding gradients into the ExchangeManager for consumption by operator f in EG.

Through the above design, we have built a controllable EG/MG concurrent pipeline training mode. Overall, the Embedding pipeline training mode brings the following benefits:

  • Profiling of several business models shows that the time ratio of EG to MG is around 3:7 or 4:6. By overlapping these two stages, the Embedding stage can be effectively hidden, so that the MG network computation can almost always start immediately, greatly accelerating overall model training.
  • When multiple optimizers (sparse and non-sparse) are used, the TensorFlow engine builds parts of the backward computation graph repeatedly, which adds extra computation. Splitting into two subgraphs avoids exactly this problem.
  • During implementation, the ExchangeManager not only handles the exchange of Embedding parameters and gradients but also manages the reuse of metadata; for example, it saves the results of Unique, which further reduces duplicate computation.

In addition, the API design is transparent to users: a single line of code enables the Embedding pipeline, and the EG/MG splitting process is hidden from users. At present, in Meituan's business training, the Embedding pipeline achieves a 20%~60% performance improvement on the CPU PS architecture (and the larger the Worker concurrency, the better the improvement).

3.5 Single-instance PS Concurrency Optimization

According to the analysis in Section 2.2, we cannot keep expanding the number of PS instances to improve the throughput of distributed tasks, so concurrency optimization within a single PS instance is also a very important direction. Our main optimization work is as follows.

3.5.1 High-performance HashTable

In the PS architecture, large-scale sparse model training places high demands on concurrent reads and writes of the HashTable, because each PS must bear the requests of hundreds or even thousands of Workers. With both speed and stability in mind, we chose tbb::concurrent_hash_map[10] as the underlying HashTable implementation and wrapped it as a new TBBConcurrentHashTable operator. In tests at the 100-billion-parameter scale, TBBConcurrentHashTable trains 3 times faster than the native MutableDenseHashTable.

3.5.2 HashTable BucketPool

For large-scale sparse model training, the Embedding HashTable faces a large number of concurrent operations. Profiling showed that frequent dynamic memory allocations incur a large performance overhead (even though TensorFlow's Tensor has a dedicated memory allocator). We therefore optimized the HashTable's memory management based on the idea of memory pooling.

When the HashTable is initialized, we create two BucketPools, one for Keys and one for Values, and each pool mallocs a large chunk of memory up front. Considering that Keys and Values may be removed from the HashTable (for example during Online Learning training), the memory they used needs to be reclaimed, so each BucketPool also has a ReuseQueue that manages reclaimed memory. Every time a Key and Value are inserted into the internal hash table data structure, their memory allocation and release go through the pool. This reduces the sparse memory allocation overhead encountered in large-scale sparse training and improves end-to-end training performance by about 5%.
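A minimal sketch of the BucketPool idea follows (names and sizes are illustrative): one large block is pre-allocated and handed out in fixed-size buckets, and a ReuseQueue recycles buckets freed when keys are deleted, so steady-state inserts perform no malloc at all.

```python
from collections import deque

class BucketPool:
    """Fixed-size bucket allocator backed by one pre-allocated block plus a reuse queue."""

    def __init__(self, bucket_bytes, capacity):
        self.bucket_bytes = bucket_bytes
        self.capacity = capacity
        self.block = bytearray(bucket_bytes * capacity)   # one big allocation up front
        self.next_free = 0
        self.reuse_queue = deque()                        # offsets of reclaimed buckets

    def allocate(self):
        """Return (offset, view); the caller keeps the offset to free the bucket later."""
        if self.reuse_queue:                              # prefer recycled buckets
            off = self.reuse_queue.popleft()
        else:
            if self.next_free >= self.capacity:
                raise MemoryError("pool exhausted; a real pool would grow here")
            off = self.next_free * self.bucket_bytes
            self.next_free += 1
        return off, memoryview(self.block)[off:off + self.bucket_bytes]

    def free(self, off):
        # Buckets evicted (e.g. during Online Learning) go back to the ReuseQueue.
        self.reuse_queue.append(off)

# Usage: key/value storage for hash table entries comes from the pools, not malloc.
key_pool, value_pool = BucketPool(8, 1024), BucketPool(4 * 64, 1024)
k_off, k_buf = key_pool.allocate()
key_pool.free(k_off)          # the reclaimed bucket is reused by the next insert
```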

3.6 Unit Computing Power Throughput Optimization

According to the analysis in Section 2.2, the computation pressure on the Workers is also very high. Without optimizing the Workers, maintaining throughput requires horizontally scaling out more Workers, which puts even more pressure on the PS. For users, improving performance under a fixed compute budget delivers more business value. We identified a number of high-frequency operators through CAT and optimized them individually. Here we share the Unique and DynamicPartition operator fusion case.

In the TensorFlow PS architecture, shared parameters, including the Embedding vectors, are stored on the PS and exchanged with Workers over the network. An Embedding query usually involves the following two steps:

  • Due to the nature of sparse parameters, the repetition rate of query IDs extracted from a batch of samples is usually as high as 70%~90%. Without deduplication, both the HashTable query and the network transfer would be under great pressure, so a Unique operation is usually performed before the query.
  • In a large-scale sparse scenario, hundreds of billions of parameters are stored jointly by multiple PS machines. The Worker splits query requests according to the configured routing rules, so a DynamicPartition operation is usually performed before the query.

Usually these two steps are built from existing TensorFlow operators, but in practice we found this is not very efficient. The main problems are:

  • The native implementation of the Unique operator has an inefficient internal memory allocation strategy: it allocates a hash table sized at twice the number of input Embedding IDs, but because the input is large and the repetition rate is high, the resulting HashTable is too large and sparse. Almost every insert triggers a minor page fault, degrading HashTable performance. We verified this using Intel VTune (see Figure 18).
  • The Unique and DynamicPartition operators traverse the data redundantly. These operations could be completed in a single pass over the data, saving operator-switching overhead and redundant traversals.

In summary, an oversized HashTable causes a large number of minor page faults and increases access time, while an undersized HashTable may trigger expansions. We adopted a heuristic, memory-adaptive Unique operator: using statistics of the historical repetition rate during training, it sizes the HashTable reasonably and improves access performance. In addition, for the HashTable used inside the Unique operator, after various tests a Robin Hood HashTable was chosen to replace the native TF implementation.

Furthermore, we fused the Unique and Partition steps of the Embedding ID processing into one operator, simplifying the logic. After the above optimizations, the Unique operator alone is 51% faster, the real model is about 10% faster end to end, and the total number of operators decreased by 4%.
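For illustration, the sketch below fuses deduplication and PS routing into one pass over the IDs (a hypothetical `fused_unique_partition` helper, not the actual fused operator): one dict lookup per ID both deduplicates and records the per-PS shard, and the returned index lets the gathered embeddings be scattered back to the original positions.

```python
def fused_unique_partition(ids, num_ps):
    """Single pass: dedupe IDs and route each unique ID to its PS shard.

    Returns (per_ps_ids, idx), where idx[i] = (shard, position) of ids[i], so that
    embeddings fetched per shard can be scattered back into the original order.
    """
    seen = {}                                  # id -> (shard, position within shard)
    per_ps_ids = [[] for _ in range(num_ps)]
    idx = []
    for feature_id in ids:
        if feature_id not in seen:             # dedupe and partition in the same pass
            shard = feature_id % num_ps        # routing rule here: simple modulo sharding
            seen[feature_id] = (shard, len(per_ps_ids[shard]))
            per_ps_ids[shard].append(feature_id)
        idx.append(seen[feature_id])
    return per_ps_ids, idx

# 70%~90% of the IDs in a batch are typically duplicates, so the per-PS lists are short.
per_ps, idx = fused_unique_partition([7, 3, 7, 10, 3, 7], num_ps=2)
# per_ps == [[10], [7, 3]]; idx maps every original slot back to (shard, position)
```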

Throughout the key operator optimization work, Lin Lifan, Zhang Xiangze, and Gao Ming from Intel provided a great deal of technical support, and we also reused part of their optimization work. We are deeply grateful!

4 Large-scale Sparse Algorithm Modeling

When bringing the large-scale sparse capability to businesses, the algorithm side also needs corresponding upgrades in features and model structure to obtain good results. Starting from business features, the takeout advertising team introduced large-scale sparse features to upgrade the feature system for takeout scenarios, providing a higher-dimensional feature space and parameter space and enhancing the model's fitting ability. The feature encoding scheme for high-dimensional sparse scenarios was redesigned to resolve feature conflicts during encoding; at the same time, redundant feature hashing operations were removed from the encoding process, which simplifies the feature processing logic and reduces feature computation time.

At the system level, training large-scale sparse models with 10 billion parameters and over 10 billion samples would greatly reduce iteration efficiency: a single experiment would grow from less than a day to about a week. The training engine team of Meituan's machine learning platform, in addition to the TensorFlow framework-level optimizations above, also carried out model-specific optimizations for the business, improving overall throughput by 8 to 10 times (further acceleration is possible with more compute resources). This improved the business's iteration efficiency and brought clear gains to advertising delivery.

5 Summary and Outlook

TensorFlow is widely used in large-scale recommendation systems, but the lack of large-scale sparse parameters and large-scale distributed training capabilities has hindered business development. On top of the native TensorFlow architecture, Meituan added large-scale sparse capabilities and carried out deep optimization from multiple angles, achieving efficient distributed training with 100 billion parameters and 100 billion samples; the system is used at large scale inside Meituan. The lack of these key capabilities has also resonated with the TensorFlow community, which launched SIG Recommenders[11] in 2020 to address such issues through community collaboration, and Meituan is actively involved in contributing to the community.

At present, model training for Meituan's recommendation system scenarios runs mainly on CPUs. However, with the development of the business, some models have become more and more complex, and there is little room left for optimization on CPU (the optimized Worker CPU utilization exceeds 90%). The NVIDIA A100 GPU, with 156 TFLOPS (TF32 Tensor Cores), 80 GB of memory, and 600 GB/s of inter-card bandwidth, offers a dramatic improvement in compute density. For the workload of such complex models, we have designed the next-generation distributed training architecture based on the A100 GPU. After preliminary optimization, we have already achieved good results on a high-traffic Meituan recommendation model. Further optimization is in progress, and we will share it later; please stay tuned.

6 About the Authors

  • Yifan, Jia Heng, Zheng Shao, Peng Peng, Yongyu, Zhengyang, Huang Jun, et al., from the Training Engine Group of the Machine Learning Platform, Meituan Basic R&D Platform, are mainly responsible for the performance optimization and capability building of Meituan's distributed machine learning training system.
  • Hai Tao, from the Meituan takeout advertising strategy team, is mainly responsible for algorithm exploration and strategy implementation for Meituan's takeout advertising business.

7 Recruitment Information

Meituan's machine learning platform is hiring for a large number of positions, both experienced hires and new graduates (new graduates are welcome to apply to our Beidou position: Meituan Machine Learning Platform Infrastructure), located in Beijing/Shanghai. Join us to build a company-level machine learning platform covering multiple fields and help everyone eat better and live better. Resumes can be sent to: [email protected].

8 Reference Information

  • [1] www.usenix.org/system/file…
  • [2] github.com/dianping/ca…
  • [3] www.usenix.org/system/file…
  • [4] github.com/tensorflow/…
  • [5] labs.criteo.com/2013/12/dow…
  • [6] arxiv.org/abs/1906.00…
  • [7] github.com/tensorflow/…
  • [8] github.com/bytedance/b…
  • [9] research.baidu.com/Public/uplo…
  • [10] github.com/oneapi-src/…
  • [11] github.com/tensorflow/…


This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content of this article for non-commercial purposes such as sharing and communication; please credit "Content reprinted from the Meituan technical team". This article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] for authorization.