Introduction: The deep learning platform plays an important role in Weibo's social business. Under a compute-storage separation architecture, the platform suffered from poor data access and scheduling performance. This article introduces a new architecture, designed and implemented inside Weibo and based on Fluid (with JindoRuntime), which significantly improves the performance and stability of model training on massive numbers of small files, and increases the speed of multi-machine, multi-GPU distributed training by up to 18x.
Authors: Wu Tong, Weibo deep learning platform engineer; Hao Li, Weibo deep learning platform engineer
Background
Sina Weibo is the largest social media platform in China, with hundreds of millions of pieces of content generated every day and spread across a trillion-scale social network. The figure below shows Weibo's business ecosystem: high-quality users produce and disseminate premium content, ordinary users consume that content and then follow the bloggers they like, establishing connections and forming a closed-loop ecosystem.
The main job of the Weibo machine learning platform is to make this whole flow more efficient: by understanding high-quality content and building user profiles, it pushes the content users are interested in to them and lets them interact with the content producers, which in turn motivates producers to create more and better content, a win for both information consumers and producers. As multimedia content becomes mainstream, deep learning becomes even more important: from multimedia content understanding to CTR optimization, none of it works without deep learning.
Large-scale deep learning model training challenges
As deep learning is widely applied in Weibo's business scenarios, the Weibo deep learning platform plays a very central role. The platform adopts a storage-compute separation architecture so that compute resources are decoupled from storage resources, enabling flexible resource ratios, convenient storage expansion, and lower storage costs.
However, this architecture also presents some challenges, among which the key issues are data access performance and stability:
- High data access latency under the compute-storage separation architecture slows training: the deep learning tasks (image or speech models) run by the business teams access large numbers of small files. Experiments show that reading massive numbers of small files from HDFS is nearly 10x, or even 100x, slower than reading them locally.
- The Kubernetes scheduler is not aware of data caches, so repeated runs over the same data source remain slow: the same model with different hyperparameters, fine-tuning a model on the same input, or AutoML-style deep learning tasks all access the same data repeatedly and produce a cache that could be reused. However, because the native Kubernetes scheduler cannot sense the cache, applications are scheduled poorly, the cache cannot be reused, and performance does not improve.
- Most deep learning frameworks do not support HDFS, which makes development difficult: frameworks such as PyTorch and MXNet only support POSIX, and HDFS access requires extra development. Model code therefore has to support the POSIX interface in the development stage and the HDFS interface in the training stage, introducing the complexity of adapting model code to different storage systems.
- HDFS becomes a bottleneck for concurrent data access and poses a serious stability challenge: hundreds of GPU machines access the HDFS cluster concurrently during training, and the I/O pressure of deep learning training is high, so HDFS becomes a single point of performance, putting great stress on its performance and stability. If one task slows HDFS down, other training tasks are affected as well; and if HDFS fails, the entire training cluster is affected.

Through monitoring and analysis of the Weibo deep learning platform, we found that, on the one hand, expensive compute resources such as GPUs could not be fully utilized because of these I/O performance problems; on the other hand, the memory and local disks in the cluster had low utilization, plenty of headroom, and good stability, since most deep learning tasks did not use local disks and memory utilization was not high. We therefore concluded that making full use of the cluster's own memory and disk resources to accelerate data access would be a better solution.
Fluid + JindoRuntime: providing efficient support for the Weibo deep learning platform
To better meet the computing requirements of large-scale deep learning model training, we needed better data locality. We therefore hoped to achieve the following objectives:
- Computation takes full advantage of localized data access, so that data does not have to be read repeatedly across the network, accelerating deep learning model training and improving the GPU utilization of the cluster.
- HDFS load is reduced: reading data locally lowers data access latency and improves HDFS availability.
- The cache nodes holding hotspot datasets are fully exploited: tasks are intelligently scheduled onto the nodes that hold the data cache, without any user awareness, making ordinary model training programs faster and faster.
- Data is read through POSIX interfaces, so there is no need to use different data access interfaces during model development and training, which reduces the cost of developing deep learning model programs.

To achieve these goals, we urgently needed software on Kubernetes with distributed cache acceleration capabilities. Fortunately, we found that Fluid, a CNCF Sandbox project, was the right fit, so we designed a new architecture solution based on Fluid. After verification and comparison, we chose JindoRuntime as the acceleration runtime.
1. Architecture components
1) Fluid
Fluid[1] is an extensible distributed data orchestration and acceleration system that runs on Kubernetes. Through data orchestration and scheduling of the applications that use the data, it addresses pain points of applications running on cloud-native orchestration frameworks, such as high data access latency, difficulty in joint analysis across multiple data sources, and complex data usage processes.
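For illustration, the following is a minimal sketch of a Fluid Dataset that mounts an HDFS path. The dataset name, namenode address, and path are placeholders rather than the actual production values, and field details should be checked against the installed Fluid version.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: weibo-train-data                                    # hypothetical dataset name
spec:
  mounts:
    - mountPoint: hdfs://namenode-host:8020/path/to/train   # placeholder HDFS address and path
      name: hdfs
```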
2) JindoRuntime
JindoRuntime[2] is an implementation of a Fluid distributed cache runtime based on the JindoFS distributed cache acceleration engine. JindoFS is a big data storage optimization engine developed by the Alibaba Cloud EMR team; it is fully compatible with the Hadoop file system interface and gives customers more flexible and efficient compute and storage solutions. JindoRuntime uses JindoFS's cache mode to access and cache remote files, and supports access to and cache acceleration of storage products such as OSS, HDFS, and standard S3. Using and deploying JindoRuntime on Fluid is simple: it is compatible with a native Kubernetes environment and works out of the box. Combined with object storage features, it uses a native framework to optimize performance and supports cloud data security functions such as credential-free access and checksum verification.
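A minimal JindoRuntime sketch paired with the Dataset above might look like the following. The replica count, cache medium, and cache path are assumptions that would need to be tuned to the actual cache nodes; the 300Gi quota mirrors the per-node cache space mentioned in the test setup later in this article.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: weibo-train-data      # must match the Dataset name
spec:
  replicas: 3                 # number of cache worker nodes (assumption)
  tieredstore:
    levels:
      - mediumtype: SSD       # cache on local SSD (assumption)
        path: /mnt/disk1      # placeholder cache directory on each node
        quota: 300Gi          # cache space per node
        high: "0.95"
        low: "0.7"
```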
2. Reasons for using Fluid based on JindoRuntime
- Fluid can orchestrate datasets into the Kubernetes cluster, co-locating data and computation, and it exposes data through a PersistentVolumeClaim interface so that applications on Kubernetes can use it seamlessly. JindoRuntime, in turn, provides access to and cache acceleration of HDFS data, and its FUSE-based POSIX file system interface lets massive files on HDFS be used as easily as a local disk, so that deep learning training tools such as PyTorch can read training data through POSIX file interfaces (a pod sketch follows this list).
- Targeting the remote access performance problem of small files, JindoRuntime optimizes the organization, management, and access of small files, providing access performance far higher than reading from HDFS directly.
- It provides distributed hierarchical caching of both metadata and data, as well as efficient retrieval of small files.
- It provides a data warm-up mechanism to avoid data access contention caused by pulling data at training time.
- It organizes file data in a slab-allocation style so that cache space is used efficiently.
- Through Fluid's data-aware scheduling capability, users can place tasks on the nodes that hold cached data without knowing which nodes those are, maximizing the benefit of local data access.
- It provides different cache policies and storage modes for large and small files, and adapts well to small-file AI training scenarios without extra user configuration.
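To illustrate the PVC-based docking described above, here is a hedged sketch of a training pod that mounts the dataset. Fluid creates a PVC named after the Dataset; the pod name, image, and training command are placeholders, not the actual production job.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: video-train-job                                    # hypothetical training pod
spec:
  containers:
    - name: trainer
      image: registry.example.com/mmaction-train:latest    # placeholder training image
      command: ["python", "train.py", "--data-root", "/data"]
      volumeMounts:
        - name: train-data
          mountPath: /data          # training code reads files here through ordinary POSIX calls
  volumes:
    - name: train-data
      persistentVolumeClaim:
        claimName: weibo-train-data # PVC created by Fluid, named after the Dataset
```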
3. Practice
- Select the right cache nodes: JindoRuntime can deliver good data locality, but in production we found that not every node gives good cache performance, because the disk and network I/O of some nodes is poor. We therefore need to pick nodes with large disks and good network wherever possible. Fluid supports the schedulability of a dataset, in other words, the schedulability of its cache nodes: by specifying the nodeAffinity of the dataset we schedule its cache nodes, ensuring that they can serve the cache efficiently (see the sketch after this list).
- Specify the master scheduling policy: JindoRuntime consists of master, worker, and FUSE components. The master is the brain of the cluster, responsible for metadata and cluster cache management, so the master node must have strong reliability and fast failure recovery. In production we found that a single master, without resorting to multiple masters, already gives strong stability and fast recovery. The main factor affecting master stability is the stability of its host machine, for example a full disk or communication failures. Based on this, we use a nodeSelector for the master node to pick a better-performing host as the master's container environment, further ensuring the stability of the master.
- Warm data up on a schedule: an important step before training is warming up metadata and data. Fluid provides metadata and data caching in the form of CRDs, and caching the metadata and data of the training files locally before training greatly accelerates it. However, the training files stored on HDFS are updated once a day, so the warm-up process has to run periodically. Based on the DataLoad CRD, we schedule it periodically with a CronJob so that metadata and data are ready before each training run (see the sketch after this list). JindoRuntime also supports incremental synchronization, so only the files that changed need to be updated each time, which greatly speeds up data warm-up.
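A hedged sketch of the two configuration points above: restricting cache nodes with the Dataset's nodeAffinity, and warming data with a DataLoad resource that a Kubernetes CronJob can re-apply periodically. The node label, dataset name, and target path are assumptions, not the actual production values.

```yaml
# Restrict cache workers to nodes labeled as good cache hosts (the label is hypothetical).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: weibo-train-data
spec:
  mounts:
    - mountPoint: hdfs://namenode-host:8020/path/to/train   # placeholder
      name: hdfs
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: weibo.com/cache-node   # hypothetical node label for disks/network that qualify
              operator: In
              values: ["true"]
---
# Warm up metadata and data before training; a CronJob can re-create this DataLoad
# daily to pick up newly generated training files.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: weibo-train-data-warmup
spec:
  dataset:
    name: weibo-train-data
    namespace: default
  loadMetadata: true          # also warm up metadata
  target:
    - path: /                 # warm the whole mount (placeholder path)
      replicas: 1
```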
4. Performance test scheme
To verify the overall effect of the above scheme, we validated it from the perspectives of both stability and performance; here we mainly describe the performance test scheme. The training models are all video understanding models based on MMAction, using the rawframes_train mode, and the experiments use a training dataset of 4 million images. The data consists of frames extracted from 400,000 videos from real business scenarios, with 10 frames extracted per video. Because video definition varies, each image ranges in size from a few KB to over ten MB, for a total size of about 780 GB. Each cache node provides 300 GB of cache space. Based on experience, the model generally converges around epoch 50. When we raised the test data volume to 1 million videos, the total data size reached 2 TB; because of the large data volume and high latency, reading through the HDFS interface could not work at all, while Fluid + JindoRuntime could still meet the business needs. The testing process is to first warm the data up with Fluid + JindoRuntime and then run model training.
5. Performance test results
With the Fluid + JindoRuntime scheme and data warm-up in place, we achieved a very significant improvement in training speed, as can be seen from the figure below. In the 3-machine 12-GPU scenario we found that experiments reading data through the HDFS interface were often interrupted by network communication problems and could not finish; after exception handling was added, the waiting time between workers grew, so adding GPUs did not speed up training but actually slowed it down. We observed that with the HDFS interface the overall training speed was basically the same in the 1-machine 8-GPU and 3-machine 12-GPU scenarios, even though compute resources were expanded. With the new scheme, compared with the HDFS interface, 1 machine with 4 GPUs gets a 5x speedup, 2 machines with 8 GPUs get 9x, and 3 machines with 12 GPUs get 18x.
Because training speed and stability are now guaranteed, end-to-end model training time has also improved significantly: total training time has been shortened from 389 hours (about 16 days) to 16 hours.
Summary: training time drops from two weeks to 16 hours
The integration of Fluid + JindoRuntime significantly improves the performance and stability of model training in small-file scenarios. In multi-machine, multi-GPU distributed training, model training speed can be increased by 18x, cutting training that used to take two weeks down to 16 hours. Shorter training time and lower HDFS pressure also improved the stability of training tasks, raising the training success rate from 37.1% to 98.3%. We currently have 4 TB of data in production, and the volume keeps growing as we iterate.
Weibo's AI training scenarios have high requirements on data-reading performance, and massive small files are also very sensitive to access latency. JindoRuntime's caching capability effectively accelerates caching of data from big data storage systems and provides stable, reliable, high-throughput, low-latency data access. It also effectively relieves the pressure on the back-end storage system and guarantees its stability. Optimizing the reading and caching of small files in specific scenarios not only relieves the I/O pressure on the HDFS cluster, but also greatly improves training efficiency.
Looking ahead
At present, Fluid + JindoRuntime is more of a trump card used to accelerate small-file scenarios than a conventional weapon for accelerating and optimizing all datasets. We expect elastic data acceleration to become a differentiating capability of the Weibo deep learning platform, improving overall training task speed and the utilization of compute resources; on the other hand, we also want to help the community evolve and help more developers. Specifically:
- Support for scheduled tasks and dynamic scale-out
- Improved data warm-up performance and a metadata backup mechanism, providing the ability to quickly rebuild datasets
- A performance monitoring console
- High availability and image upgrade support for Runtime metadata
- Full life-cycle management of multiple datasets in large-scale Kubernetes clusters

Thanks to Chenshan and Yangli of the Alibaba Cloud JindoFS team and Cheyang of the container team, who helped greatly throughout the design and optimization process, delivering data acceleration for existing applications with almost no application changes; they also provided timely and professional support for requirements and problems in the test and production environments.

Related links
For more information on Fluid and JindoFS, please refer to the following links:
[1] Fluid:github.com/fluid-cloud…
[2] JindoFS:github.com/aliyun/alib…