Recently, a paper on the architecture innovation of the Fluid open source project, written by the Alibaba Cloud and Nanjing University team, was accepted by ICDE 2022, a top international conference on data management and databases.
The International Conference on Data Engineering (ICDE) is a flagship conference of the Institute of Electrical and Electronics Engineers (IEEE). Together with SIGMOD and VLDB, it is one of the three top international academic conferences in the field of data management and databases, and all three are on the list of Class A international conferences recommended by the China Computer Federation (CCF).
The paper, titled "Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs", addresses the I/O performance challenges of deep learning training jobs in cloud native environments by proposing a new dataset abstraction and elastic acceleration system architecture that accelerates data access through a cache engine automatically tuned to dataset characteristics. The authors mainly come from the Alibaba Cloud native team and the Department of Computer Science and Technology of Nanjing University.
Fluid (github.com/fluid-cloud…) is a Cloud Native Computing Foundation (CNCF) sandbox open source project for elastic data orchestration and acceleration, jointly initiated and actively maintained by the Alibaba Cloud native team and Nanjing University. Its core technical capabilities include a dataset abstraction that shields heterogeneous underlying storage, automatic elastic scaling of data caches, and co-orchestration of data and applications on the cloud. Since it was open sourced in 2020, the Fluid project has grown rapidly, with more than 1,000 PR submissions and seven releases. It was officially accepted into the CNCF Sandbox in April 2021, filling a gap in the Kubernetes ecosystem for elastic data cache orchestration. It has also been included in the scheduling layer of the CNCF cloud native landscape and was named a Peak Open Source Project at OSCAR 2021.
In real production environments, Fluid has helped a large number of users significantly improve the training performance of their AI models and reduce the complexity of managing training data. The Alibaba Cloud native team has implemented and optimized Fluid's core ideas and design as an important part of its cloud native AI offering, and provides them as a service through the cloud native AI suite of the container service ACK.
Over the past few years, through the container service ACK, Alibaba Cloud has carried out continuous practice and innovation in cloud native AI, covering heterogeneous computing resource management, AI task lifecycle management, AI task scheduling and acceleration, AI training data acceleration, and more, bringing breakthrough improvements in AI project creation efficiency, computing resource utilization, and the speed of AI platform construction. In addition to enabling enterprises through various tools and solutions on its cloud services, the Alibaba Cloud native team has also contributed its leading cloud native AI technology back to the community: it jointly initiated and maintains the open source project Fluid with its partners and donated it to CNCF, the cloud native foundation. Currently, 140+ contributors from more than 10 well-known companies are working in the Fluid community to promote technology innovation and adoption in China's cloud native AI field.
The acceptance of this paper by ICDE represents another result of Alibaba Cloud's continuous deep investment and innovation in cloud native container technology. Earlier, a paper on serverless decentralized fast image distribution technology was accepted by USENIX ATC '21. In January 2022, the international authoritative research firm Forrester released The Forrester Wave™: Public Cloud Container Platforms, Q1 2022, in which Alibaba Cloud entered the "Leaders" quadrant for global public cloud container platforms, the first time a Chinese cloud computing vendor has done so.
Paper information
Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs (ICDE 2022)
Authors: Gu Rong, Zhang Kai, Xu Zhihao, Che Yang, Fan Bin, Hou Haojun, Dai Haipeng, Yi Li, Ding Yu, Chen Guihai, Huang Yihua
Abstract: Thanks to the high flexibility, low cost, and operational agility of the containerization and orchestration technologies provided by cloud native platforms, more and more users are running deep learning training jobs on container cloud platforms built on Kubernetes and Docker. However, running deep learning training jobs directly in a cloud native environment often faces I/O performance challenges, including complex data access and tuning, difficulty in dynamically matching the I/O requirements of GPUs, and inefficient sharing of cached data across jobs. To solve these problems, this paper proposes Fluid, a dataset abstraction and elastic acceleration system for cloud native deep learning training. Fluid shields the underlying heterogeneous storage by providing the Fluid Dataset abstraction, and accelerates data access through a cache engine that is automatically tuned to dataset characteristics. Furthermore, Fluid can dynamically adjust the size of the cache space during training according to changing I/O requirements. Finally, to improve performance when multiple jobs run together, Fluid can use application semantics across job caches to optimize the job scheduling order and improve overall execution performance. Experiments in related scenarios show that Fluid significantly improves the performance of mainstream and industry-leading cloud native scheduling systems without being intrusive to the original systems.
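As a rough illustration of how such a Dataset abstraction is typically consumed on Kubernetes, the sketch below creates a Fluid Dataset and a cache runtime as custom resources through the official Kubernetes Python client. The API group (data.fluid.io/v1alpha1), resource kinds, and field names are written from memory of Fluid's public CRD examples and should be treated as assumptions; the OSS bucket path, names, and cache sizes are purely hypothetical.

```python
# Minimal sketch: registering a dataset with Fluid on Kubernetes.
# Assumes the Fluid CRDs (data.fluid.io/v1alpha1) are installed in the cluster;
# field names follow Fluid's public examples and may differ across versions.
from kubernetes import client, config

config.load_kube_config()          # use the local kubeconfig
api = client.CustomObjectsApi()

# 1. Dataset: abstracts where the training data lives (hypothetical OSS path).
dataset = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "Dataset",
    "metadata": {"name": "imagenet-demo"},
    "spec": {
        "mounts": [
            {"mountPoint": "oss://example-bucket/imagenet/", "name": "imagenet"}
        ]
    },
}

# 2. AlluxioRuntime: the cache engine backing the Dataset; the replica count and
#    the memory-tier quota are the knobs that elastic cache scaling would adjust.
runtime = {
    "apiVersion": "data.fluid.io/v1alpha1",
    "kind": "AlluxioRuntime",
    "metadata": {"name": "imagenet-demo"},   # must match the Dataset name
    "spec": {
        "replicas": 2,
        "tieredstore": {
            "levels": [
                {"mediumtype": "MEM", "path": "/dev/shm", "quota": "2Gi"}
            ]
        },
    },
}

# Create both custom resources in the "default" namespace.
for plural, body in [("datasets", dataset), ("alluxioruntimes", runtime)]:
    api.create_namespaced_custom_object(
        group="data.fluid.io",
        version="v1alpha1",
        namespace="default",
        plural=plural,
        body=body,
    )
```

Once the Dataset is bound to a runtime, a training job can mount the volume that Fluid provisions for it and read the data transparently through the cache, which is where the automatic tuning and elastic scaling described in the abstract come into play.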
Follow the [Alibaba Cloud Native] public account for the latest cloud native news, a comprehensive collection of cloud native content, regular cloud native events and live streams, and best practices from Alibaba products and users. Explore cloud native technology with us and share the cloud native content you need.