KubeCon + CloudNativeCon + Open Source Summit China 2021 was held online on December 9-10, 2021. Yi Li, senior technical expert at Alibaba Cloud and head of container service R&D, delivered a keynote titled “Cloud Future, New Possibilities” at the main forum of the conference, sharing Alibaba Cloud’s judgment on technology trends and its innovation progress based on large-scale cloud-native practice.

The following is the transcript.

Yi Li, Alibaba Cloud senior technical expert and head of container service R&D

Hello, everyone. I’m Yi Li from Alibaba Cloud. I’m in charge of the container service product line and a member of the CNCF Governing Board. This is the second time I have spoken to you online at KubeCon. Today, I will share Alibaba Cloud’s practice and thinking in the cloud-native field, as well as our judgment on the future.

Cloud native: the cornerstone of technology innovation in the digital economy

Since 2020, COVID-19 has changed how the global economy functions and how people live. Digital production and digital lifestyles have become the new normal of the post-pandemic era. Today, cloud computing is the infrastructure of the digital economy, and cloud-native technologies are profoundly changing the way enterprises move to and use the cloud.

Alibaba Cloud defines cloud native as the software, hardware, and architectures born of the cloud that help enterprises maximize the value of the cloud. Specifically, cloud-native technology brings three core business values to enterprises:

  1. Agility and efficiency – Better support for DevOps improves application development and delivery efficiency, elasticity, and resource utilization, helping enterprises respond to change faster and reduce computing costs.

  2. Enhanced resilience – Container technology simplifies moving workloads to the cloud and better supports microservice application architectures, further strengthening the resilience of enterprise IT infrastructure and application architectures to ensure business continuity.

  3. Converged innovation – As new technologies such as 5G, AIoT, and AR/VR develop rapidly, cloud-native technology makes computing ubiquitous and better supports these new, converged forms of computing.

If cloud native represents cloud computing today, what is the future of cloud computing?

Cloud future, new possibilities

As the power engine of the digital economy, the growing energy consumption of data centers has become a problem that the development of cloud computing cannot ignore. Data centers in China reportedly consumed more than 2.3 percent of the country’s total electricity in 2020, and that share will keep rising year by year. Alibaba Cloud is pushing green computing at the physical level, for example by using immersion liquid-cooled servers to reduce data center PUE. The computing efficiency of data centers also has great room for improvement: according to statistics, the average resource utilization of data centers worldwide is less than 20%, a huge waste of resources and energy.

The essence of cloud computing is to aggregate discrete computing power into a larger resource pool and, through optimized resource scheduling that shaves peaks and fills valleys, deliver the best possible energy efficiency.

A new generation of unified resource scheduling boosts green computing

After Alibaba Group completed its full migration to the cloud, we launched a new plan: using cloud-native technology to unify resource scheduling across the Group’s tens of millions of server cores distributed in dozens of regions around the world, in order to comprehensively improve utilization. Through the joint efforts of Alibaba Group and Alibaba Cloud, the unified scheduling project delivered an impressive result during this year’s Double 11.

Cybernetes, the unified scheduler Alibaba developed on top of Kubernetes, intelligently schedules the underlying computing resources through a single scheduling protocol and a single system architecture. It supports mixed deployment of many kinds of workloads and improves resource utilization while guaranteeing application SLOs, letting e-commerce microservices and middleware, search and advertising, and MaxCompute big data and AI workloads all run on one unified container platform. As a result, Alibaba Group can cut the annual purchase of computing power by tens of thousands of servers, saving hundreds of millions in resource costs.
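To make the colocation idea concrete, here is a minimal, hypothetical Go sketch (using client-go) of how a latency-sensitive online service and an opportunistic batch job might be submitted to the same Kubernetes cluster with different priority classes and an SLO annotation that a colocation-aware scheduler could read. The class names and annotation key are illustrative assumptions, not Cybernetes’ actual protocol.

```go
// Hypothetical sketch: expressing a latency-sensitive service and a batch job
// as two pods on one Kubernetes cluster, differentiated by priority class and
// resource requests so a colocation-aware scheduler can pack them together.
// Priority class names and the annotation key are illustrative assumptions.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func newPod(name, priorityClass, cpuReq string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: name,
			// Illustrative annotation a colocation scheduler might read.
			Annotations: map[string]string{"scheduling.example.com/slo-class": priorityClass},
		},
		Spec: corev1.PodSpec{
			PriorityClassName: priorityClass,
			Containers: []corev1.Container{{
				Name:  "main",
				Image: "registry.example.com/app:latest",
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						corev1.ResourceCPU: resource.MustParse(cpuReq),
					},
				},
			}},
		},
	}
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A latency-sensitive microservice and a batch task share the node pool;
	// the scheduler can fill the valleys left by the online service with batch work.
	for _, pod := range []*corev1.Pod{
		newPod("web-frontend", "latency-critical", "2"),
		newPod("batch-analytics", "best-effort-batch", "4"),
	} {
		if _, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
		fmt.Println("submitted", pod.Name)
	}
}
```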

A single cluster scales to more than 10,000 nodes and one million cores, and scheduling throughput reaches 20,000 tasks per second, meeting the demands of high-throughput, low-latency services such as search, big data, and AI with excellent performance. Unified scheduling helped reduce the cost of this year’s Double 11 by 50% and raised routine CPU utilization in the production environment to 65%.

Cloud-native “Green AI” meets the challenge of training large AI models

Large multimodal pre-trained AI models are widely regarded as a critical path toward artificial general intelligence.

The well-known GPT-3 has hundreds of billions of parameters and can rival humans in some areas of natural language understanding. M6, the super-large pre-trained model newly released by Alibaba’s DAMO Academy, has entered the era of 10 trillion parameters. M6 handles multimodal Chinese tasks and is especially good at design, writing, and question answering, with broad application prospects in e-commerce, apparel, scientific research, and other fields.

Kubernetes’ support for deep learning tasks is maturing, but large-scale model training still faces severe challenges: training a model with trillions of parameters often requires thousands of GPUs and tens of terabytes of GPU memory, and takes dozens of days to complete.

To address these challenges, Cybernetes extends Kubernetes with large-scale AI task scheduling capabilities. Efficient scheduling of heterogeneous computing power, together with data awareness and access acceleration, substantially improves GPU efficiency; off-peak scheduling makes full use of idle cluster resources; and the cloud-native PAI-Whale framework supports efficient parallel model training.
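The off-peak scheduling mentioned above can be illustrated with a small, self-contained Go sketch: a batch training job is admitted only when the cluster is under-utilized or the clock falls in an assumed nightly low-traffic window. The thresholds and time window are purely illustrative assumptions, not the production policy.

```go
// Minimal sketch of the "off-peak scheduling" idea: admit an opportunistic
// training job only when utilization is low or during an assumed nightly valley.
package main

import (
	"fmt"
	"time"
)

type clusterStats struct {
	cpuUtilization float64 // fraction of allocatable CPU currently in use
	gpuUtilization float64 // fraction of allocatable GPUs currently in use
}

// admitTrainingJob returns true if an opportunistic training job may start now.
func admitTrainingJob(now time.Time, stats clusterStats) bool {
	inOffPeakWindow := now.Hour() >= 1 && now.Hour() < 7 // assumed nightly valley
	underUtilized := stats.cpuUtilization < 0.45 && stats.gpuUtilization < 0.50
	return inOffPeakWindow || underUtilized
}

func main() {
	stats := clusterStats{cpuUtilization: 0.38, gpuUtilization: 0.42}
	if admitTrainingJob(time.Now(), stats) {
		fmt.Println("admit: idle capacity available for the training job")
	} else {
		fmt.Println("defer: keep the job queued until the next valley")
	}
}
```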

In the end, the 10-trillion-parameter M6 model was trained in 10 days using only 512 GPUs, greatly improving training efficiency and resource utilization. Compared with models of the same scale elsewhere in the world, energy consumption is reduced by more than 80%, truly realizing green AI.

Cloud-edge-device collaboration for ubiquitous computing

As new technologies such as 5G, the Internet of Things and AR/VR continue to mature, the digital world is further integrating with the physical world.

OpenYurt is the industry’s first open-source, non-intrusive cloud-native edge computing project; it became a CNCF Sandbox project last November.

Edge computing faces technical challenges such as dispersed computing power, heterogeneous resources, and weak network connectivity. OpenYurt builds a cloud-edge collaborative computing framework on top of Kubernetes. Over the past two years it has been adopted in many industries, including live video streaming, cloud gaming, logistics and transportation, intelligent manufacturing, and city brain projects.

This year, we set out to realize device twinning in a cloud-native way, to efficiently address the management, operations, and maintenance challenges of massive numbers of distributed devices in IoT scenarios. Through cooperation between OpenYurt and engineers from the EdgeX Foundry community, VMware, Intel, and others, we achieved unified modeling and unified management of edge devices and their applications. Here I will show you an example of ubiquitous computing using OpenYurt.
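To give a feel for what device twinning means, here is a hedged Go sketch of the idea: the cloud keeps a desired state for each edge device, the edge reports observed state, and a reconcile step pushes the difference down to the device. The types and field names are illustrative only; they are not OpenYurt’s or EdgeX Foundry’s actual APIs.

```go
// Hedged sketch of a device twin: desired properties set from the cloud,
// reported properties observed at the edge, and a diff to reconcile.
package main

import "fmt"

// DeviceTwin keeps the cloud-side desired properties and the edge-side
// reported properties for one physical device, e.g. a camera at a gate.
type DeviceTwin struct {
	Name     string
	Desired  map[string]string // what operators want, set via the cloud control plane
	Reported map[string]string // what the device last reported from the edge
}

// diff returns the properties whose desired value differs from the reported one.
func (t *DeviceTwin) diff() map[string]string {
	pending := map[string]string{}
	for k, want := range t.Desired {
		if got, ok := t.Reported[k]; !ok || got != want {
			pending[k] = want
		}
	}
	return pending
}

func main() {
	cam := &DeviceTwin{
		Name:     "gate-camera-07",
		Desired:  map[string]string{"resolution": "1080p", "streaming": "on"},
		Reported: map[string]string{"resolution": "720p", "streaming": "on"},
	}
	// A reconcile step would send these property updates to the edge node,
	// which applies them to the device through its local device service.
	fmt.Println("properties to push:", cam.diff())
}
```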

Airport operational efficiency is crucial for meeting growing passenger and logistics demand, while airport security challenges are becoming increasingly prominent. In a smart airport project, a cloud-edge-device integrated architecture built with OpenYurt connects the airport’s perception layer of cameras, sensors, and edge AI appliances, while unified management and a big data platform on the cloud enable airport-wide data sharing and analysis. On top of this, capabilities such as panoramic video stitching, airport-wide security monitoring, and full visual coverage of the physical site are realized.

Privacy-enhancing computation safeguards data security

With the rapid development of the mobile internet and the Internet of Things, ubiquitous computing generates massive amounts of information all the time. Making infrastructure more trustworthy and protecting private data from theft, tampering, and abuse has become a major challenge. With the implementation of China’s Data Security Law, privacy-enhancing computation is receiving more and more attention in the industry.

Gartner predicts that by 2025, 60% of large organizations will use privacy-enhancing computation to process data in untrusted environments or in multi-party data analytics use cases.

An important branch of privacy-enhancing computation protects data with a TEE, a hardware-based trusted execution environment. A TEE uses a boundary-based security model whose boundary is very small and anchored in the hardware chip itself, so applications running inside the TEE need not worry about threats from other applications, other tenants, or the platform.

Confidential container technology, which combines containers with trusted execution environments, further strengthens the protection of sensitive information. On one hand, a container’s attack surface is smaller than that of a complete OS; on the other hand, a container-based secure software supply chain can guarantee that applications come from trusted, traceable sources.

Inclavare Containers, open-sourced by Alibaba, is the industry’s first container runtime project for confidential computing; it became a CNCF Sandbox project in September this year. Confidential containers hide the complexity of the underlying confidential computing stack, follow existing cloud-native standard interfaces and specifications, and remain compatible with the existing ecosystem, which will accelerate adoption of the technology. Working together in the community, we also see engineers from the Kata Containers project exploring this direction.

As shown in the figure, the SGX-based confidential containers supported by the Inclavare Containers project and the microVM-based confidential containers supported by the Kata confidential containers project are highly similar in technical form. The developers of the two projects are therefore actively collaborating, reusing each other’s components and working toward a unified developer experience across different TEE implementations, to maximize the value of the technology. This shows the power of the open source community.

Technically, compared with the runC and Kata container runtimes, container images containing sensitive data must be encrypted and digitally signed in advance. Image download and decryption take place inside the TEE to keep the decryption process secure, and the relevant keys are delivered into the TEE over a secure, trusted channel established by the remote attestation mechanism unique to confidential computing, so their contents cannot be leaked or tampered with. Finally, the entire confidential container runs inside the hardware-protected TEE at run time, and the data it computes on is encrypted in memory and integrity protected.
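As a rough illustration of the key-handling step, the following Go sketch decrypts an encrypted image layer with a key that, in a real confidential container, would only be released after remote attestation succeeds. The key-release call is a placeholder for that protocol; everything else uses only the Go standard library.

```go
// Minimal sketch of image-layer decryption inside a TEE, assuming remote
// attestation has already succeeded and the key broker released an AES key.
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

// fetchKeyAfterAttestation is a placeholder for the key-release step that, in a
// real confidential container, only succeeds once the TEE's attestation report
// has been verified by the relying party.
func fetchKeyAfterAttestation() []byte {
	return []byte("0123456789abcdef0123456789abcdef") // 32-byte demo key (assumption)
}

// decryptLayer decrypts one encrypted image layer with AES-256-GCM inside the
// TEE, so the plaintext never leaves hardware-protected memory.
func decryptLayer(key, nonce, ciphertext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := fetchKeyAfterAttestation()
	// In practice the nonce and ciphertext come from the encrypted OCI layer's
	// metadata; here we encrypt a small payload first just to show the round trip.
	block, _ := aes.NewCipher(key)
	gcm, _ := cipher.NewGCM(block)
	nonce := make([]byte, gcm.NonceSize())
	ciphertext := gcm.Seal(nil, nonce, []byte("layer contents"), nil)

	plain, err := decryptLayer(key, nonce, ciphertext)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(plain))
}
```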

Popularizing digital trust through cloud-native technology is still an emerging field, and we look forward to building it together!

Accelerating the cultivation of cloud-native talent

We believe that the development and popularization of any new technology must be driven by skilled professionals. As a practitioner and pioneer in the cloud-native field, Alibaba Cloud attaches great importance to empowering developers with its accumulated experience. In August this year, Alibaba Cloud, the Linux Foundation’s open source software school, and CNCF jointly released the “Cloud Native Talent Training Plan 2.0”, bringing the ecosystem together to cultivate cloud-native professionals through open skill maps, professional courses, certification benefits, and more. We welcome more developers to join us on the cloud-native learning path.

Thanks again for watching. We believe that green, ubiquitous, trusted cloud computing will further drive the industry and help us achieve a better tomorrow. Thank you very much!
