Author: Yun He | Source: Alibaba Cloud Native official account

An era of change often begins with a small innovation.

In 1956, when the trucking tycoon Malcolm McLean, who knew nothing about shipping, first used containers to transport goods, even he could not have imagined that a simple iron box would trigger a global revolution.

Containers are uniform in size, and the transport process is standardized and systematic. Each box holds only one kind of cargo and can be moved freely among ports, trains and ships. This greatly improved efficiency and cut transportation costs by 90%, essentially breaking down the logistics barriers between countries and ports.

This box turned the world into one big factory, promoting global division of labor and resource circulation — Apple found China’s Foxconn, Toyota created “just-in-time production”, Americans ate Brazilian beef… One small innovation has transformed the world’s economic and political landscape.

In IT circles, this box is called a “container.”

Containerized applications, like shipping containers, can be loaded, unloaded and run flexibly in any environment. They improve R&D efficiency and greatly reduce operation and maintenance costs, and have thus set off a new wave of technology.

“Container” has been a hot keyword in cloud computing since Docker emerged in 2013. According to public information released in 2019, the open source Docker has been downloaded more than 80 billion times, and about one-third of the Fortune 100 and one-fifth of the Global 500 use Docker Enterprise edition. Gartner predicts that 75% of enterprises will use containers by 2022.

Unlike in the past, when China simply followed Silicon Valley, containerization in China began 10 years ago.

The performance crisis

Among Chinese companies, Alibaba was the first to make a move.

In 2011, as cloud computing became popular, Alibaba left the era of physical machines behind and moved fully to virtual machines.

If a physical machine is the computer at home, a virtual machine is one of many small computers simulated on that computer. It has complete software and hardware functions, and using it feels the same as using a real computer. But virtualization has a performance cost. If a physical machine with 100 CPUs is split into 100 small computers, only 90 of them do real work; the other 10 are taken up by extra management work and are effectively wasted.

At small scale this hardly matters, but across Alibaba’s clusters of tens of thousands of machines, virtualization alone consumed more computing power than a mid-sized Internet company uses in total.
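The arithmetic behind this loss can be sketched in a few lines of Python. The 10% overhead comes from the 100-CPU example above; the fleet of 10,000 machines with 100 CPUs each is a purely illustrative assumption, not Alibaba’s real figure.

```python
def cpus_lost_to_virtualization(total_cpus: int, overhead: float = 0.10) -> int:
    """Return how many CPUs do management work instead of running applications."""
    return round(total_cpus * overhead)


# The example from the text: of 100 CPUs, 10 are lost to overhead.
print(cpus_lost_to_virtualization(100))

# Hypothetical fleet: 10,000 machines with 100 CPUs each (illustrative numbers only).
fleet_cpus = 10_000 * 100
print(cpus_lost_to_virtualization(fleet_cpus))
```

Under these assumed numbers the overhead alone adds up to the capacity of a thousand 100-CPU machines, which is why a loss that looks negligible on one machine becomes a serious cost at fleet scale.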

In 2011, to alleviate the huge loss from virtualization, Taobao’s first programmer Cai Jingxian (alias: Duolong) and first-generation architect Lin Hao (alias: Bi Xuan) developed, almost by accident, Ali’s first-generation container: T4.

T4 is also based on physical machines. Whereas a normal virtual machine runs the entire operating system on a virtual hardware platform, thereby providing the runtime environment for applications to run, T4 loads and runs applications directly on the host platform.

As a result, T4 offered the same user experience as a virtual machine with less performance loss. It was popular as soon as it launched and gradually replaced virtual machines as the computing resource for the group’s entire transaction system.

But it still failed to address Alibaba’s high operating and maintenance costs.

At that time, to keep the huge cluster stable, Ali’s operation and maintenance team of more than 300 people worked in shifts around the clock, yet still could not keep up with business volume that rose year after year. During Double 11, the number of users and applications surged, and hundreds of engineers had to expand capacity by hand, building dams out of manpower so as not to be overwhelmed by the flood peak.

In 2013, Dotcloud, a little-known company on the other side of the ocean, open-sourced its container technology, Docker.

Containers were not Dotcloud’s invention, but its open-source Docker introduced images to containers.

An image, in short, is a compressed package that contains the application code and all the files and directories it depends on. Once the image is packaged and uploaded to the image library, engineers can simply go to the library and download the image, creating a seamless replica of the previous container, no matter what the environment.

With it, engineers have the magic to quickly build containers in any environment they want.
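The idea of an image as a package that reproduces the same files anywhere can be illustrated with a minimal Python sketch. It is only an analogy: a real Docker image is layered and carries metadata, whereas this example uses a plain zip archive, and the function names `build_image`, `create_replica` and `fingerprint` are hypothetical.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def build_image(app_dir: Path, image_path: Path) -> None:
    """Pack the application code and the files it depends on into one archive."""
    with zipfile.ZipFile(image_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(app_dir.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(app_dir).as_posix())


def create_replica(image_path: Path, target_dir: Path) -> None:
    """Unpack the archive into a fresh directory, standing in for a new container."""
    with zipfile.ZipFile(image_path) as archive:
        archive.extractall(target_dir)


def fingerprint(root: Path) -> str:
    """Hash every file's relative path and content so two trees can be compared."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def demo() -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        app = tmp_path / "app"
        (app / "lib").mkdir(parents=True)
        (app / "main.py").write_text("print('hello')\n")
        (app / "lib" / "dependency.txt").write_text("libfoo 1.2\n")

        image = tmp_path / "app-image.zip"
        build_image(app, image)

        # Two different "environments" both start from the same image.
        replica_a, replica_b = tmp_path / "env_a", tmp_path / "env_b"
        create_replica(image, replica_a)
        create_replica(image, replica_b)
        return fingerprint(app) == fingerprint(replica_a) == fingerprint(replica_b)


print(demo())  # True: both replicas match the original file tree
```

Because everything the application needs travels inside the archive, the replica does not depend on what happens to be installed in the target environment.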

It neatly solves two of the biggest problems of traditional IT processes: low release success rates and heavy operational pressure.

In traditional IT processes, the responsibilities of business, R&D, and O&M are unclear. O&M has to adjust parameters again and again to keep applications running stably.

But with Docker, it’s a different story. While writing the application code, the development engineer also declares the environment the application depends on, which naturally improves the release success rate and reduces the pressure on operation and maintenance. This gave rise to the DevOps way of working.

Docker flapped its wings, and the tech world was swept up in a container storm. With images, the container became the “shipping container” of IT, and its supporting technologies and standards soon followed.

Microsoft, Google, Amazon and other giants embraced it warmly. Dotcloud, a small company on the verge of extinction, became a hot new star. It renamed itself Docker after the product that had made its name and began to commercialize it.

On the other side of the ocean, Ali was not far behind. In 2015, Lin Hao recruited Lin Xuan to join the team and keep improving T4. Lin Xuan was quick to sense the container wave. He, Lin Hao and Cai Jingxian agreed to add image support to T4 to keep the technology current.

Rallied by Lin Xuan, Yang Yubing (alias: Shen Ling), Zhang Zhen (alias: Shou Chen), A Xiao and others joined, forming the earliest container team. After more than half a year, they successfully upgraded T4 to Alidocker.

A pot of cold water

The container team expected to replace T4 entirely with Alidocker. They had hoped to make a splash, but what awaited them was a seat on the cold bench.

As mentioned earlier, a container encapsulates a business application, so replacing it changes the whole R&D and O&M chain. Alibaba’s technology system is one of the most complex in the world, and a single parameter change can cause a failure, so few business units were willing to take the risk.

But the container team had its own considerations. In a world of constantly iterating technology, even monopolistic products, technologies and companies can suddenly be swept aside by the tide of the times.

The sense of crisis is the sword of Damocles for tech companies.

Docker officially became the standard for containers in 2015, but Ali was still using T4.

They could wait no longer. The container team went door to door with slide decks to promote Alidocker. To impress the business teams, Lin Xuan introduced Yang Yubing as “our container prince” and Zhang Zhen as “the image prince”. He also wrote a long post on the intranet, calling out: “Ali people, please use Alidocker!”

The business teams heard about Alidocker and the princes but remained unmoved. When the team approached them directly, all they asked about was T4. A few, unable to refuse such enthusiasm, handed over a handful of obscure peripheral applications.

Fast forward to June 2016, and the team had received only a little over 30 applications. By then every business unit had begun preparing for Double 11, which made persuading them to use Alidocker even harder.

Anxious, Lin Xuan made an impassioned case for images at a meeting of 100 people: “No matter how skilled you are with bow and horse, sooner or later you will be replaced by guns, artillery and atomic bombs. If we don’t accept advanced civilization, we will end up like the Qing Dynasty.”

With things at a deadlock, Zhang Jianfeng (alias: Xing Qi), then CTO of Ali Group, stepped forward: “I support Alidocker. Online applications should be 100% containerized!”

Zhang Jianfeng had a complete plan for containerization. The advantages of containers were obvious to all, and Ali’s business was growing explosively year after year. Once containers were deployed at scale, running them on cloud servers would be the inevitable next step.

Ali Cloud’s Feitian (Apsara) operating system could already schedule tens of thousands of physical machines, but because of virtualization loss, containers still ran only on physical machines and could not enjoy the convenience of the cloud.

To create the most suitable base for containers, at the Double 11 review meeting in 2016 Zhang Jianfeng asked the elastic computing team led by Zhang Xiantao (alias: Xu Qing) to build the Shenlong server and overcome a problem that had dogged cloud computing for more than ten years: reducing virtualization loss to zero.

With containers on top and this resource base underneath, Ali could build an agile and efficient business operation system.

With the support of Zhang Jianfeng, Ali’s containerization process pressed the “acceleration button”, and a containerization reform across 5 business divisions, 9 teams and 11 business domains began.

Settling the matter

The storm set off by containers kept blowing through the technology world. For Docker to be applied at large scale, orchestration and scheduling became particularly important.
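To make “scheduling” concrete, here is a toy first-fit scheduler in Python. It is a deliberately simplified sketch, not how any real orchestrator works: the `Node` class, the `schedule` function and the CPU figures are all hypothetical, and real systems weigh many more constraints.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu_capacity: int
    containers: list = field(default_factory=list)
    cpu_used: int = 0

    def fits(self, cpu_request: int) -> bool:
        return self.cpu_used + cpu_request <= self.cpu_capacity


def schedule(requests: dict, nodes: list) -> dict:
    """Place each container on the first node with enough free CPU (first-fit)."""
    placement = {}
    for container, cpu_request in requests.items():
        for node in nodes:
            if node.fits(cpu_request):
                node.containers.append(container)
                node.cpu_used += cpu_request
                placement[container] = node.name
                break
        else:
            # No node has room; a real system would queue the request or add capacity.
            placement[container] = None
    return placement


nodes = [Node("node-1", cpu_capacity=4), Node("node-2", cpu_capacity=4)]
requests = {"web": 2, "cart": 3, "search": 2, "batch": 4}
print(schedule(requests, nodes))
# {'web': 'node-1', 'cart': 'node-2', 'search': 'node-1', 'batch': None}
```

Even in this tiny example one container cannot be placed, which hints at why deciding where thousands of containers should run is a problem in its own right.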

In 2015, Google’s K8s, Docker’s Swarm, and the open-source community’s Mesos dominated the container orchestration market.

To keep Docker from dominating, Google, Red Hat and other open-source players jointly established a foundation named CNCF (Cloud Native Computing Foundation), which is essentially an open-source community with K8s at its core.

The container orchestration market entered a two-year free-for-all. Ali’s containerization was also in full swing.

Alidocker was unstable at the beginning, with stalled releases, delayed responses, slow image downloads, difficult capacity expansion… Every bug left a business team fuming. The container team was on duty in shifts around the clock, seven days a week. To fit each business closely, they iterated through more than a dozen versions every week. Yang Yubing wrote “1000 Details of Alidocker” to answer questions.

In September 2016, Double 11 entered the full-link pressure test, and the container team’s every move was under the spotlight.

In the first pressure test, Alidocker failed because its release pipeline was too slow. To solve the problem, Yang Yubing worked through every related upstream and downstream system, and Zhang Zhen worked on images until he was sick of them, until every Double 11 transaction could run.

As time approached, even Xiaoyao, CEO of Ali Group, could not help asking: “Bi Xuan, do you think Alidocker is reliable this year?”

An arrow once loosed cannot be called back, and Double 11 arrived on schedule. At midnight, the numbers on the big screen began to spin. The container team was glued to the screen.

At 175,000 orders per second, the order creation peak broke the record again. With hundreds of applications, 200,000 containers and 100% of core traffic, Alidocker held up, and data processing capacity increased fivefold.

With a mountain of weight lifted from their shoulders, the container team broke into smiles for the first time in a long while.

Break through the bottleneck

Internationally, the battle for the container orchestration market was also becoming clear.

At the end of 2017, Docker announced its support for K8s. K8s had won and became the standard for container orchestration platforms. With K8s as its moat, the CNCF community quickly launched a series of well-known tools and projects for the container ecosystem. A large number of companies and startup teams began to develop containerization strategies around CNCF rather than Docker. The community grew ever more prosperous, drawing participants from China and abroad.

The top five cloud vendors such as AWS, Azure, Ali Cloud, Google Cloud and IBM Cloud have all become CNCF members and provided K8s services in their cloud platforms — foreign media said that “they have confirmed that cloud native and container is the future of enterprise computing”.

With such examples ahead of them, Chinese companies accelerated their containerization. Since 2017, Huawei has run core internal services such as Huawei terminal cloud services in containers. A year later, Tencent’s self-developed businesses began moving to the cloud, with plans to complete a cloud-native transformation.

Alibaba went even further, with its online business becoming fully containerized (Alidocker officially changed its name to PouchContainer) in 2017, increasing the number of containers to one million.

It was Ding Yu (alias: Shu Tong) who led the team to complete the project. Ding Yu took over the container team at the beginning of 2017. He had also created the full-link pressure test, Ali’s key weapon in preparing for Double 11.

Containers sitting on the docks could not, on their own, have made such a difference. McLean’s great achievement was to devise a whole new freight system around containers, including managers, ports, cargo ships, cranes, trucks and a whole new delivery process.

The same is true for the IT industry. Containers by themselves only handle packaging and simple loading and unloading; supporting facilities such as orchestration and scheduling are just as indispensable.

To get the most out of containers, the team integrated the full set of supporting facilities and upgraded PouchContainer to Alibaba Serverless Infrastructure (ASI).

With the ASI container service, users only need to care about the applications in the container and can leave the creation, scheduling, operation and maintenance of the containers to the container provider.

To let Ali Group, Ali Cloud and external users enjoy the same container services, ASI had to be compatible with the general standards K8s and Docker, which was extremely difficult.

“You have to make sure that millions of containers move to ASI in a stable way,” said Huang Tao (alias: Zhi Qing), head of the ASI project. “It’s like performing a heart transplant on a runner in the middle of a 100-meter dash.”

K8s also had a serious weakness: once a cluster reached the scale of ten thousand nodes, delays and denied requests appeared, so it could not support clusters of tens of thousands of nodes. This was a bottleneck the whole industry had struggled to break.

Docker also has weaknesses. When creating containers, image downloads are slow and rapid expansion cannot be achieved.

Huang Tao did not dare take it lightly. He picked three people to form a commando squad and scout the way for the main force. They spent two months and tried everything, but still could not come up with a solution.

Lin Hao went overseas and brought in Zhang 19s and Li Xiang. Zhang used to manage a cloud-native resource pool at Google. Li Xiang is the author of the etcd distributed storage system, which has become an industry standard since its release. In 2019, Li became one of the nine members of the CNCF Technical Oversight Committee, the first Chinese member in the committee’s history.

One of the major causes of ASI’s performance bottleneck was storage on the management nodes. To break through it, the Ali technical team improved the allocation algorithm of etcd’s underlying storage engine and raised its storage capacity from 2GB to 100GB without adding latency. Beyond storage, the team also improved the performance of the management nodes by pre-loading the data that nodes need and by reducing synchronization events.

Meanwhile, there was good news for PouchContainer. Previously, an image had to be pulled in full before a container could be created, and when a large number of services were involved, capacity expansion could take as long as 10 minutes. The team developed an image technology that loads images on demand, so that download and expansion complete in seconds.
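The difference between pulling a whole image and loading it on demand can be sketched as follows. The source does not describe how the team’s technology works internally, so this is only an illustration of the general idea: the in-memory `RemoteImage`, the file-level granularity, and all class names and sizes are assumptions.

```python
class RemoteImage:
    """Stands in for an image stored in a remote registry (hypothetical, in-memory)."""

    def __init__(self, files: dict):
        self._files = files
        self.bytes_downloaded = 0

    def fetch(self, path: str) -> bytes:
        data = self._files[path]
        self.bytes_downloaded += len(data)
        return data

    def paths(self) -> list:
        return list(self._files)


class EagerContainer:
    """Pulls the whole image before the application can start."""

    def __init__(self, image: RemoteImage):
        self._local = {p: image.fetch(p) for p in image.paths()}

    def read(self, path: str) -> bytes:
        return self._local[path]


class LazyContainer:
    """Starts immediately and fetches a file only the first time it is read."""

    def __init__(self, image: RemoteImage):
        self._image = image
        self._cache = {}

    def read(self, path: str) -> bytes:
        if path not in self._cache:
            self._cache[path] = self._image.fetch(path)
        return self._cache[path]


def make_image() -> RemoteImage:
    return RemoteImage({
        "bin/app": b"x" * 1_000,
        "lib/rarely_used.so": b"y" * 50_000,
        "docs/manual.txt": b"z" * 20_000,
    })


eager_image, lazy_image = make_image(), make_image()
EagerContainer(eager_image).read("bin/app")
LazyContainer(lazy_image).read("bin/app")
print(eager_image.bytes_downloaded, lazy_image.bytes_downloaded)  # 71000 1000
```

In this toy setup the application only needs `bin/app` to start, so the on-demand container transfers a small fraction of the image before it can begin serving.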

Photo: front row, first from left: Huang Tao; second from right: Zhang Zhen. Back row, fourth from left: Yang Yubing; third from right: Lin Hao; second from right: Lin Xuan.

Having broken through these two bottlenecks, ASI solved a problem that had stumped the industry and ran stably on clusters of more than 10,000 nodes. More and more business teams inside and outside Ali came on board.

A bloody lesson

Reality always hits hard when you don’t expect it.

“A basic technology team, in the midst of driving technological change, with all eyes on it, caused a huge mess…” After one major failure, team members recall, the president called the technical director directly, and everyone on the team was terrified.

After the accident, the business parties that planned to access ASI withdrew one after another. The morale of ASI team was low, and some left under pressure.

After painful reflection, Ding Yu posted a self-critical message on the intranet: “If the foundation is not solid, the earth will shake. This year, I ran too fast and kept adding new business, ignoring the ability of my team. I hope everyone can remember this bloody lesson and learn from it.”

Immature scheduling, like a naughty child with its hand on a critical button, can have disastrous consequences. Once, the scheduling system misjudged and wiped out every container in an entire machine room. Tens of thousands of containers disappeared in a second, “just like the sky fell in.”

Huang Tao had to start thinking about safeguards for the most extreme cases: how to recover the containers if all of Ali’s clusters went down, or how to keep the business unaffected if the scheduling system crashed.

To build an impregnable “container”, they worked tirelessly.

Triumph in the skies

At the same time, Ali Group began the road to the cloud with vigour and vitality.

At the beginning of 2019, Zhang Jianfeng, then CTO of Ali Group and president of Ali Cloud Intelligence, called a meeting with Ali’s core technical staff. “Starting from this year, Alibaba will no longer purchase physical machines, and all new computing will be on public cloud.”

It was Ali’s time to take to the skies. The ASI team decided to put all the containers on the Shenlong server developed by Ali to complete the most critical step: going to the cloud.

The Shenlong server reduces virtualization overhead to almost nothing, can cut computing costs by 50%, and can improve container performance by 30%. It is truly a cloud server made for containers.

However, PouchContainer ran on Ali’s physical machines, which had been in use for more than 10 years and hosted tens of thousands of applications. In moving everything from physical machines onto the cloud, the ASI team could hardly cover every aspect on its own. To win the cooperation of the business teams, Ding Yu took everyone to the front-line technical teams to explain how to move to the cloud.

At the same time, to truly let Ali Cloud, Ali Group and the open-source community enjoy the same container services and achieve a “trinity”, Li Xiang led the team to integrate with the cloud product ACK (Ali Cloud’s container service for external customers). The team pored over the ACK code and shut themselves in Ali Cloud’s Feitian campus for several weeks, vowing “not to leave Xixi” until the work was done.

The massive project, which took nearly a year and involved 50,000 Alibaba engineers, withstood the world’s highest traffic peak on November 11, 2019. Alibaba officially announced that 100% of its core systems were running on Ali Cloud.

ASI also successfully joined forces with the cloud product ACK. Today’s ACK not only retains its various cloud capabilities but also copes with the complex business environment of Alibaba Group.

Less well known is that this is also the largest cloud native practice in the world.

Cloud native means building IT systems around the idea of being “born on the cloud”. After containers, cloud native middleware, databases, servers and other basic systems were released one after another, together forming a vast cloud native territory.

Just as the shipping container created a new shipping system, the software container opened a new era of cloud computing.

The power of the container

2019 and 2020 are destined to be two exciting years in the history of cloud computing.

Google launched Anthos, a hybrid cloud and multi-cloud management platform centered on K8s. Microsoft launched Open Service Mesh, an open-source cloud native service mesh. Huawei Cloud released its second generation of zero-loss bare metal containers. And in 2020, Alibaba carried its 2019 effort, the largest cloud native practice, even further…

Cheng Li (alias: Lu Su), CTO of Alibaba Group, said: “Alibaba’s core systems have become fully cloud native. The IT cost per 10,000 peak transactions is 80% lower than four years ago, the efficiency of large-scale application delivery has more than doubled, we can scale out to over one million containers within one hour, and elastic scalability has increased more than tenfold.”

The cloud native tide is rushing in.

The research firm Gartner predicts that by 2020, 50 percent of traditional legacy applications will have been transformed in a cloud-native way, and that by 2022, 75 percent of global enterprises will be running cloud-native containerized applications in production.

No one doubts any longer that “cloud native is the future”. At the Cloud Computing Conference in September 2020, Alibaba set up its Cloud Native Technology Committee. Jiang Jiangwei (alias: Xiao Xie), who heads it, said the committee would help millions of enterprises carry out cloud native transformation, improving R&D efficiency by 30% while reducing IT costs by 30%.

Cloud native expands the boundaries of commerce and permeates every corner of human activity.

  • China Mobile uses containers instead of virtual machines; Mybank has adjusted more than 400 applications to the cloud native architecture.

  • Jingdong Cloud is also undergoing microservice and container transformation.

  • During the outbreak of the epidemic, DingTalk, using Ali Cloud’s container solution, added 10,000 hosts within 2 hours to support 200 million office workers working online.

  • Shentong Express moved its core system to the cloud and deployed Ali Cloud containers on a large scale. While handling hundreds of millions of packages in transit, the system has stayed as stable as Mount Tai, and IT costs have fallen by 30%.

  • Alibaba’s self-developed cloud native technologies, Virtual Cluster and OpenKruise, have been adopted by LinkedIn, Apple and other companies since being open-sourced in the community.

Five years ago, a search would have turned up “Docker Basics introduction” and “What is cloud native?” Search again today, and “cloud native” is what the industry calls “the next era of cloud computing.”

And it all started with that box.