Cloud native has become a cornerstone of innovation in the digital economy and is profoundly changing the way enterprises adopt and use the cloud. A cloud native approach helps enterprises maximize the value of the cloud, but it also brings a new round of changes to computing infrastructure, application architecture, organizational culture, and R&D processes. These business and technical challenges have given rise to a new generation of cloud native O&M technology.
This article is based on the transcript of a talk given by Yi Li, senior technical expert and head of container service R&D at Alibaba Cloud, at the "2021 On-Cloud Architecture and Operation Summit" co-hosted by Alibaba Cloud. It covers the important changes in operations technology in the cloud native era, as well as CloudOps practices drawn from Alibaba Cloud's development of super-scale cloud native applications.
New businesses bring new opportunities and new challenges
Alibaba Cloud defines cloud native as the software, hardware, and architecture born of the cloud that help enterprises maximize the value of the cloud. The changes brought by cloud native technologies span several dimensions:
• The first is the change in computing infrastructure: new forms of computing such as virtualization, containers, and function computing help applications run efficiently in public cloud, private cloud, and edge cloud environments.
• The second is the change in application architecture: technologies such as microservices and service mesh help enterprises build modern applications that are distributed, loosely coupled, highly elastic, and highly fault tolerant.
• Finally, changes in organization, culture, and process: ideas such as DevOps, DevSecOps, FinOps, and SRE continue to drive the modernization of software development processes and organizations.
Looking back at the background against which cloud native emerged: the mobile Internet changed the form of business and the way people communicate, so that anyone can easily obtain the services they need anytime, anywhere. IT systems must cope with the rapid growth of Internet traffic, iterate quickly, and allow low-cost trial and error.
A series of Internet companies represented by Netflix and Alibaba drove a new generation of application architecture, and microservice frameworks such as Spring Cloud and Apache Dubbo emerged. Microservice architecture solves several problems of traditional monolithic applications: each service can be deployed and delivered independently, greatly improving business agility, and each service can be scaled independently to meet Internet-scale challenges.
Compared with a traditional monolithic application, a distributed microservices architecture iterates faster, keeps per-service development complexity lower, and scales better. At the same time, the complexity of deploying and operating it increases greatly. How do we respond?
In addition, "pulse" computing has become the norm. During the Double 11 shopping festival, for example, the computing power required at midnight is dozens of times the usual level, and a breaking news event can send millions of users flocking to social media. Cloud computing is undoubtedly a more economical and efficient way to handle such sudden traffic peaks. How to use and manage the cloud well, and how to let applications better exploit the elasticity of the infrastructure, have become the focus of enterprise operations teams. These business and technical challenges also gave rise to CloudOps, a cloud native O&M technology architecture.
Transformation of operation and maintenance technology in cloud native era
• Standardization: standards promote communication and collaboration between development and operations teams. They also enable division of labor across the ecosystem and promote the emergence of more automation tools.
• Automation: only automated operations can support Internet-scale challenges and the rapid iteration and stability of the business.
• Digitalization: data-driven, AI-enhanced automated operations will be an inevitable trend.
In traditional application distribution and deployment, tools are often fragmented due to the lack of standards. For example, deploying a Java application and an AI application requires completely different technology stacks, resulting in low delivery efficiency. In addition, to avoid environment conflicts between applications, we often have to deploy each application on a separate physical machine or virtual machine, which wastes a lot of resources.
In 2013, the open source container technology Docker emerged, pioneering application distribution and delivery based on container images and reshaping the entire life cycle of software development, delivery, and operations.
Just as in the traditional supply chain, goods of any kind are transported in standard shipping containers, which greatly improves logistics efficiency and makes global division of labor and coordination possible.
Container images package applications together with their dependent environments. Images can be distributed through an image registry and run consistently in development, test, and production environments.
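The "build once, run consistently everywhere" property rests on images being content-addressed: the same ordered layers always produce the same digest, so every environment can verify it is running the exact same artifact. A minimal Python sketch of the idea (the layer contents below are purely illustrative):

```python
import hashlib

def image_digest(layers: list[bytes]) -> str:
    """Compute a content-addressed digest over ordered image layers.
    Any change to the app or one of its dependencies changes the digest."""
    h = hashlib.sha256()
    for layer in layers:
        h.update(hashlib.sha256(layer).digest())
    return "sha256:" + h.hexdigest()

# The same layers always yield the same digest, in dev, test, or prod.
app = [b"base-os", b"jdk-11", b"app.jar"]
assert image_digest(app) == image_digest(list(app))

# Changing any layer (e.g. the JDK) yields a different digest.
assert image_digest([b"base-os", b"jdk-8", b"app.jar"]) != image_digest(app)
```

Real OCI images work the same way in spirit: each layer and the image manifest are identified by a SHA-256 digest, which is what makes image distribution verifiable.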
Container technology is a lightweight OS virtualization capability that improves application deployment density and resource utilization. Compared with traditional virtualization technologies, container technology is more agile, lightweight, flexible, and portable.
Containers, as the "shipping containers" of the cloud era, have reshaped the entire software supply chain and set off the wave of cloud native technology.
3. Container technology accelerates the adoption of immutable infrastructure

In traditional software deployment and change processes, differences between environments often make applications unavailable. For example, if a new version of an application depends on JDK 11 but the JDK in the deployment environment has not been updated, the application will fail to start. "It works on my machine" has become a catchphrase among developers. And as time goes by, in-place upgrades accumulate untested configuration changes, so that a single careless modification can break the system.
Immutable Infrastructure is an idea proposed by Chad Fowler in 2013. Its core principle: once an infrastructure instance is created, it becomes read-only; if it needs to be modified or upgraded, it is replaced with a new instance.
This pattern reduces the complexity of configuration management, ensures that system configuration changes can be performed reliably and repeatably, and allows quick rollback in the event of a deployment error. Container technologies such as Docker and Kubernetes are the best way to implement the immutable infrastructure pattern: when we update the image version of a containerized application, Kubernetes creates new containers, routes new requests to them through load balancing, and then destroys the old containers, avoiding the headache of configuration drift.
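The replace-instead-of-mutate pattern can be sketched in a few lines of Python. This is a toy model, not Kubernetes' actual rollout logic: each instance is just a dict, and "routing traffic" and "destroying" are stand-in steps:

```python
def rolling_update(instances: list[dict], new_image: str) -> list[dict]:
    """Replace instances instead of mutating them in place: create a
    replacement from the new image, shift traffic to it, then retire
    the old one. Rollback is just another rolling_update with the
    previous image."""
    updated = []
    for old in instances:
        new = {"image": new_image, "ready": True}  # create a fresh instance
        updated.append(new)                        # route traffic to it
        old["ready"] = False                       # take the old one out of rotation
    return updated

fleet = [{"image": "app:v1", "ready": True} for _ in range(3)]
fleet = rolling_update(fleet, "app:v2")
assert all(i["image"] == "app:v2" for i in fleet)

# Rolling back is simply replacing again with the old version:
fleet = rolling_update(fleet, "app:v1")
assert all(i["image"] == "app:v1" for i in fleet)
```

Because no instance is ever edited in place, there is no accumulated drift to debug, and every deployed state corresponds to a known image version.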
At present, container images have become the standard for distributed application delivery, and Kubernetes has become the standard for distributed resource scheduling.
More and more applications are managed and delivered by means of containers: from stateless Web applications, stateful database and message applications, to data-oriented and intelligent applications.
According to the CNCF 2020 survey, 55% of respondents already run stateful applications in containers in production; Gartner predicts that by 2023, 70% of AI workloads will be built on containers or Serverless.
Comparing the conceptual models of the classic Linux operating system and Kubernetes, both aim to encapsulate resources downward and support applications upward, providing standardized APIs to manage the application life cycle and improve application portability.
In contrast, Linux's scheduling unit is the process, and its scope is limited to a single compute node. Kubernetes' scheduling unit is the Pod, a process group, and its scheduling scope is a distributed cluster, supporting the migration of applications across environments such as public cloud and private cloud.
For the operations team, Kubernetes became the best platform to implement the CloudOps concept.
The first is K8s's declarative API, which allows developers to focus on the application itself rather than the details of system implementation. For example, Kubernetes provides abstractions for Deployment, StatefulSet, Job, and other types of application workloads. Declarative APIs are an important cloud native design concept, sinking system complexity into the infrastructure, where it can be implemented and continuously optimized.
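The difference a declarative API makes can be illustrated with a small, hypothetical planner in Python: the user declares only the target state, and the system derives the imperative steps. The field names and action strings are invented for illustration:

```python
def plan(current: dict, desired: dict) -> list[str]:
    """Given the observed state and a declared target state, compute
    the actions needed. The user never writes these steps themselves."""
    actions = []
    diff = desired["replicas"] - current["replicas"]
    if diff > 0:
        actions.append(f"create {diff} pod(s)")
    elif diff < 0:
        actions.append(f"delete {-diff} pod(s)")
    if desired["image"] != current["image"]:
        actions.append(f"roll pods to image {desired['image']}")
    return actions

# The user only declares *what* they want, not *how* to get there:
desired = {"replicas": 5, "image": "web:v2"}
current = {"replicas": 3, "image": "web:v1"}
assert plan(current, desired) == ["create 2 pod(s)", "roll pods to image web:v2"]
```

This is the same division of labor as a Kubernetes Deployment: the user submits a desired spec, and the platform owns the procedure for reaching it.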
In addition, K8s provides an extensible architecture in which all components are implemented and interact through consistent, open APIs. Developers can provide domain-specific extensions through CRDs (Custom Resource Definitions) or Operators, which greatly expands the application scenarios of K8s.
Finally, K8s provides platform-independent technical abstractions, such as CNI network plug-ins and CSI storage plug-ins, which mask infrastructure differences from upper-layer business applications.
Why Kubernetes?

The magic behind Kubernetes' success is the control loop, which rests on a few simple concepts.
First, everything is a resource, managed automatically through a controller. Users declare the target state of a resource; when the controller finds that the current state is inconsistent with the target state, it continuously adjusts the resource toward the target. In this way, many situations can be handled uniformly, for example scaling in or out according to the declared number of application replicas, or automatically migrating applications after a node goes down.
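The control loop described above can be sketched as follows. This is an illustrative model, not real Kubernetes controller code; state is reduced to a single replica count:

```python
def reconcile(current: dict, target: dict) -> dict:
    """One iteration of the control loop: move the observed state one
    step toward the declared target state."""
    state = dict(current)
    if state["replicas"] < target["replicas"]:
        state["replicas"] += 1   # scale up one replica at a time
    elif state["replicas"] > target["replicas"]:
        state["replicas"] -= 1   # scale down
    return state

# The controller keeps reconciling until observed == declared.
state, target = {"replicas": 0}, {"replicas": 3}
while state != target:
    state = reconcile(state, target)
assert state == {"replicas": 3}

# The same loop also heals disturbances, e.g. a node failure killing a replica:
state["replicas"] -= 1
while state != target:
    state = reconcile(state, target)
assert state == {"replicas": 3}
```

The key property is that the loop never encodes a one-off procedure: whatever perturbs the system, repeatedly closing the gap between observed and declared state restores it.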
Because of this, the resources Kubernetes supports go far beyond container applications. For example, a service mesh can manage application traffic declaratively, and Crossplane can use K8s CRDs to manage and abstract cloud resources such as ECS and OSS.
"Keep the complexity for yourself and give simplicity to the user" is the beautiful ideal of the K8s controller, but implementing an efficient and robust controller is full of technical challenges.
OpenKruise is an open source cloud native application automation engine from Alibaba Cloud and a sandbox project donated to the Cloud Native Computing Foundation (CNCF). It distills Alibaba's years of containerization and cloud native experience, and solves the automation and stability challenges of container applications in large-scale production environments.
OpenKruise provides enhanced capabilities such as application grayscale release, stability protection, and Sidecar container management, among many others.
The open source version of OpenKruise shares the same code as the version used inside Alibaba Group. It has been widely adopted by enterprises such as Suning, OPPO, Xiaomi, and Lyft. Community contributions and usage feedback are welcome.
7. GitOps: Declarative APIs spawn new application delivery processes and ways of collaborating

Infrastructure as Code (IaC) is a typical declarative API that changes how cloud resources are managed, configured, and orchestrated. IaC tools allow us to automate the creation, assembly, and configuration of cloud resources such as servers, networks, and databases.
The IaC concept can be extended to cover the entire delivery and operations process of cloud native software, namely Everything as Code. This chart lists the various models involved in a cloud native application, from infrastructure to application definition, application delivery management, and security architecture, all of whose configuration can be managed declaratively.
For example, Istio can be used to declaratively handle application traffic switching, and OPA (Open Policy Agent) can be used to define runtime security policies.
Going further, all of an application's environment configuration can be managed in Git, a source control system, and delivered and changed through an automated process. This is the core idea of GitOps.
First, everything from the application definition to the infrastructure environment is stored in Git as source code. All changes and approvals are also recorded in Git's history. In this way Git becomes the Source of Truth: we can trace the change history and roll back to any specified version.
Combined with declarative apis and immutable infrastructure, GitOps ensures reproducibility of application environments and improves delivery and management efficiency. GitOps has been widely used in Alibaba Group and is also supported in Alibaba Cloud container service ACK. The GitOps open source community is also evolving tools and best practices, so stay tuned.
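The Git-as-source-of-truth idea can be modeled with a toy in-memory "repository" in Python. The class and field names are invented for illustration; a real GitOps setup would use an actual Git repo and a sync agent:

```python
class GitOpsRepo:
    """A toy model of GitOps: every environment change is a commit,
    the deployed state is always derived from a commit, and rollback
    means pointing the cluster at an earlier commit."""
    def __init__(self):
        self.history = []                  # ordered commits: the audit trail

    def commit(self, config: dict) -> int:
        self.history.append(dict(config))
        return len(self.history) - 1       # commit id

    def deployed_state(self, rev: int = -1) -> dict:
        return dict(self.history[rev])     # cluster state == repo content

repo = GitOpsRepo()
v0 = repo.commit({"image": "shop:v1", "replicas": 2})
v1 = repo.commit({"image": "shop:v2", "replicas": 4})
assert repo.deployed_state() == {"image": "shop:v2", "replicas": 4}

# Rolling back is just deploying an earlier revision from history:
assert repo.deployed_state(v0) == {"image": "shop:v1", "replicas": 2}
```

Because the deployed state is always a function of a commit, the environment is reproducible and every change is attributable, which is exactly what the Source of Truth property buys you.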
Distributed systems are highly complex, and a problem anywhere in the application, the infrastructure, or the deployment process can bring the business system down. In the face of such uncertain risks, we have two options: one is to leave it to fate and pray that nothing breaks; the other is to take the initiative and use a systematic approach to improve the certainty of the system.
In 2012, Netflix proposed the concept of "chaos engineering": proactively inject faults to find the weak links of a system in advance, drive architectural improvement, and ultimately achieve business resilience. Chaos engineering works like a vaccine: by "inoculating" the system with controlled, inactivated faults, we train its immune system to fight off real failures.
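A fault-injection experiment of this kind can be sketched in plain Python: wrap a dependency so that it fails randomly, then verify that a resilience measure (here, simple retries) keeps the service healthy. The failure rate and retry policy are illustrative, not a recommendation:

```python
import random

def flaky(call, failure_rate: float, rng: random.Random):
    """Wrap a dependency so it fails with the given probability,
    i.e. the injected fault of a chaos experiment."""
    def wrapped():
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return call()
    return wrapped

def with_retry(call, attempts: int = 3):
    """The resilience measure the experiment is meant to validate."""
    for i in range(attempts):
        try:
            return call()
        except ConnectionError:
            if i == attempts - 1:
                raise

rng = random.Random(42)  # fixed seed: a repeatable experiment
dependency = flaky(lambda: "ok", failure_rate=0.5, rng=rng)

results = []
for _ in range(100):
    try:
        results.append(with_retry(dependency))
    except ConnectionError:
        results.append("outage")

# Retries mask most injected faults: per-call failure prob is 0.5**3 = 12.5%.
assert results.count("ok") >= 75
```

The point of the experiment is the assertion at the end: it states a resilience hypothesis ("the service survives a 50% dependency fault rate") that either holds or exposes a weak link to fix.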
The smooth success of Alibaba's Double 11 shopping festival is inseparable from large-scale chaos engineering practices such as full-link stress testing, in which the Alibaba team has accumulated rich practical experience.
ChaosBlade is an experimental toolset built on the chaos engineering concept, featuring rich scenarios and ease of use, and it has become a CNCF sandbox project. It supports runtime environments such as Linux, Kubernetes, and Docker, languages such as Java, Node.js, C++, and Golang, and ships with more than 200 built-in fault scenarios.
Chaosblade-box is a newly introduced chaos engineering console that provides platform-based management of experiment environments, further simplifying the user experience and lowering the barrier to entry. You are welcome to try ChaosBlade and the AHAS cloud service.
The road to cloud native CloudOps
Finally, I will introduce some of our explorations of CloudOps, drawing on Alibaba's practice.
In traditional organizations, development and operations roles are strictly separated, and different business lines each build their own siloed stacks. From infrastructure operations to application operations and development, teams are independent and lack good coordination and reuse.
The advent of the cloud is also changing modern IT organizations and processes.
First, public and private clouds become shared infrastructure between different business units.
Then, the concept of SRE (Site Reliability Engineering) began to be widely accepted: using software and automation to tame the operational complexity and stability of systems. Thanks to Kubernetes' standardization, scalability, and portability, more and more enterprise SRE teams manage their cloud environments on K8s, greatly improving both operational efficiency and resource efficiency.
On this basis, platform engineering teams began to emerge, building enterprise PaaS platforms and CI/CD processes on Kubernetes to support application deployment and operations for middleware and the different business units. This raises the enterprise's level of standardization and automation, and further improves the efficiency of application development and delivery.
In this layered structure, the lower a team sits, the more it is driven by SLOs, which makes the underlying dependencies more predictable for the systems above; the higher a team sits, the more it is driven by the business, the better to support it.
1. Best practices of the Alibaba Cloud Container Service SRE team

The Alibaba Cloud Container Service SRE team has long practiced CloudOps. Its best practices can be summarized as follows:
• Security by design: make the system secure by default, and guarantee full life cycle security through a secure software supply chain.
• Design for failure: control the blast radius, and provide rate limiting and degradation mechanisms to reduce the impact of faults. Define an SLO for each production application and build the corresponding observability system, keeping an eye on gold metrics such as request volume, latency, error count, and saturation.
The second is to build a stability emergency response system, also known as the 1-5-10 rapid recovery capability:
• 1-minute discovery: black-box and white-box monitoring capabilities
• 5-minute location: diagnostic dashboards and tool-assisted automated root cause analysis
• 10-minute mitigation: systematic contingency plan design and continuous accumulation, plus automated plan execution
The last item is routine security and stability work, which mainly includes:
• Change management: standardized processes, with every release able to be grayed, monitored, and rolled back
• Issue tracking with a closed loop: every issue is accounted for and followed through to resolution
• Normalized fault drills: use inspections, surprise drills, and stress testing to find and fill gaps, keeping fault contingency plans fresh
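The gold metrics and SLO checks described above can be made concrete with a small sketch. The record fields, the SLO threshold, and the nearest-rank p99 formula are invented for illustration:

```python
def gold_metrics(requests: list[dict], slo_latency_ms: float = 300) -> dict:
    """Compute three golden signals over a window of request records:
    traffic, error rate, and p99 latency, plus whether the latency SLO held.
    (Saturation would come from resource metrics, not request logs.)"""
    latencies = sorted(r["latency_ms"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    # p99 via nearest-rank, clamped to the last element:
    p99 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))]
    return {
        "traffic": len(requests),
        "error_rate": errors / len(requests),
        "p99_latency_ms": p99,
        "slo_met": p99 <= slo_latency_ms,
    }

# 98 fast successes plus two slow server errors in the window:
window = [{"latency_ms": 20 + i % 50, "status": 200} for i in range(98)]
window += [{"latency_ms": 900, "status": 500}] * 2
m = gold_metrics(window)
assert m["traffic"] == 100 and m["error_rate"] == 0.02
assert m["slo_met"] is False   # the slow tail blows the p99 SLO
```

An SLO check like `slo_met` is exactly the kind of signal that feeds the 1-minute-discovery stage: it turns raw request logs into a yes/no answer the alerting system can act on.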
2. Embrace the cloud native operations technology system

Cloud native has become an overwhelming technology trend. Gartner predicts that by 2025, 95% of digital operations will be supported by cloud native platforms.
Enterprises can choose the right path to the cloud according to their capabilities and business objectives, roughly divided into several stages:
• Rehost (new hosting): simply replace offline physical machines with cloud virtual machines or bare metal instances via lift-and-shift, without changing the original O&M model.
• Re-platform (new platform): replace self-built application infrastructure with managed cloud services, for example the RDS database service instead of self-built MySQL, or Alibaba Cloud Container Service ACK instead of a self-built K8s cluster. Managed cloud services typically offer better elasticity, stability, and autonomous operations, letting users focus on applications rather than infrastructure management.
• Refactor / Re-architect (new architecture): microservice transformation, containerization, and Serverless modernization of monolithic applications.
From Rehost and Re-platform to Re-architect, the complexity and skills required for migration increase, but so do the benefits in agility, elasticity, availability, and fault tolerance.
Alibaba Group has gone through the same journey. On the basis of moving 100% to the public cloud in 2020, its applications became 100% cloud native in 2021, improving Alibaba's R&D efficiency by 20% and resource utilization by 30%.
Finally, a quick summary. Built on cloud native technologies such as containers and Kubernetes, open community standards, immutable infrastructure, and declarative APIs will become the best practices of enterprise CloudOps, and will drive the construction of data-driven, intelligent systems on top of them, further reducing the complexity of operations so that enterprises can focus on their own business innovation. Alibaba Cloud will continue to share the capabilities accumulated in its super-large-scale cloud native practice and exploration, embracing the cloud native operations technology system together with more enterprises and developers.