Author | a senior technical expert at Alibaba Cloud; Source | the Alibaba Cloud Native official account

Articles in this series:

  • Part 1 – Cloud Native Infrastructure

  • Part 2 – Cloud Native Software Architecture

  • Part 3 – Cloud Native Application Delivery and Operations (this article)

2020 was a year of uncertainty, but also of opportunity. The COVID-19 pandemic accelerated the digital transformation of society as a whole. Cloud computing is no longer just a technology; it is critical infrastructure supporting the digital economy and business innovation. As enterprises reshape their IT with cloud computing, cloud native technology, which is born in the cloud, grows in the cloud, and maximizes the value of the cloud, has been recognized by more and more enterprises as an important means for IT to reduce costs and improve efficiency.

However, the cloud native revolution is not just about infrastructure and application architecture; it is also driving changes in enterprise IT organizations, processes, and culture.

According to the CNCF 2020 annual survey, 83% of organizations already run Kubernetes in production, but the top three challenges they face are complexity, cultural change, and security.

To accelerate business innovation and meet Internet-scale challenges, cloud native application architectures and development models have emerged. Compared with a traditional monolithic architecture, a distributed microservice architecture offers faster iteration, lower development complexity, and better scalability and elasticity. However, as in the Star Wars universe, the Force has both a light and a dark side: the complexity of deploying, operating, and managing microservice applications has increased dramatically, making DevOps culture, and the automated tooling and platform capabilities behind it, the key.

DevOps theory had been developing for years before container technology appeared, but organizational and cultural barriers cannot be broken down if the development and operations teams do not speak the same language or collaborate on the same technology. Docker container technology standardized the software delivery process: build once, deploy anywhere. Combining the programmable infrastructure of cloud computing with Kubernetes' declarative APIs enables automated continuous integration and continuous delivery of both applications and infrastructure through pipelines, greatly accelerating the convergence of the development and operations roles.

Cloud native also reshapes how teams divide responsibilities and deliver value. Transferring some duties, such as application configuration and release, from the traditional operations team to the development team reduces the labor cost of each release, while operations can focus more on system stability and IT governance. Site Reliability Engineering (SRE), advocated by Google, tackles operational complexity and system stability through software and automation. In addition, security and cost optimization have become focal points of operations on the cloud.

Security is one of the core concerns for enterprises moving to the cloud. The agility and dynamism of cloud native bring new security challenges. Since security on the cloud follows a shared responsibility model, enterprises need to understand the boundary of responsibility between themselves and their cloud provider, and consider how to solidify security best practices through tooling and automated processes. Moreover, traditional security architectures protect the perimeter with firewalls and fully trust every user and service inside it. Because of the sudden COVID-19 outbreak in 2020, large numbers of enterprises needed remote work and collaboration among employees and customers, and enterprise applications had to be deployed and interact across IDCs and the cloud. With physical security boundaries disappearing, cloud security is undergoing a profound transformation.

In addition, the COVID-19 pandemic has pushed enterprises to pay more attention to IT cost optimization. An important advantage of cloud native is its use of the cloud's elasticity to provision the computing resources a business needs on demand, avoiding waste and optimizing cost. However, unlike traditional cost budgeting and review, the dynamism of cloud native and high-density application deployment make IT cost management more complicated.

To this end, cloud native concepts and technologies keep evolving to help users continuously reduce potential risks and system complexity. Here is a look at some new trends in cloud native application delivery and operations.

Kubernetes has become the universal, unified cloud control plane

The word Kubernetes comes from the Greek for helmsman or navigator and shares a root with “cybernetics.” Kubernetes has become the de facto standard in container orchestration, thanks not only to Google’s halo and the efforts of the CNCF (Cloud Native Computing Foundation), but also to the experience and systematic thinking Google accumulated with Borg in large-scale distributed resource scheduling and automated operations. A careful study of the Kubernetes architecture helps in thinking through some essential problems of scheduling and managing distributed systems.

The core of the Kubernetes architecture is the controller loop, a typical “negative feedback” control system. When a controller observes that the desired state differs from the current state, it continuously adjusts resources to bring the current state closer to the desired state. For example, it can scale an application’s replicas out or in as the declared replica count changes, and automatically migrate applications when a node fails.
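
To make the pattern concrete, here is a minimal, self-contained Go sketch of such a reconcile loop. The ReplicaState type and its fields are illustrative stand-ins, not real Kubernetes API types; a real controller would act through the API server rather than mutating local state.

```go
// A minimal sketch of the controller pattern: a reconcile loop that
// repeatedly compares desired state with observed state and acts to
// close the gap.
package main

import (
	"fmt"
	"time"
)

// ReplicaState is a toy stand-in for a workload's spec (Desired)
// and status (Observed).
type ReplicaState struct {
	Desired  int
	Observed int
}

// reconcile nudges the observed state one step toward the desired state,
// e.g. by creating or deleting one replica per pass.
func reconcile(s *ReplicaState) {
	switch {
	case s.Observed < s.Desired:
		s.Observed++
		fmt.Printf("scaling out: %d/%d replicas\n", s.Observed, s.Desired)
	case s.Observed > s.Desired:
		s.Observed--
		fmt.Printf("scaling in: %d/%d replicas\n", s.Observed, s.Desired)
	default:
		fmt.Println("steady state: desired == observed")
	}
}

func main() {
	state := &ReplicaState{Desired: 3, Observed: 0}
	// A real loop runs forever ("level-triggered"); we stop once converged.
	for range time.Tick(100 * time.Millisecond) {
		reconcile(state)
		if state.Observed == state.Desired {
			break
		}
	}
}
```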

The success of K8s depends on three important architectural choices:

  • Declarative API: developers only define the desired state of abstract resources, and controllers figure out how to reach it. Abstractions for different types of workloads, such as Deployment, StatefulSet, and Job, let developers focus on the application itself rather than the details of the system implementation. Declarative APIs are a key cloud native design concept; this architectural approach pushes overall operational complexity down into the infrastructure, where it can be implemented and continuously optimized. Moreover, given the inherent stability challenges of distributed systems, a declarative, end-state-oriented, “level-triggered” implementation yields a more robust distributed system than an imperative, event-driven, “edge-triggered” API.

  • Masking the underlying implementation: through a series of abstractions such as LoadBalancer Services, Ingress, CNI, and CSI, K8s lets business applications consume infrastructure through business semantics, regardless of differences in the underlying implementations.

  • Extensible architecture: all K8s components are implemented and interact through a consistent, open API. Third-party developers can provide domain-specific extensions through mechanisms such as CRDs (Custom Resource Definitions) and Operators, greatly expanding the scenarios K8s can serve (a sketch of a custom resource follows this list).
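
As an illustration of that extension point, below is a sketch of how a custom resource is typically defined in Go, in the style popularized by kubebuilder and operator-sdk. The CronScaler type and all of its fields are hypothetical, invented for this example.

```go
// A sketch of a third-party Custom Resource definition in Go. CronScaler
// is a hypothetical resource, not a real project.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CronScalerSpec declares the desired state: scale a target workload
// to a given replica count on a schedule.
type CronScalerSpec struct {
	TargetDeployment string `json:"targetDeployment"`
	Schedule         string `json:"schedule"` // cron expression, e.g. "0 8 * * *"
	Replicas         int32  `json:"replicas"`
}

// CronScalerStatus records the observed state maintained by the controller.
type CronScalerStatus struct {
	LastScaleTime   *metav1.Time `json:"lastScaleTime,omitempty"`
	CurrentReplicas int32        `json:"currentReplicas"`
}

// CronScaler is the custom resource. Once its CRD is registered, users can
// `kubectl apply` CronScaler objects like any built-in resource, and a
// custom controller reconciles them toward the declared end state.
type CronScaler struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   CronScalerSpec   `json:"spec,omitempty"`
	Status CronScalerStatus `json:"status,omitempty"`
}
```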

As such, Kubernetes manages resources and infrastructure far beyond container applications. Here are a few examples:

  • Infrastructure management: unlike open source Terraform or the Infrastructure as Code (IaC) tools offered by cloud providers themselves, such as Alibaba Cloud ROS and AWS CloudFormation, Crossplane (crossplane.io) and AWS Controllers for Kubernetes build infrastructure management and abstraction on top of Kubernetes, so K8s applications and cloud infrastructure can be managed and changed in a consistent manner.

  • Virtual machine management: with KubeVirt, K8s can schedule and manage virtual machines and containers in a unified way. Virtualization can compensate for some limitations of container technology; for example, in CI/CD scenarios Windows virtual machines can be combined for automated testing.

  • IoT device management: Edge container technologies such as KubeEdge and OpenYurt provide the ability to manage a large number of edge devices.

  • K8s cluster management: the node pools and clusters of Alibaba Cloud Container Service for Kubernetes (ACK) are themselves managed and operated automatically by Kubernetes. ACK Infra supports tens of thousands of Kubernetes clusters deployed around the world, with K8s-based automated capacity expansion, fault detection, self-healing, and more.

1. Automatic workload upgrade

The controller ideal of “keeping the complexity to yourself and leaving the simplicity to others” is appealing, but implementing an efficient and robust controller is full of technical challenges:

  • The built-in K8s workloads cannot satisfy all the requirements of migrating enterprise applications, so extending them through the Operator framework has become a common solution. But reinventing the wheel for recurring needs wastes resources, and it also fragments the technology and reduces portability.

  • As more enterprise IT architectures move from running on Kubernetes to living in Kubernetes, the growing number of CRDs and custom controllers challenges the stability and performance of Kubernetes. End-state-oriented automation is a double-edged sword: it gives applications declarative deployment capabilities, but it can also amplify mistakes as the system converges on the declared end state. Mechanisms such as replica maintenance, version consistency, and cascading deletion can enlarge the blast radius of an operational failure.

OpenKruise is an open source cloud native application automation engine from Alibaba Cloud, currently hosted as a Sandbox project under the Cloud Native Computing Foundation (CNCF). It distills Alibaba’s years of containerization and cloud native experience: a set of standard Kubernetes extension components applied at scale in Alibaba’s internal production environment, together with technical concepts and best practices that closely follow upstream community standards while adapting them to Internet-scale scenarios. OpenKruise is built openly with the community; on the one hand it helps enterprise customers avoid detours, reduce technology fragmentation, and improve stability on their cloud native journey, and on the other hand it encourages the upstream community to gradually improve and enrich Kubernetes’ application lifecycle automation capabilities.

For more information, see “OpenKruise 2021 Exposure: More than Workloads.”

2. New collaboration interfaces between development and operations emerge

The rise of cloud native technology has also changed enterprise IT organizational structures. To respond better to the need for business agility, microservice application architectures spawned “two-pizza teams”: small, independent, self-contained development teams that can reach consensus faster and accelerate business innovation. SRE teams became horizontal support teams, underpinning the R&D efficiency and system stability of the teams above them. As Kubernetes developed, SRE teams began building their own enterprise application platforms on K8s, promoting standardization and automation and letting application development teams manage resources and application lifecycles in a self-service way. We are now seeing a further organizational shift, with new platform engineering teams emerging.

Reference: blog.getambassador.io/the-rise-of…

This is highly consistent with the positioning of K8s: Kubernetes is infrastructure for application operations and a “platform for platforms,” not an integrated application platform for developers. More and more enterprises will have platform engineering teams build their own PaaS on Kubernetes to improve both R&D and operations efficiency.

Classic PaaS implementations such as Cloud Foundry establish their own conceptual model, technical implementation, and extension mechanism, which simplifies the user experience but also brings problems: they cannot integrate with the rapidly developing Kubernetes ecosystem or fully embrace new technologies such as Serverless programming models and new computing services for AI and data analytics. Meanwhile, PaaS platforms built directly on K8s often lack unified architectural design and planning, resulting in fragmented technical implementations that are hard to sustain.

The Open Application Model (OAM) and its Kubernetes implementation, the KubeVela project, form a standard model and framework for cloud native application delivery and management, jointly launched by Alibaba Cloud, Microsoft, and the cloud native community. OAM aims to provide a unified, end-user-oriented application definition model for any cloud infrastructure, including Kubernetes; KubeVela is the PaaS reference implementation of this unified model on Kubernetes.

KubeVela/OAM provides Kubernetes-oriented service abstraction and assembly: workloads and operational traits with different implementations can be described uniformly, and a plug-in registration and discovery mechanism allows them to be assembled dynamically. Platform engineering teams can extend new functionality in a consistent manner while staying interoperable with new application frameworks on Kubernetes. For application development and operations teams, this separation of concerns decouples application definitions, operational capabilities, and infrastructure, making application delivery more efficient, reliable, and automated.

The industry is also exploring other directions for defining cloud native application models. For example, AWS’s newly released Proton is a service for cloud native application delivery: it reduces the complexity of deploying and operating containers and Serverless workloads, and combined with GitOps it can automate and manage the entire application delivery process.

Knative, supported on Alibaba Cloud Serverless Kubernetes, supports both Serverless containers and functions for event-driven applications, letting developers use a single programming model while the platform chooses among different underlying Serverless compute options for optimal execution.

Ubiquitous security risks drive security architecture change

1. DevSecOps becomes a key factor

The combination of agile development and programmable cloud infrastructure greatly improves the efficiency of enterprise application delivery. But if security risk control is neglected in this process, the losses can be huge. Gartner concludes that by 2025, 99 percent of security breaches in cloud infrastructure will stem from user misconfiguration and mismanagement.

In the traditional software development process, security staff step in only after design and development are complete, to audit the system before release and delivery. That process cannot keep up with rapid business iteration. “Shifting left on security” is therefore gaining attention: security teams collaborate with designers and developers from the earliest stages and embed security practices seamlessly into the process. Shifting security left not only reduces risk but also cuts remediation costs: IBM researchers found that fixing a security issue during design costs roughly one-sixth of what it costs during code development, and roughly one-fifteenth of what it costs during testing.

The DevOps collaboration process has accordingly expanded into DevSecOps. First, it is a change of philosophy and culture: security becomes everyone’s responsibility, not just that of a dedicated security team. Second, security problems are tackled as early as possible, moving security into the software design stage and reducing overall security management costs. Finally, risk prevention, continuous monitoring, and timely response are achieved through automated toolchains rather than manual governance.

The technical prerequisite for DevSecOps is a verifiable, reproducible build and deployment process, which lets us continuously verify and improve the security of the architecture across environments such as test, staging, and production. DevSecOps can be implemented by combining cloud native immutable infrastructure with declarative policy management (Policy as Code). The following diagram shows a simplified DevSecOps pipeline for container applications.

When code is committed, Alibaba Cloud Container Registry (ACR) can automatically scan the application image and sign it. When a Container Service K8s cluster deploys the application, a security policy can verify the image signature and reject images that fail verification. Similarly, if we change the infrastructure in the form of Infrastructure as Code, a scanning engine can assess the risk before the change is applied, and terminate it and raise an alarm if a security risk is found.
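
Such an admission-time check can itself be expressed as policy as code. Below is a minimal Go sketch of a Kubernetes validating admission webhook that rejects Pods whose images are not from a trusted registry; the registry prefix, TLS file paths, and port are assumptions for this example, and production setups more commonly use policy engines such as OPA/Gatekeeper or Kyverno.

```go
// A minimal "Policy as Code" sketch: a validating webhook that rejects
// Pods pulling images from outside a trusted registry.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"strings"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const trustedPrefix = "registry.example.com/" // hypothetical trusted registry

func validate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed admission review", http.StatusBadRequest)
		return
	}

	var pod corev1.Pod
	_ = json.Unmarshal(review.Request.Object.Raw, &pod)

	allowed, reason := true, ""
	for _, c := range pod.Spec.Containers {
		if !strings.HasPrefix(c.Image, trustedPrefix) {
			allowed = false
			reason = fmt.Sprintf("image %q is not from the trusted registry", c.Image)
			break
		}
	}

	review.Response = &admissionv1.AdmissionResponse{
		UID:     review.Request.UID,
		Allowed: allowed,
		Result:  &metav1.Status{Message: reason},
	}
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate", validate)
	// Admission webhooks must be served over TLS; cert paths are placeholders.
	log.Fatal(http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil))
}
```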

Furthermore, once the application is deployed to production, any change must go through the same automated process, minimizing security risks caused by human misconfiguration. Gartner predicts that by 2025, 60% of enterprises will have adopted DevSecOps and immutable infrastructure practices, suffering 70% fewer security incidents than in 2020.

2. Service mesh accelerates the adoption of zero-trust security architecture

Distributed microservice applications are more complex to deploy, operate, and manage, and their security attack surface is also larger. In a traditional three-tier architecture, protection focuses on north-south traffic, while in a microservice architecture protecting east-west traffic is far more challenging. With traditional perimeter protection, if an application is compromised through a security flaw, no control mechanism stops the internal threat from “moving laterally.”

Reference: www.nist.gov/blogs/takin…

“Zero trust” was first proposed by Forrester around 2010. Simply put, zero trust assumes that any threat is possible: no user, device, or application inside or outside the network is trusted by default, and the foundation of access control must be rebuilt on authentication and authorization. It guides the security architecture from “network-centric” to “identity-centric,” replacing trust in the traditional network perimeter with micro-perimeter protection.

Google has been pushing cloud native security and zero-trust architectures, for example through its BeyondProd methodology. Alibaba and Ant Group likewise began introducing zero-trust concepts and practices as they moved to the cloud. The keys are:

  • A unified identity system: every service component in the microservice architecture gets its own independent identity.

  • A unified access authorization model: inter-service calls must be authenticated and authorized based on identity.

  • Unified access control policies: access control for all services is defined in a standardized way and managed and enforced centrally.

Security architecture is a cross-cutting concern that touches every component of the IT architecture. If it is coupled to a specific microservice framework, any adjustment to the security architecture forces every application service to be recompiled and redeployed, and service implementers can bypass the security architecture altogether. A service mesh, by contrast, provides a loosely coupled, distributed zero-trust security architecture that is independent of the application implementation.

Here is the security architecture of the Istio service mesh:

In this architecture:

  • Existing identity services can provide identity identifiers, and SPIFFE-format identities are also supported; identifiers can be conveyed in X.509 certificates or JWTs.

  • Security policies such as authentication, authorization, and service naming are managed uniformly through the service mesh control plane API.

  • Security policies are enforced by Envoy sidecars or edge proxies acting as Policy Enforcement Points (PEPs), providing access control for both east-west and north-south traffic. The sidecar acts as an application-level firewall for each microservice, and network micro-segmentation minimizes the attack surface (see the sketch after this list).
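
The following Go sketch shows the shape of identity-centric access control: extract the SPIFFE ID from a peer’s X.509 certificate and check it against an allow-list. In Istio this check runs inside the Envoy sidecar (the PEP); the identities used below are hypothetical.

```go
// A small sketch of zero-trust, identity-based authorization using
// SPIFFE IDs carried as URI SANs in X.509 certificates.
package main

import (
	"crypto/x509"
	"fmt"
)

// allowedCallers maps SPIFFE identities to whether they may call us.
var allowedCallers = map[string]bool{
	"spiffe://cluster.local/ns/orders/sa/frontend": true, // hypothetical caller
}

// spiffeID returns the SPIFFE identity carried as a URI SAN, if any.
func spiffeID(cert *x509.Certificate) string {
	for _, u := range cert.URIs {
		if u.Scheme == "spiffe" {
			return u.String()
		}
	}
	return ""
}

// authorize implements the policy decision: unknown identities are
// denied, which is the zero-trust default.
func authorize(cert *x509.Certificate) error {
	if id := spiffeID(cert); !allowedCallers[id] {
		return fmt.Errorf("caller %q is not authorized", id)
	}
	return nil
}

func main() {
	// In a real mTLS handshake the certificate comes from the TLS peer;
	// an empty certificate here demonstrates the deny-by-default behavior.
	fmt.Println(authorize(&x509.Certificate{}))
}
```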

The service mesh decouples the network security architecture from applications, allowing the two to evolve and be managed independently and improving security compliance. Furthermore, telemetry from service invocations enables data-driven, intelligent risk analysis and automated defense of inter-service traffic. Cloud native zero-trust security is still in its early days, and we expect more security capabilities to sink into the infrastructure over time.

A new generation of software delivery is emerging

1. From Infrastructure as Code to Everything as Code

Infrastructure as Code (IaC) is a typical application of declarative APIs; it changes how enterprise IT architectures on the cloud are managed, configured, and orchestrated. IaC tools let us fully automate the creation, configuration, and assembly of cloud resources such as servers, networks, and databases.

We can extend the IaC concept to the entire cloud native software delivery and operations process: Everything as Code. The figure below covers the various models in an application environment, from infrastructure, to application model definitions, to delivery workflows and security systems; all application configurations can be created, managed, and changed declaratively.

In this way, we can provide flexible, robust, automated full lifecycle management capabilities for distributed cloud native applications:

  • All configurations are versioned, traceable, and auditable.
  • All configurations are maintainable, testable, understandable, and easy to collaborate on.
  • All configurations can be statically analyzed to ensure the predictability of changes.
  • All configurations can be reproduced across environments, and all differences between environments must be made explicit, improving consistency.

2. Declarative CI/CD practices are gaining attention

Going further, all of an application’s environment configuration can be managed through source control, with delivery and changes driven toward the declared end state by an automated process. This is the core idea of GitOps.

GitOps was originally proposed by Alexis Richardson of Weaveworks, with the goal of providing a uniform set of best practices for deploying, managing, and monitoring applications. In GitOps, all environment information, from application definitions to infrastructure configuration, is treated as source code and versioned in Git, and all releases, approvals, and changes are recorded in Git’s history. This makes Git the single source of truth: we can trace historical changes efficiently and easily roll back to a specified version. Combining GitOps with the declarative APIs and immutable infrastructure advocated by Kubernetes ensures that the same configuration is reproducible and avoids the unpredictable stability risks that configuration drift causes in production.
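
Conceptually, a GitOps agent is itself a reconcile loop whose desired state comes from Git. The toy Go sketch below, with stubbed-out fetch functions standing in for git and Kubernetes API calls, shows only the shape of the idea; real agents such as Flux or Argo CD do this robustly.

```go
// A toy sketch of the GitOps model: desired state lives in Git, an agent
// pulls it periodically, and drift between declared and live state is
// detected and corrected.
package main

import (
	"crypto/sha256"
	"fmt"
)

// fetchDesired would clone/pull the config repo and render manifests.
func fetchDesired() string { return "replicas: 3\nimage: app:v2\n" }

// fetchLive would read the corresponding live objects from the API server.
func fetchLive() string { return "replicas: 3\nimage: app:v1\n" }

func digest(s string) [32]byte { return sha256.Sum256([]byte(s)) }

func main() {
	desired, live := fetchDesired(), fetchLive()
	if digest(desired) != digest(live) {
		// A real agent would apply the Git version here, making Git the
		// single source of truth and reverting any out-of-band change.
		fmt.Println("drift detected: applying state from Git")
	} else {
		fmt.Println("cluster matches Git: nothing to do")
	}
}
```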

Combined with the DevSecOps automation described above, this gives us consistent test and staging environments before the business goes live, so stability risks are caught earlier and faster, and canary and rollback measures can be verified more fully.

GitOps thus improves delivery efficiency, the developer experience, and the stability of distributed application delivery.

GitOps has been widely used within Alibaba Group and Ant Group over the past two years and has become a standardized delivery method for cloud native applications. GitOps is still in its early days, and the open source community is still refining its tools and best practices. In 2020, Weaveworks’ Flagger project was merged into Flux, enabling developers to implement progressive delivery strategies such as canary releases, blue-green deployments, and A/B testing through GitOps, controlling the blast radius of a release and improving release stability. At the end of 2020, the CNCF Application Delivery SIG officially announced the creation of the GitOps Working Group; we expect the community to further advance standardization and technology adoption in this area.

3. Operations evolve from standardized and automated to data-driven and intelligent

As microservice applications grow in scale, the complexity of diagnosing problems and optimizing performance explodes. Enterprises already have a variety of IT service management tools, such as log analysis, performance monitoring, and configuration management, but these systems are data islands that cannot provide the end-to-end visibility needed to diagnose complex problems. Many existing tools also take a rules-based approach to monitoring and alerting; in an increasingly complex and dynamic cloud native environment, rules are too brittle, costly to maintain, and hard to scale.

AIOps applies big data analytics, machine learning, and related technologies to automate IT operations. By processing and analyzing massive volumes of logs, performance data, and system environment configuration, AIOps gains visibility into the internal and external dependencies of IT systems, sharpens foresight and insight into problems, and moves operations toward autonomy.

Thanks to the cloud native technology ecosystem, AIOps and technologies such as Kubernetes will reinforce each other, further improving enterprise IT cost optimization, fault detection, and cluster optimization. Several important enablers:

  • Standardization of observability: with cloud native community projects such as Prometheus, OpenTelemetry, and OpenMetrics, observability is being further standardized and consolidated across logging, monitoring, and distributed tracing, enriching the data sets available for multi-metric and root-cause analysis. The non-intrusive telemetry of a service mesh yields richer business metrics without modifying existing applications, improving the accuracy and coverage of AIOps (see the instrumentation sketch after this list).

  • Standardization of application delivery and management: the Kubernetes declarative API and end-state-oriented delivery provide a more consistent management and operations experience, and the service mesh’s non-intrusive traffic management lets applications be managed and operated transparently.
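
As a small illustration of that standardization, here is a minimal Go program using the vendor-neutral OpenTelemetry API: code instrumented once this way can export traces to any compatible backend. The service and span names are made up for the example, and no exporter is configured, so the tracer is a no-op.

```go
// Minimal vendor-neutral instrumentation with the OpenTelemetry Go API.
// Wiring up an SDK and exporter would send these spans to any
// OpenTelemetry-compatible backend.
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

func main() {
	// Tracer returns a tracer from the globally registered provider.
	tracer := otel.Tracer("checkout-service") // hypothetical service name

	ctx, span := tracer.Start(context.Background(), "process-order")
	defer span.End()

	_ = ctx // pass ctx onward so downstream spans join the same trace
}
```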

“Cloud Effect,” Alibaba Group’s DevOps platform, combined with the container platform’s release and change system, enables “unattended releases”: during a release, the system continuously collects system, log, and business metrics and uses algorithms to compare them before and after the change. Once a problem is identified, the release can be blocked or even rolled back automatically. With this capability, any development team can release safely without fearing a major failure caused by an online change.
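
The core idea can be sketched in a few lines. The toy Go program below, whose metric values and threshold are entirely illustrative, compares the mean error rate before and after a rollout and blocks the release on regression; real systems track many more metrics and use proper statistical tests.

```go
// A toy sketch of the "unattended release" idea: compare a key metric
// before and after a rollout and block the release if it degrades
// beyond a threshold.
package main

import "fmt"

func mean(xs []float64) float64 {
	var sum float64
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	before := []float64{0.8, 1.1, 0.9, 1.0} // error rate (%) pre-release
	after := []float64{2.9, 3.4, 3.1, 3.3}  // error rate (%) post-release

	const maxDegradation = 1.5 // block if the mean worsens by more than 1.5x
	if mean(after) > mean(before)*maxDegradation {
		fmt.Println("regression detected: blocking release and rolling back")
	} else {
		fmt.Println("metrics healthy: promoting release")
	}
}
```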

Cloud native cost optimization is getting more and more attention

As enterprises move more core business from data centers to the cloud, they urgently need budgeting, cost accounting, and cost optimization for cloud environments. Moving from a fixed financial cost model to a variable, pay-as-you-go cloud financial model is an important conceptual and technical shift. Most enterprises, however, do not yet have clear concepts or technical approaches for cloud financial management: in the FinOps 2020 survey, nearly half of respondents (49%) had little or no automation for managing cloud spending. The FinOps discipline has become popular to help organizations better understand cloud costs and IT benefits.

FinOps is a cloud financial management approach and a transformation of the enterprise IT operating model, with the goal of improving organizations’ understanding of cloud costs so they can make better decisions. In August 2020, the Linux Foundation announced the creation of the FinOps Foundation to advance the discipline of cloud financial management through best practices, education, and standards. Cloud vendors are gradually strengthening their FinOps support to help enterprise financial processes adapt to the variability and dynamism of cloud resources; for example, AWS Cost Explorer and the Alibaba Cloud Expense Center help enterprises analyze and allocate costs. See: developer.aliyun.com/article/772… .

More and more enterprises use Kubernetes to manage infrastructure resources in the cloud, increasing deployment density and application elasticity with containers and reducing overall computing costs. But the dynamism of Kubernetes introduces new complexity for resource metering and cost allocation: multiple containers can be dynamically deployed on the same virtual machine instance and scaled elastically on demand, so underlying cloud resources can no longer be matched one-to-one to container applications. In November 2020, the CNCF and the FinOps Foundation released a white paper on Kubernetes cloud financial management, “FinOps for Kubernetes: Unpacking Container Cost Allocation and Optimization,” to help readers better understand the relevant practices.
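
As a first approximation, cost allocation on Kubernetes can start from resource requests rather than machines. The Go sketch below uses client-go (assuming a local kubeconfig) to sum the CPU requests of all Pods by namespace and applies a hypothetical price per core-hour; real FinOps tooling also accounts for memory, storage, and actual node pricing.

```go
// A first-approximation sketch of per-namespace cost allocation:
// sum CPU requests of all Pods by namespace and apply an assumed
// price per core-hour.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const pricePerCoreHour = 0.04 // hypothetical $/core/hour

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	cores := map[string]float64{} // namespace -> total requested CPU cores
	for _, p := range pods.Items {
		for _, c := range p.Spec.Containers {
			if cpu, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
				cores[p.Namespace] += cpu.AsApproximateFloat64()
			}
		}
	}
	for ns, c := range cores {
		fmt.Printf("%-24s %6.2f cores  ~$%.2f/hour\n", ns, c, c*pricePerCoreHour)
	}
}
```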

Alibaba Cloud Container Service also builds many cost management and optimization best practices into the product. Many customers care deeply about how to optimize cost with Kubernetes and resource elasticity. In general, we recommend that enterprises understand their business types, divide the K8s cluster into different node pools, and strike a balance among cost, stability, performance, and other dimensions:

  • Daily services: for predictable, relatively constant loads, bare metal instances or large VMs improve resource utilization and reduce costs.

  • Planned short-term or periodic business: for short business peaks such as the Double Eleven promotion or New Year’s Eve events, or for periodic loads such as end-of-month settlement, virtual machines or elastic container instances can absorb the peak.

  • Unexpected elastic business: for example, breaking news or ad-hoc computing tasks. Elastic container instances can easily scale out by thousands of instances per minute.

For more information about Kubernetes planning, please refer to “The Soul of Kubernetes Planning”.

Conclusion

Over the past decade, the convergence of cloud infrastructure, upgraded Internet application architectures, and agile R&D processes, combined with technology innovations such as containers, Serverless, and service mesh, has given birth to and driven the cloud native movement. Cloud native is redefining computing infrastructure, application architecture, and organizational processes, and writing a new chapter in the history of cloud computing. Thanks to all fellow travelers in the cloud native era; let’s explore and define the future of cloud native together.

A post-credits Easter egg: the titles of the three articles in this series pay homage to the Star Wars franchise. Did you spot it?

We are hiring

The Alibaba Cloud Container Service team is hiring! Internal transfers and referrals are welcome; let’s create the passionate future of cloud native together! There are openings in Hangzhou, Beijing, and Shenzhen. Send your resume to: [email protected].

Click to download the free e-book “Cloud Native Large-scale Application Implementation Guide” and learn in one volume how Alibaba’s Double 11 core systems went cloud native, from technical system upgrades, to capability breakthroughs, to business practice!