Introduction

As Kubernetes becomes increasingly popular and increasingly complex in production environments, ensuring stability becomes an ever more serious challenge.

Building a comprehensive, in-depth observability architecture and system is one of the key factors in improving system stability. ACK distills observability best practices into Alibaba Cloud product capabilities and exposes them to users, turning observability tools and services into infrastructure that enables and helps users adopt product features and improves the stability of their Kubernetes clusters and their overall experience.

This article introduces how to build an observability system for Kubernetes, along with best practices for doing so based on Alibaba Cloud products.

Observability architecture for Kubernetes systems

The observability challenges of the Kubernetes system include:

  • **Complexity of the K8s system architecture.** The system consists of a control plane and a data plane, each containing multiple components that communicate with each other. The control plane and the data plane are bridged and aggregated through kube-apiserver.
  • **Dynamics.** Resources such as Pods and Services are created dynamically and assigned IP addresses; when Pods are rebuilt they are assigned new resources and IPs, so monitoring targets have to be obtained through dynamic service discovery.
  • **Microservices architecture.** Applications are decomposed into multiple components following the microservices architecture, and the number of replicas of each component can be adjusted automatically or manually by the elasticity mechanism.

Given these observability challenges, and especially as cluster scale grows rapidly, an efficient and reliable observability system for Kubernetes is the cornerstone of system stability.

So how can we improve the observability of a Kubernetes system in a production environment?

Observability solutions for the Kubernetes system include metrics, logs, distributed tracing, K8s events, the NPD framework, and so on. Each approach provides insight into the state and data of the Kubernetes system from a different dimension. In production we usually need to combine several of these approaches to form a complete, multi-dimensional observability system, improve coverage of the various scenarios, and thereby improve the overall stability of the Kubernetes system. The observability solutions for a K8s system in production are outlined below.

Metrics

Prometheus is an open source system monitoring and alerting framework inspired by Google's Borgmon monitoring system, and it is the industry's de facto standard for metrics collection. Prometheus was created at SoundCloud in 2012 by former Google employees and developed as a community open source project. The project was officially released in 2015, and in 2016 Prometheus joined the Cloud Native Computing Foundation (CNCF).

Prometheus has the following features:

  • A multidimensional data model (time series identified by metric name and key/value label pairs)
  • PromQL, a flexible query and aggregation language
  • Local storage as well as support for remote/distributed storage
  • Time series data collected through an HTTP-based pull model
  • Push supported via Pushgateway (an optional Prometheus intermediary)
  • Targets discovered through dynamic service discovery or static configuration
  • Support for a variety of charts and dashboards

Prometheus periodically pulls metric data from the /metrics HTTP(S) endpoints exposed by components and stores it in its TSDB for PromQL-based query and aggregation.

Metrics in the Kubernetes scenario can be classified from the following perspectives:

1. Basic container resource metrics

The collection source is the cAdvisor built into the kubelet, which provides metrics related to container memory, CPU, network, and file system. Examples include:

  • container_memory_usage_bytes: current memory usage of the container;
  • container_network_receive_bytes_total: number of bytes received over the container network;
  • container_network_transmit_bytes_total: number of bytes sent over the container network; and so on.
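As a minimal sketch of how such collected metrics are consumed, the following Go program queries one of the cAdvisor metrics above through the Prometheus HTTP API using the prometheus/client_golang API package (the server address and the exact label set are assumptions for illustration):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed Prometheus server address; replace with your own endpoint.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Aggregate current container memory usage per pod from the cAdvisor metric.
	query := `sum(container_memory_usage_bytes{container!=""}) by (pod)`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```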

2. Kubernetes node resource metrics

The collection source is node_exporter, which provides node system and hardware metrics, for example node_memory_MemTotal_bytes, node_filesystem_size_bytes, and the network interface metric node_network_iface_id. Based on these metrics you can derive node-level indicators such as CPU, memory, and disk usage.

3. Kubernetes resource metrics

The collection source is kube-state-metrics, which generates metrics from Kubernetes API objects and provides K8s cluster resource metrics for types such as Node, ConfigMap, Deployment, and DaemonSet. Node-type metrics include the node condition metric kube_node_status_condition, the node information metric kube_node_info, node Ready status, and so on.

4. Kubernetes component metrics

Kubernetes system component metrics, for example from kube-controller-manager, kube-apiserver, kube-scheduler, kubelet, kube-proxy, CoreDNS, and so on.

Kubernetes O&M component metrics. Observability components include blackbox_operator, which implements user-defined probing rules, and gpu_exporter, which exposes GPU resource sharing capabilities.

Kubernetes business application metrics. These are metrics specific to the business Pods, exposed on the /metrics path for external query and aggregation.
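As a minimal sketch (assuming the prometheus/client_golang library and a hypothetical order-processing service), a business application can expose such custom metrics on /metrics like this:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical business metric: processed orders, labeled by result.
var ordersProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "myapp_orders_processed_total",
	Help: "Total number of processed orders.",
}, []string{"status"})

func handleOrder(w http.ResponseWriter, r *http.Request) {
	// ... business logic would go here ...
	ordersProcessed.WithLabelValues("success").Inc()
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/order", handleOrder)
	// Expose metrics on /metrics so Prometheus can scrape this Pod.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```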

In addition to the above metrics, K8s provides standards for exposing metrics to external consumers through its API, namely Resource Metrics, Custom Metrics, and External Metrics.

The Resource Metrics class corresponds to the interface metrics.k8s.io. Its main implementation is metrics-server, which provides resource monitoring at the node, Pod, and namespace levels. These metrics can be accessed directly via kubectl top or consumed by K8s controllers such as the HPA (Horizontal Pod Autoscaler). The system architecture and access links are as follows:
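For illustration, a small Go sketch using the metrics client from k8s.io/metrics reads the same node data that kubectl top node gets from the metrics.k8s.io API (the kubeconfig path is an assumption; an in-cluster config would work as well):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	mc, err := metricsclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Node-level usage served by metrics-server through metrics.k8s.io.
	nodes, err := mc.MetricsV1beta1().NodeMetricses().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		fmt.Printf("%s cpu=%s memory=%s\n", n.Name, n.Usage.Cpu().String(), n.Usage.Memory().String())
	}
}
```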

Custom Metrics corresponds to the API custom.metrics.k8s.io, whose main implementation is Prometheus. It provides both resource monitoring and custom monitoring; the resource monitoring part overlaps with what is described above, while custom monitoring means, for example, that an application wants to expose something like the number of users online, or the number of slow queries against the MySQL database behind it. These can be defined at the application layer and exposed through the standard Prometheus client, and are then scraped by Prometheus.

Once collected, such metrics can be consumed through the custom.metrics.k8s.io standard; that is, after Prometheus has collected them, the custom.metrics.k8s.io interface can be used for HPA and other data consumption. The system architecture and access links are as follows:

External Metrics. K8s has become an implementation standard for cloud-native interfaces, and in the cloud you are often dealing with cloud services, for example an application with a message queue in front of it and an RDS database behind it. When consuming data you sometimes need the monitoring metrics of these cloud products, such as the number of messages in the message queue, the number of connections on the SLB access layer, or the number of HTTP 200 requests at the SLB layer.

So how are these consumed? The standard defined in K8s is external.metrics.k8s.io, and the main implementers are the providers of the various cloud vendors, through which the monitoring metrics of cloud resources can be exposed. On Alibaba Cloud, the Alibaba Cloud Metrics Adapter provides an implementation of this external.metrics.k8s.io standard.

Logging

In the K8s scenario, logs mainly include:

  • Host kernel logs. Host kernel logs help developers diagnose network stack anomalies, driver anomalies, file system anomalies, and node (kernel) stability issues.
  • Runtime logs. The most common runtime is Docker; Docker logs can be used to troubleshoot problems such as Pod deletion hangs.
  • K8s component logs. APIServer logs can be used for auditing, Scheduler logs for diagnosing scheduling, etcd logs for inspecting storage state, and Ingress logs for analyzing access-layer traffic.
  • Application logs. Application log analysis can be used to view the state of the business layer and diagnose exceptions.

Log collection methods fall into passive collection and active push. In K8s, passive collection generally uses the Sidecar or DaemonSet mode, while active push includes DockerEngine push and direct write from the service.

  • The DockerEngine itself has a LogDriver feature: by configuring different LogDrivers, the container's stdout can be written to remote storage through the DockerEngine to collect logs. This approach offers low customizability, flexibility, and resource isolation, and is generally not recommended for production environments.
  • Direct write from the service integrates a log-collection SDK into the application and sends logs to the server through the SDK. This removes the logic of collecting from disk and requires no additional agent deployment, so it has the lowest resource consumption for the system; however, because the service is strongly coupled to the log SDK, overall flexibility is low.
  • DaemonSet mode runs only one logging agent on each node to collect all logs of that node. It consumes far fewer resources, but its scalability and tenant isolation are limited, making it more suitable for clusters with a single purpose or few businesses.
  • In Sidecar mode, a separate log agent is deployed for each Pod, and this agent collects the logs of only that one application. Sidecar consumes more resources, but offers strong flexibility and multi-tenant isolation; this mode is recommended for large K8s clusters or clusters serving multiple business parties as a PaaS platform.

The corresponding collection methods are host file (mount) collection, stdout/stderr collection, and Sidecar collection.

To sum up:

  • DockerEngine direct writing is not recommended;
  • Service direct write is recommended for scenarios with a large amount of logs.
  • DaemonSet is commonly used in small and medium clusters;
  • Sidecar is recommended for use in very large clusters.

Event

Event monitoring is a monitoring method well suited to the Kubernetes scenario. An event includes the time of occurrence, the component, the level (Normal or Warning), the type, and details. From events you can learn about the entire life cycle of an application, such as deployment, scheduling, running, and stopping, and you can also learn about anomalies that are occurring in the system.

One of the design concepts of K8s is state transition based on a state machine. A Normal event is generated when one normal state changes to another normal state, and a Warning event is generated when a normal state changes to an abnormal state. Usually the Warning events are the ones we care about. Event monitoring aggregates the Normal and Warning events to a data center, which analyzes them and raises alarms, exposing the corresponding anomalies through DingTalk, SMS, email, and other channels, thereby supplementing and completing the other monitoring methods.
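As a minimal sketch of consuming these events (assuming client-go and a kubeconfig at the default location), the Warning events of a cluster can be listed like this; long-term storage and alerting would be handled by tooling such as kube-eventer:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// List Warning events across all namespaces.
	events, err := cs.CoreV1().Events("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "type=Warning",
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s %s/%s %s: %s\n",
			e.LastTimestamp.Time, e.Namespace, e.InvolvedObject.Name, e.Reason, e.Message)
	}
}
```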

Events in Kubernetes are stored in etcd for only one hour by default, which does not allow analysis over longer time ranges. After storing events long-term and doing some custom development, richer analysis and alerting become possible:

  • Alert in real time on abnormal events in the system, such as Failed, Evicted, FailedMount, and FailedScheduling.
  • Troubleshooting often requires historical data, so events over a longer period (days or even months) must be queryable.
  • Support categorized statistics on events, such as computing event trends and comparisons with previous periods (yesterday/last week/before a release), to support judgment and decisions based on statistical indicators.
  • Support filtering and screening by various dimensions for different personnel.
  • Support custom subscriptions to these events to implement custom monitoring and integrate with the company's internal operations and deployment platform.

NPD (Node Problem Detector) framework

The stability of a Kubernetes cluster and the containers it runs depends strongly on the stability of the nodes. The components in Kubernetes focus mainly on container-management issues and do not provide much detection of hardware, the operating system, the container runtime, or dependent systems (network, storage, and so on). The Node Problem Detector (NPD) provides a diagnostic and inspection framework for node stability: on top of its default check policies it can be flexibly extended, and it converts node anomalies into node events and pushes them to the APIServer, where they are managed alongside other events.

NPD supports a variety of exception checks, such as:

  • Basic service problems: the NTP service is not started
  • Hardware problems: the CPU, memory, disk, or NIC is damaged
  • Kernel problems: kernel hang, damaged file system
  • Container runtime problems: Docker hang, Docker fails to start
  • Resource problems: OOM, etc.

To summarize, this chapter has covered the common Kubernetes observability solutions. In production we often need to use a combination of them to form a multi-dimensional, mutually complementary observability system. Once the solutions are deployed, their output must allow anomalies and errors to be diagnosed quickly, effectively reduce the false-positive rate, and support storing, querying, and analyzing historical data. Going further, the data can be fed to machine learning and AI frameworks to enable advanced scenarios such as elastic prediction, anomaly diagnosis and analysis, and intelligent AIOps operations.

This relies on observability best practices, including how to design, deploy (as plug-ins), configure, and upgrade the observability solution architectures described above, and how to quickly and accurately diagnose and analyze root causes based on their output. Alibaba Cloud Container Service for Kubernetes (ACK) and the related cloud products (the monitoring service ARMS, the log service SLS, and so on) productize the cloud vendor's best practices and empower users, providing comprehensive solutions that let users quickly deploy, configure, upgrade, and master Alibaba Cloud's observability offerings, significantly improving the efficiency and stability of enterprises adopting and using the cloud while lowering the technical barrier and overall cost.

The following takes ACK's latest product form, ACK Pro, as an example to introduce ACK's observability solutions and best practices in conjunction with the related cloud products.

ACK observability capability

Observability scheme

For metrics observability, ACK supports two solutions: open source Prometheus monitoring and Alibaba Cloud Prometheus monitoring (Alibaba Cloud Prometheus monitoring is a sub-product of ARMS).

Open source Prometheus monitoring is provided as a Helm chart, adapted to the Alibaba Cloud environment and integrated with features such as DingTalk alerting and storage. The deployment entry is the application catalog in the ACK console; users only need to configure the Helm chart parameters in the ACK console to customize the deployment.

Alibaba Cloud Prometheus monitoring is a sub-product of ARMS. Application Real-Time Monitoring Service (ARMS) is an application performance management product that includes frontend monitoring, application monitoring, and Prometheus monitoring.

In Gartner's 2021 APM Magic Quadrant evaluation, Alibaba Cloud Application Real-Time Monitoring Service (ARMS), as the core of Alibaba Cloud's APM offering, participated jointly with Cloud Monitor and Log Service. Gartner's comments on Alibaba Cloud APM:

  • Strongest presence in China: Alibaba Cloud is the largest cloud service provider in China, and Alibaba Cloud users can meet their observability needs with its on-cloud monitoring tools.
  • Open source integration: Alibaba Cloud attaches great importance to integrating open source standards and products (such as Prometheus) into its platform.
  • Cost advantage: compared with using third-party APM products on Alibaba Cloud, Alibaba Cloud's APM products are more cost-effective.

The following figure summarizes the module division and data links of open source Prometheus and Alibaba Cloud Prometheus.

ACK supports K8s observability for CoreDNS, cluster nodes, the cluster overview, and more; in addition, ACK Pro supports observability of the managed control-plane components kube-apiserver, kube-scheduler, and etcd, with continuous iteration. Through the rich monitoring dashboards in Prometheus, users can quickly discover system problems and potential risks in the K8s cluster and take corresponding measures to ensure cluster stability. The monitoring dashboards integrate ACK best practices to help users analyze and locate problems from multiple dimensions. The following describes how to design observability dashboards based on best practices, together with concrete cases of using the dashboards to locate problems, to help you understand how to use these observability capabilities.

First, let's look at the observability capabilities of ACK Pro. The entry point to the monitoring dashboards is shown below:

The APIServer is one of the core components of K8s and the hub through which the K8s components interact. The monitoring design of the ACK Pro APIServer dashboard lets users select the APIServer Pod to be monitored and analyze single metrics, aggregated metrics, request sources, and so on; it also allows drilling down into one or more API resources to observe APIServer metrics. The advantage is that you can observe the global view of all APIServer Pods as well as drill down into specific APIServer Pods and specific API resources; this combination of global and local observation is very effective for locating problems. According to ACK best practices, the implementation consists of the following five modules:

  • Filter boxes are provided for APIServer Pods, API resources (Pods, Nodes, ConfigMaps, etc.), quantiles (0.99, 0.9, 0.5), and statistical intervals; users drive the monitoring dashboards through these linked filters
  • Key metrics are highlighted so that critical system states can be identified
  • Dashboards of single metrics such as APIServer RT and QPS, for observing single-dimension metrics
  • Dashboards of aggregated metrics such as APIServer RT and QPS, for observing multi-dimension metrics
  • Client source analysis of APIServer access

The implementation of each module is outlined below.

Key metrics

The core metrics shown include the total QPS of the APIServer, the read-request success rate, the write-request success rate, read inflight requests, mutating inflight requests, and the dropped requests rate.

These metrics show whether the system state is normal. For example, if Dropped Requests Rate is not NA, the APIServer is dropping requests because its capacity cannot keep up with demand, and the cause must be located and handled immediately.

Cluster-Level Summary

This section includes non-LIST read request RT, LIST read request RT, write request RT, read inflight requests, mutating inflight requests, and the number of dropped requests per unit time. This part of the dashboard is implemented based on ACK best practices.

For response-time observability, you can intuitively observe the response time of different resources, different verbs, and different scopes at different points in time and over different intervals, and you can filter by quantile. There are two important things to examine:

  1. Whether the curve is continuous
  2. The RT value

First, let's explain curve continuity. From the continuity of the curve it is easy to see whether the requests are continuous or a single isolated request.

The following figure shows that the APIServer received PUT leases requests during the sampling period, and the P90 RT in each sampling period is 45ms.

Since the curve in the figure is continuous, the requests exist throughout the sampling periods, so they are continuous requests.

The following figure shows that the APIServer received LIST daemonsets requests within the sampling period, and the P90 RT of the sampling period that contains sample values is 45ms.

Because there is only a single point in the figure, the request exists in only one sampling period. This scenario comes from a user executing kubectl get ds --all-namespaces.

Now let’s explain RT.

Execute kubectl create configmap cm1MB --from-file=cm1MB=./configmap.file over a public-network SLB connection.

In the log recorded by the APIServer, the RT of this POST configmaps request is 9.740961791s, which falls into the (9, 10] bucket of apiserver_request_duration_seconds_bucket, so a sample point is added to the bucket with le=10. In the observability display, a value of 9.9s is computed according to the 90th percentile and shown graphically. This is the relationship between the real RT of the request recorded in the log and the RT shown in the observability display.
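A simplified Go sketch of the linear interpolation that PromQL's histogram_quantile() applies to such cumulative buckets shows how a single slow observation maps to the displayed P90 (the bucket boundaries below are illustrative assumptions, not the full APIServer bucket layout):

```go
package main

import "fmt"

// bucket mirrors one le=<upper bound> entry of a cumulative Prometheus
// histogram such as apiserver_request_duration_seconds_bucket.
type bucket struct {
	upperBound float64 // the "le" label
	count      float64 // cumulative count of observations <= upperBound
}

// quantile is a simplified version of PromQL's histogram_quantile():
// it locates the bucket containing the requested rank and interpolates linearly.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	prevBound, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			return prevBound + (b.upperBound-prevBound)*(rank-prevCount)/(b.count-prevCount)
		}
		prevBound, prevCount = b.upperBound, b.count
	}
	return buckets[len(buckets)-1].upperBound
}

func main() {
	// One observation of 9.74s falls into the (9, 10] bucket, so the reported
	// P90 lands near the upper edge of that bucket rather than at the exact
	// latency recorded in the log.
	buckets := []bucket{{5, 0}, {8, 0}, {9, 0}, {10, 1}}
	fmt.Printf("P90 ≈ %.2fs\n", quantile(0.9, buckets)) // prints 9.90s
}
```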

Therefore, the monitoring dashboards can be used together with the log observability capability: the information in the logs is summarized visually into a global view, and best practice is to analyze the monitoring dashboards and the logs in combination.

```
I0215 23:32:19.226433       1 trace.go:116] Trace[1528486772]: "Create" url:/api/v1/namespaces/default/configmaps,user-agent:kubectl/v1.18.8 (linux/amd64) kubernetes/d2f5a0f,client:39.x.x.10,request_id:a1724f0b-39f1-40da-b36c-e447933ef37e (started: 2021-02-15 23:32:09.485986411 +0800 CST m=+114176.845042584) (total time: 9.740403082s):
Trace[1528486772]: [9.647465583s] [9.647465583s] About to convert to expected version
Trace[1528486772]: [9.66055479s] [13.089129ms] Conversion done
Trace[1528486772]: [9.660561026s] [6.317µs] About to store object in database
Trace[1528486772]: [9.687076754s] [26.515728ms] Object stored in database
Trace[1528486772]: [9.740403082s] [53.326328ms] END
I0215 23:32:19.226568       1 httplog.go:102] requestID=a1724f0b-39f1-40da-b36c-e447933ef37e verb=POST URI=/api/v1/namespaces/default/configmaps latency=9.740961791s resp=201 UserAgent=kubectl/v1.18.8 (linux/amd64) kubernetes/d2f5a0f srcIP="10.x.x.10:59256" ContentType=application/json:
```

Next, let's explain how RT is directly related to the specifics of the request and the size of the cluster.

In the preceding example, the same 1MB ConfigMap is created: over the public-network link the request took about 9s because of network bandwidth and latency, while in a test over the internal network it took only 145ms, so the influence of network factors is significant.

Therefore, RT is related to the resource object being operated on, the payload size, the network, and so on: the slower the network and the larger the payload, the larger the RT.

For a large K8s cluster, a full LIST (for example of Pods or Nodes) can sometimes involve a large amount of data, which increases the amount of data transferred and the RT. Therefore there is no absolute healthy threshold for RT metrics; it must be evaluated comprehensively against the specific request operation, the cluster scale, and the network bandwidth. As long as the business is not affected, the value is acceptable.

For small K8s clusters, an average RT of 45ms to 100ms is acceptable; for a cluster with around 100 nodes, an average RT of 100ms to 200ms is acceptable.

However, if the RT continuously reaches the level of seconds, or even hits 60 seconds so that requests time out, this is in most cases abnormal and you need to determine further whether the behavior meets expectations.

Both inflight-request metrics are exposed through the APIServer's /metrics endpoint; they measure the APIServer's capacity for handling concurrent requests. If the number of concurrent requests reaches the thresholds specified by the APIServer parameters max-requests-inflight and max-mutating-requests-inflight, APIServer rate limiting is triggered. This is usually an anomaly that needs to be located and handled quickly.
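As a quick way to inspect these series, a small client-go sketch can fetch the APIServer's raw /metrics text (the same endpoint Prometheus scrapes) and print only the inflight-related lines; the kubeconfig path is an assumption:

```go
package main

import (
	"bufio"
	"bytes"
	"context"
	"fmt"
	"strings"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Fetch the raw Prometheus-format metrics exposed by the APIServer.
	raw, err := cs.CoreV1().RESTClient().Get().AbsPath("/metrics").DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}
	// Keep only the inflight-request series.
	sc := bufio.NewScanner(bytes.NewReader(raw))
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "inflight") && !strings.HasPrefix(line, "#") {
			fmt.Println(line)
		}
	}
}
```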

QPS & Latency

This section visually displays the QPS and RT of requests, classified by verb and API resource for aggregated analysis. It also displays the error-code classification of read and write requests, so the types of error codes returned at different points in time can be seen directly.

Client Summary

This section visually displays the requesting clients together with their operations and resources.

QPS By Client counts the QPS of different clients along the client dimension.

QPS By Verb + Resource + Client counts the distribution of requests within 1s by client, verb, and resource.

Based on ARMS Prometheus, ACK Pro provides etcd and kube-scheduler dashboards in addition to the APIServer one; ACK and ACK Pro also provide dashboards for CoreDNS, the K8s cluster, K8s nodes, Ingress, and more, which users can view in ARMS. These dashboards incorporate the best practices of ACK and ARMS in production environments, helping users observe the system via the shortest path, identify root causes, and improve O&M efficiency.

Logging observability solution

Alibaba Cloud Log Service (SLS) is Alibaba Cloud's standard logging solution and connects to various types of log storage.

For managed component logs, ACK supports collecting the logs of the managed control-plane components (kube-apiserver / kube-controller-manager / kube-scheduler) from the ACK control plane into a Log Project of the user's SLS Log Service.

For user-side logs, users can use Alibaba Cloud's Logtail and log-pilot solutions to collect the required container, system, and node logs into an SLS Logstore and then view them in SLS.

Event observability scheme + NPD observability scheme

The Kubernetes architecture is designed around a state machine: transitions between different states generate corresponding events, transitions between normal states generate Normal-level events, and transitions between normal and abnormal states generate Warning-level events.

ACK provides an out-of-the-box event monitoring solution for container scenarios: the node-problem-detector (NPD) maintained by ACK together with kube-eventer provides the container event monitoring capability.

  • node-problem-detector (NPD) is a tool for diagnosing Kubernetes nodes. NPD converts node anomalies, such as Docker Engine hang, Linux kernel hang, network outages, and file descriptor anomalies, into node events; combined with kube-eventer, this closes the loop on node event alerting.
  • kube-eventer is an open source Kubernetes event offline tool maintained by ACK. It can ship cluster events to systems such as DingTalk, SLS, and EventBridge, and provides filtering conditions at different levels to achieve real-time event collection, targeted alerting, and asynchronous archiving.

NPD detects node problems or failures based on its configuration and third-party plug-ins and generates the corresponding cluster events. The Kubernetes cluster itself also generates various events as its state changes, for example Pod eviction or image pull failure. The Kubernetes event center of Log Service (SLS) aggregates all events in Kubernetes in real time and provides storage, query, analysis, visualization, and alerting capabilities.

ACK observability outlook

ACK and the related cloud products have implemented comprehensive observability for Kubernetes clusters, covering metrics, logs, tracing, events, and more. Future directions include:

  • Mining more application scenarios and associating them with observability to help users use K8s better. For example: monitoring the memory/CPU and other resource levels of the containers in a Pod over a period of time, using the historical data to analyze whether the user's Kubernetes requests/limits are reasonable, and recommending container requests/limits if they are not; or monitoring requests with excessively large APIServer RT and automatically analyzing the causes of the abnormal requests and suggesting how to handle them;
  • Linking multiple observability technologies, such as K8s events and metric monitoring, to provide richer and more multi-dimensional observability.

We believe that ACK observability will develop in ever broader directions and bring customers increasingly outstanding technical and social value.

Monitoring system

The monitoring system needs to be comprehensive so that problems are discovered the moment they occur and nothing is missed. Based on our accumulated technology, we divide monitoring into three parts:

  • Resource monitoring: monitors the resources required when a service is running. This part relies on Open-Falcon, whose basic functions are complete, which is essentially usable out of the box, and which is sufficiently extensible; however, its UI interaction is unfriendly and the barrier to entry is high. Since only a limited number of people need to work directly in this area, we simply visualized the most commonly used resources with Grafana.
  • Application monitoring: monitors application availability and performance metrics, including the link call traces of apps, web frontends, and backend services. The underlying layer relies on Yanxuan's Caesar system (a customized version based on secondary development of Pinpoint); the frontend and client parts were completed by Yanxuan. The architecture:

  • Service monitoring: monitors whether the application's business logic is correct. Based on the data lake concept, it provides real-time monitoring of massive data with second-level response. Through the platform, users can quickly complete data source access, data model construction, dashboard customization, and alarm configuration:

  • Fast access of various data sources (logs/binlog) to dashboards and alerting
  • Real-time data monitoring during major promotions, supporting second-level computation over several log files at tens of thousands to hundreds of thousands of RPS (records per second)
  • High concurrency: with a number of dashboards, each containing a number of charts, it can support actual usage during major promotions

Real-time service monitoring (GoldenEye) relies heavily on Yanxuan's logging platform at the bottom layer and supports log collection for containerized applications. Here is the architecture diagram:

CI/CD

CI/CD is arguably the core process in DevOps, and Yanxuan encountered several problems in this area:

  • Inconsistent branch management strategies: most teams release from trunk, but some release from branches; even among the trunk-based teams, branch naming conventions differ, and branch merging strategies differ as well.
  • Lack of uniformity in CI/CD tools: some teams use GitLab CI, others use Jenkins. GitLab CI is tightly combined with the code project, so Jenkins configuration can be omitted and it is easy to use; Jenkins gives better control over the mandatory CI tasks and can be enriched with various plug-ins, but each project team then needs members who are familiar with Jenkins. There is also more than one release tooling system.
  • Insufficient automated test coverage: few modules achieve a truly high degree of automation, and in many cases tests must be triggered or executed and verified manually. There is frequent cross-communication between the different functional roles throughout the CI/CD process.

The key to solving these problems is to unify around the concept of an "artifact" and to organize and decouple the whole CI/CD process with artifacts at the center.

Originally, in both the unit testing phase and the joint debugging phase, the application under verification was compiled and packaged directly from the code branch; only after QA verification was complete would the release be deployed to production (the code would also be merged into trunk and tagged, and the compile, package, and release process would then be completed based on the specified tag at deployment time). In this way, the artifact verified in the test environment can easily differ from the artifact actually running in production (even with identical code, different compile and package runs, for example with different configuration files, can produce application packages with different execution logic), which poses a quality risk.

Our expectation: the artifact that development delivers to QA and the one finally deployed to production should be identical, and configuration differences caused by different environments should be obtained from a dynamic configuration center. (Another approach is to pack the configuration files of all environments inside the artifact and select one dynamically via environment variables rather than through a configuration center, but this carries security risks, for example exposing production connection addresses and keys in the test environment.)

Therefore, to enforce this specification, Yanxuan chose to change the point at which "artifacts" are produced: from Continuous Delivery to Continuous Integration, ensuring that QA's validation process starts from artifacts rather than from the code base. In addition, moving artifact production to the CI stage is also better suited to building application containers.

Finally, the solution for CI/CD is:

  • The CI process from code to artifact is completed entirely by GitLab CI. Branch management strategies are sorted out and trigger different integration processes, owned uniformly by development. Compilation, packaging, code style checks, and follow-up fixes from basic tests are what developers are most familiar with, and GitLab CI integrates more closely with the project, so adjusting the corresponding CI tasks when the project changes is more natural. At the same time, practicing Pipeline as Code will make the subsequent evolution toward Auto DevOps easier.
  • The validation process in which candidate artifacts pass QA testing and are finally confirmed as deployable takes place within the automated testing system. Currently, the mandatory validation tasks are controlled through Yanxuan's quality management platform to ensure that the quality gate for artifacts is enforced. This can be done entirely by QA, removing the strong coupling with development, and as the automated test system is continuously built out and improved, execution efficiency keeps increasing.
  • Artifact management, release planning, and deployment onto different resources are implemented on the Opera release platform.

R&D performance platform

The prerequisite for the efficient operation of the whole DevOps system is a unified pipeline that links every stage organically and connects the capabilities in each vertical domain. For this part, Yanxuan currently uses the Group's Overmind as the solution, which has the following three key features:

  • "One-stop" work style: build an end-to-end continuous delivery pipeline that lets the R&D team focus on the flow of value rather than on the to-do list at each stage, reducing the cost of switching between different platforms.
  • Visualization: visualizing the whole process makes it easy to control the blocking points in the process and to ensure the standards and process quality of the whole chain.
  • Measurement and analysis of the whole process: "If you can't measure it, you can't improve it." Through the performance platform, performance data from different domains can be connected and the correlations between them found, so that the bottlenecks worth improving and optimizing are viewed from a more global perspective, avoiding a long-term firefighting, stopgap state.

Follow-up

Yanxuan's DevOps toolchain will continue to be built in an upward spiral along with the needs of the business and the R&D team: one phase focuses on building the capability systems in each domain to improve the depth of functionality and execution efficiency in specific directions; the next phase focuses on building cross-domain capability collaboration platforms to better connect the different capabilities, reduce the cost of use, and examine the gaps in capability coverage.

The current construction is mainly in the following aspects:

  • Risk control of the change process: integrate change events from different systems, define and quantify risk uniformly, and assist decision making for CD.
  • Environment governance of services, to resolve environment conflicts in the DevOps process.
  • Standardization and plug-in packaging of existing capability interfaces, to facilitate subsequent capability iteration and output.

Finally, a recent panorama concludes this article:

