I was honored to serve as the producer and a lecturer of the Cloud Native track at the GIAC 2021 conference, for which I organized four lectures. In the process, as a member of the audience, I learned a lot of useful knowledge from my peers' talks. This article is a set of side notes on the 2021 GIAC Cloud Native track: a glimpse of the state of cloud native technology development in 2021 and its future trends.

The term cloud native covers a wide range of meanings, including efficient utilization, delivery, deployment, and operation of resources. At the system level, it can be divided into cloud native infrastructure (such as storage, networking, and management platforms like K8s), cloud native middleware, cloud native application architecture, and the cloud native delivery and operations system. The four topics of this track basically cover these four directions:

  • 1. Blast Radius Governance of Cloud Native Services, by Huang Shuai, senior technical expert at Amazon
  • 2. Kuaishou Middleware Mesh Practice, by Jiang Tao, head of service mesh at the Kuaishou Infrastructure Center
  • 3. Monitoring Kubernetes Events with SkyWalking, by Ke Zhenxu, Observability Engineer at Tetrate
  • 4. Dubbogo 3.0: The Foundation of Dubbo in the Cloud Native Era, by myself, as leader of the Dubbogo community

The main points of each talk are summarized below, based on my personal notes and recollections. Given limited time and my own limitations, some mistakes are unavoidable; corrections from experts are very welcome.

1 Blast radius of cloud native services

Personally, I would place Huang's topic in the category of cloud native application architecture.

He started with an AWS outage from about a decade ago: the configuration center of an AWS service was a CP system. A manual network change broke the configuration center's redundant backups, and an erroneous change by operations staff during the emergency recovery then left the configuration center unavailable [the number of effective replicas fell below half]. The data nodes of the whole storage system consequently judged their configuration data to be inconsistent and refused service, and eventually the entire system crashed.

The direct cause of the whole incident is that the CAP theorem's definitions of availability and consistency are very strict and not well suited to real production systems. Data that serves as the configuration center of an online control plane should therefore be available first, under the premise of guaranteeing eventual consistency.

Furthermore, in modern distributed systems, operator errors, network anomalies, software bugs, and exhaustion of network/storage/compute resources are all inevitable. Designers in the distributed era generally guarantee system reliability through various forms of redundancy [such as storage partitioning and multiple service replicas], building reliable services on top of unreliable hardware and software.

But there is a catch: sometimes redundancy itself reduces system reliability because of avalanche effects. In the incident above, a human configuration error triggered a series of failures in highly correlated software systems, an avalanche known as the "poison effect of horizontal spread." At this point, the design mindset expands from "providing reliable services on top of unreliable hardware and software" to "reducing the blast radius of incidents through various isolation mechanisms": when an inevitable fault does occur, the loss should be kept as small as possible, so that services remain available to an acceptable degree.

With this idea in mind, Huang presented the following fault isolation methods:

  • Moderate service granularity

    The service granularity of microservices is not "the finer the better." If services are too fine-grained and too numerous, the first consequence is that almost no one in the organization understands the overall service logic, which increases the burden on maintainers: everyone is willing to make small changes and no one is willing to make big improvements.

    The second consequence of overly fine-grained services is an exponential increase in the total number of microservice units, which raises the cost of container orchestration and deployment. Moderate service granularity leaves room for the architecture to evolve while keeping deployment costs down.

  • Sufficient isolation

    During service orchestration, obtain the power supply and network topology information of the data center to ensure that strongly correlated systems are deployed "neither too far apart nor too close together."

    "Not too close" means that replicas of the same service are not deployed in cabinets sharing the same power supply, nor in an availability zone (AZ) sharing the same network plane; "not too far" means, for example, that multiple replicas can be deployed across several IDCs in the same city. These two principles are used to balance performance against system reliability.

  • Random partition

    The essence of random partitioning is to shuffle service requests so that requests for a given service can travel through multiple channels [queues], ensuring that request processing is not affected when some of the channels fail. Applying random partitioning to spread users across multiple cells greatly reduces the blast radius.

    This is similar to the shuffle sharding used in the K8s APF (API Priority and Fairness) rate-limiting feature; a small Go sketch of the idea appears after this list.

  • Chaos engineering

    Through continuous, internalized chaos engineering practice, defuse potential problems in advance, reduce "failure points" as much as possible, and improve system reliability.
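
As a concrete illustration of the "random partition" idea above (and of the shuffle sharding it resembles in K8s APF), here is a minimal Go sketch, assuming a fixed pool of cells and a string tenant ID; the function name and hashing choice are mine, not taken from any of the systems mentioned. Each tenant gets a small, stable subset of cells, so the overlap between any two tenants' cell sets, and hence the blast radius of a poisonous workload, stays small.

```go
// Package shuffleshard is a minimal sketch of shuffle sharding: each tenant is
// deterministically mapped onto a small subset of cells, so a failure caused by
// one tenant rarely covers the full cell set of another tenant.
package shuffleshard

import (
	"hash/fnv"
	"math/rand"
)

// ShardFor returns shardSize distinct cell indices (out of totalCells) for the
// given tenant. The tenant ID seeds the selection, so the mapping is stable
// across calls and across processes. shardSize must be <= totalCells.
func ShardFor(tenant string, totalCells, shardSize int) []int {
	h := fnv.New64a()
	h.Write([]byte(tenant))
	r := rand.New(rand.NewSource(int64(h.Sum64())))

	// Take the first shardSize elements of a seeded permutation of all cells.
	perm := r.Perm(totalCells)
	return perm[:shardSize]
}
```

For a rough sense of the effect: with 8 cells and shards of size 2, there are C(8,2) = 28 possible cell pairs, so two tenants land on exactly the same pair only about 1 time in 28, which is what keeps the blast radius of any single tenant small.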

2 Monitor Kubernetes events using SkyWalking

Although this topic was scheduled as the third lecture and belongs to the cloud native delivery and operations system, it is closely related to the previous topic, so I describe it here first.

How to improve the observability of a K8s system has always been a central technical problem for the various cloud platforms. The basic data for K8s observability are K8s events, which contain the full-link information of a Pod and other resources from request to scheduling and resource allocation.

SkyWalking provides observability capabilities such as logging, metrics, and tracing. It was originally intended for observing microservices; it now also provides skywalking-kubernetes-event-exporter, which is dedicated to listening for K8s events, filtering and collecting them, and sending them to the SkyWalking backend for analysis and storage.
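
For readers unfamiliar with what a K8s event looks like at the source, here is a minimal Go sketch using client-go (it is not the exporter's actual code) that watches events cluster-wide and prints the fields, reason and message included, that an observability backend such as SkyWalking would aggregate:

```go
// Minimal illustration of listening to Kubernetes events with client-go.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch events in all namespaces; each event records the involved object,
	// a reason, and a human-readable message.
	watcher, err := clientset.CoreV1().Events(metav1.NamespaceAll).
		Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for we := range watcher.ResultChan() {
		if e, ok := we.Object.(*corev1.Event); ok {
			fmt.Printf("[%s] %s/%s %s: %s\n",
				e.Type, e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Reason, e.Message)
		}
	}
}
```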

During his talk, Ke spent a lot of energy on how to enrich the visualization of the whole system. The point of personal interest to me, shown in the figure below, is filtering and analyzing events in a way similar to big-data stream programming.

Its visualization and stream analysis methods can serve as references for Ant Group's Kubernetes platform.

3 Kuaishou middleware Mesh practice

In this talk, Jiang Tao of Kuaishou mainly explained Kuaishou's practice of Service Mesh technology.

Jiang divides Service Mesh into three generations. There are in fact many possible classification criteria, and each division has its own rationale. What is clear is that Jiang places Dapr in the third generation.

The figure above shows Kuaishou's Service Mesh architecture, which clearly draws on Dapr's ideas: sink the capabilities of underlying components into the data plane, and standardize request protocols and interfaces. Some of the specific work includes:

  • 1. Unified operation and maintenance, improving observability and stability, with fault injection and traffic recording
  • 2. Secondary development of Envoy, transferring only changed data and fetching data on demand, to solve the problem of a single instance carrying too many services
  • 3. Optimization of the protocol stack and serialization protocols
  • 4. Failure-oriented design: the Service Mesh can fall back to direct connection mode

Of personal interest, Jiang mentioned three challenges in landing Service Mesh technology at Kuaishou:

  • Cost: unified deployment and operations in a complex environment.
  • Complexity: large scale, high performance requirements, and complex policies.
  • Adoption: it is not a strong demand from the business side.

The third challenge in particular: the direct benefit of a Service Mesh usually accrues not to the business side but to the infrastructure team, so the business has no strong demand for it; moreover, Kuaishou's real-time business platform is very performance sensitive, and Service Mesh technology inevitably adds latency.

To promote the adoption of Service Mesh technology, Kuaishou's solution is:

  • 1. First, ensure system stability, and do not rush to roll out to large business volumes;
  • 2. Actively participate in business architecture upgrades by riding along with the company's major projects;
  • 3. Build together with the business on top of WASM extensibility;
  • 4. Select typical landing scenarios and establish benchmark projects.

Jiang finally presented Kuaishou's Service Mesh plan for the second half of the year:

Obviously, this route is also deeply influenced by Dapr, with little innovation in theory or architecture; the emphasis is instead on standardization and fast adoption of open source products.

In his speech, Jiang mentioned two benchmarks for landing Service Mesh technology: Ant Group and ByteDance. In fact, one of the most important reasons for their success is top-level management attention to advanced technology and strong cooperation from the business side.

4 Dubbogo 3.0: Dubbo's cornerstone in the cloud native era

As the lecturer for this topic, and given that the audience had paid 7,800 RMB for their tickets, I did not dwell on the existing features of Dubbogo 3.0 in my talk; I focused instead on its Service Mesh form and on flexible services.

One of the most important points of Dubbo 3.0 is the Proxyless Service Mesh. This concept actually originated with gRPC, and it is also the focus of recent work in the gRPC ecosystem. Its advantages are zero performance loss and easy upgrades for microservices. Of course, gRPC's own multi-language ecosystem is very rich, and another reason gRPC promotes this concept is that, as a neutral framework emphasizing stability, its performance is not very good, especially in the Proxy Service Mesh form.

The biggest shortcoming of the Dubbo ecosystem is that, apart from Java and Go, it lacks good multi-language capability. Personally, I think it is not a good idea to blindly follow in gRPC's footsteps and abandon other language capabilities altogether. The Dubbogo community has therefore produced the Pixiu project, which addresses the Dubbo ecosystem's multi-language capability in both gateway and sidecar forms, unifying the handling of north-south and east-west traffic in Pixiu.

Whatever form Service Mesh technology takes, its development in China has already passed its first wave of enthusiasm. After the benchmarks set by Ant Group and ByteDance, the technology itself still needs to evolve and to integrate more closely with business, so that small and medium-sized vendors can see its business value; only then will a second wave arrive. Service Mesh is particularly well suited to helping small and medium-sized vendors move their services onto K8s-based hybrid-cloud or multi-cloud environments, most of which use large amounts of open source software and help them avoid dependence on specific cloud vendors.

The flexible services of Dubbo 3.0 can basically be understood as a backpressure technique. The reason Dubbo and Dubbogo want to build flexible services is that, in the cloud native era, node anomalies are the norm and service capacity is hard to evaluate accurately:

  • 1. Machine specifications: under large-scale services, machine specifications are inevitably heterogeneous [e.g., affected by overselling], and even machines with the same specification age at different rates
  • 2. Complex service topology: the distributed service topology is constantly evolving
  • 3. Uneven service traffic: there are peaks and troughs
  • 4. Uncertain capacity of upstream dependencies: the capacity of caches/DBs changes in real time

The solution lies in adaptive rate limiting on the server side and adaptive load balancing on the client side.

The core formula is queue_size = limit * (1 - rt_noload / rt), where each field means:

  • limit: the QPS limit over a recent period of time
  • rt_noload: the minimum RT over a recent period of time (i.e., RT with no load)
  • rt: the average RT over a recent period of time, which can also simply be taken as the P50 RT

In other words, two forms of RT are used to evaluate the load of a method-level service: an increase in RT reflects rising overall load {CPU/memory/network/goroutines} and falling performance, while a decrease in RT reflects that the server can handle more requests.

Adaptive rate limiting: the server calculates queue_size at the method level and tracks inflight, the number of goroutines currently in use by that method. When a new request arrives, if inflight exceeds queue_size the request is rejected; the difference queue_size - inflight is fed back to the client through the response packet.
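
To make the server side concrete, here is a minimal Go sketch of the mechanism described above, assuming the window statistics (limit, rt_noload, rt) are already sampled elsewhere; the type and method names are illustrative and not the Dubbogo implementation, which, as noted below, is still under discussion in the community.

```go
// Package limiter sketches method-level adaptive rate limiting.
package limiter

import (
	"errors"
	"sync/atomic"
)

// MethodLimiter holds per-method statistics collected over a recent window.
type MethodLimiter struct {
	limit    int64 // QPS limit observed over the window
	rtNoLoad int64 // minimum RT over the window (ms), i.e. RT with no load
	rt       int64 // average (or P50) RT over the window (ms)
	inflight int64 // goroutines currently processing this method
}

// QueueSize implements queue_size = limit * (1 - rt_noload / rt).
func (l *MethodLimiter) QueueSize() int64 {
	rt := atomic.LoadInt64(&l.rt)
	if rt == 0 {
		return atomic.LoadInt64(&l.limit)
	}
	limit := float64(atomic.LoadInt64(&l.limit))
	rtNoLoad := float64(atomic.LoadInt64(&l.rtNoLoad))
	return int64(limit * (1 - rtNoLoad/float64(rt)))
}

// ErrOverloaded is returned when a request must be rejected.
var ErrOverloaded = errors.New("method overloaded, request rejected")

// Acquire is called when a new request for the method arrives. It rejects the
// request if inflight would exceed queue_size, and otherwise returns the
// remaining capacity (queue_size - inflight) to piggyback on the response.
func (l *MethodLimiter) Acquire() (remaining int64, err error) {
	queueSize := l.QueueSize()
	inflight := atomic.AddInt64(&l.inflight, 1)
	if inflight > queueSize {
		atomic.AddInt64(&l.inflight, -1)
		return 0, ErrOverloaded
	}
	return queueSize - inflight, nil
}

// Release is deferred by the request handler once processing finishes.
func (l *MethodLimiter) Release() {
	atomic.AddInt64(&l.inflight, -1)
}
```

A request handler would call Acquire() on arrival, write the returned remaining capacity into the response packet, and defer Release() until processing completes.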

Adaptive load balancing: when the client receives, via heartbeat packets or responses, the load value queue_size - inflight reported by a server for a method, it can use it as a weight in weighted load balancing when making service calls. Of course, to avoid the herd effect putting instantaneous pressure on a single service node, a P2C (power of two choices) algorithm can also be provided; Dubbogo can implement both and let users choose.
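
A matching client-side sketch of the P2C choice follows, assuming each replica's latest reported load (queue_size - inflight) has been stored on a Node struct of my own invention:

```go
// Package loadbalance sketches a P2C choice over server-reported load values.
package loadbalance

import "math/rand"

// Node is a server replica; Load is the latest queue_size - inflight value
// carried back in heartbeat or response packets (larger means more headroom).
type Node struct {
	Address string
	Load    int64
}

// PickP2C compares two distinct random replicas and returns the one with more
// remaining capacity, avoiding the herd effect of always picking the single
// "best" node. nodes must be non-empty.
func PickP2C(nodes []Node) Node {
	if len(nodes) == 1 {
		return nodes[0]
	}
	// Pick two distinct random indices.
	i := rand.Intn(len(nodes))
	j := rand.Intn(len(nodes) - 1)
	if j >= i {
		j++
	}
	if nodes[i].Load >= nodes[j].Load {
		return nodes[i]
	}
	return nodes[j]
}
```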

The overall design above is still under discussion in the community and is not the final implementation.

5 Closing thoughts

From 2017 to now, I have personally taken part in more than a dozen domestic technical conferences of various levels, as either producer or lecturer. My speaking skills are not outstanding, but my time control is decent and I do not drag sessions out. This time, as host of GIAC's cloud native sub-stage, the track received an audience score of 9.65 [scored horizontally across all tracks], an acceptable overall result.

It is a privilege to live in this era and witness the rise and fall of the tide of cloud native technology. I am also very lucky to work on the Alibaba platform and to have witnessed Dubbogo 3.0 gradually landing in various scenarios inside Alibaba Cloud and DingTalk.

