The author | Ye Jianhong

Background

Alibaba Cloud Service Mesh (ASM) entered public beta in February 2020. In the two years since, a large number of users have adopted it as the service governance platform for their production applications. ASM is built on open source Istio, and Istio itself is still young: 2021 saw many new developments, such as eBPF, Multi-Buffer, and Proxyless. Every change in Istio gets people asking: is it time to adopt a service mesh?

This article is not intended to review the changes or trends in Istio or in Alibaba Cloud ASM. Instead, let’s talk about how end users actually use ASM.

Why a service mesh?

The idea of a service mesh is to sink service governance capabilities into the infrastructure so that the business can focus on business logic. It is easy to list the governance capabilities a service mesh brings: traffic management, observability, secure service-to-service communication, and policy enforcement such as rate limiting and access control. But do end users really need these features? Do Istio’s capabilities really meet production requirements? Is it worth the hassle of introducing a service mesh for one or two functions?

Take traffic management. The most talked-about feature is traffic splitting, which is often used for grayscale (canary) releases or A/B testing. Istio’s design here is simple, but by default it cannot manage traffic across a whole call chain. Unless a custom header or label is passed transparently through every node of the microservice topology, splitting traffic consistently across related services is impossible.
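To make the limitation concrete, here is a minimal sketch in Python (not Istio configuration) of what happens when every hop applies its own weighted split independently. The 90/10 weights, the version names, and the function names are illustrative assumptions.

```python
import random

def pick_version(weights, rng):
    """Pick one version with probability proportional to its weight."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]

def count_all_gray(hops, weights, requests, seed=0):
    """Count requests that land on "v2" at every hop when hops decide independently."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(requests):
        if all(pick_version(weights, rng) == "v2" for _ in range(hops)):
            hits += 1
    return hits

weights = {"v1": 90, "v2": 10}          # hypothetical 90/10 split at each hop
one_hop = count_all_gray(1, weights, 10_000)
two_hops = count_all_gray(2, weights, 10_000)
```

With one hop, roughly 10% of requests reach v2. With two independent hops, only about 1% stay on v2 end to end, which is why a tag has to travel with the request if related services are to be grayed together.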

Take security. Setting up TLS authentication for a large number of microservices by traditional means is close to mission impossible, whereas the mTLS encryption offered by Envoy makes encrypted communication between services, or zero-trust security, extremely easy. But do users really care about security between services, given that most of these microservices run in a Kubernetes cluster inside a VPC on the cloud?

Take observability. Istio integrates Metrics, Logging, and Tracing, which makes it easy to obtain the microservice topology and quickly locate problems. However, the Sidecar cannot see inside the process, so for code-level issues its Metrics and Tracing are dwarfed by classic APM software.

Finally, take policy enforcement such as rate limiting. Istio provides Global Rate Limit and Local Rate Limit, which are indeed hard requirements for many consumer-facing applications. But can they really meet the rate limiting and degradation needs of complex production applications? Real production environments vary widely, and a service mesh meets all kinds of challenges on the way to adoption. Which capabilities do end users care about most, and what practical experience has come out of rolling them out?

What are the main service mesh capabilities that users rely on?

I’ll leave the questions above unanswered for now. Let’s look at which service mesh capabilities ASM users mainly use; I’m sure readers will form their own answers.

Traffic management

The first, of course, is traffic management, the Istio capability that most noticeably makes application releases less painful. Most ASM users choose ASM for traffic management, mainly for grayscale releases or A/B testing. The most common application scenario is as follows:

The gray traffic switch in the figure above takes place at the Ingress gateway, and the internal services stay closed within their respective namespaces. The scheme is simple and effective. The disadvantage is that each grayscale release requires deploying the full set of microservices in the grayscale namespace. Another natural idea is full-link grayscale release, which I sometimes like to call Dark Release. What is full-link grayscale release? As shown below:

One or more services can be grayed arbitrarily, without the high cost of deploying a full copy of the microservices. Header-based traffic routing in community Istio can implement full-link grayscale release, provided that every service in the call chain is developed to pass the custom header through transparently.
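The dependency on header passthrough can be illustrated with a small Python simulation. This is a sketch under assumptions: the header name `x-gray-tag`, the service names, and the fallback to a `base` version are hypothetical and are not taken from Istio or ASM.

```python
GRAY_HEADER = "x-gray-tag"  # hypothetical header name

def route(service, headers, gray_versions):
    """Serve from the tagged gray version if the service has one, else from base."""
    tag = headers.get(GRAY_HEADER)
    if tag and tag in gray_versions.get(service, set()):
        return tag
    return "base"

def call_chain(services, headers, gray_versions, propagating):
    """Walk a call chain; a service outside `propagating` drops the header downstream."""
    path = []
    current = dict(headers)
    for svc in services:
        path.append((svc, route(svc, current, gray_versions)))
        if svc not in propagating:
            current = {k: v for k, v in current.items() if k != GRAY_HEADER}
    return path

gray_versions = {"a": {"gray"}, "c": {"gray"}}   # b has no gray version
request = {GRAY_HEADER: "gray"}
full = call_chain(["a", "b", "c"], request, gray_versions, propagating={"a", "b", "c"})
broken = call_chain(["a", "b", "c"], request, gray_versions, propagating={"a", "c"})
```

In `full`, service c is served by its gray version even though b has none. In `broken`, b does not forward the header, so c silently falls back to base.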

This approach takes some effort: every service requires intrusive changes, and usually only new projects and applications can be designed this way from the start. Is there an alternative? ASM provides a full-link grayscale release scheme based on Tracing. The principle is not complicated: since the microservices in a call chain need a header or label to tie their requests together, the trace ID is an obvious ready-made “connector.”
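One way to picture the trace ID as a connector is the sketch below. It is a guess at the general idea, not ASM’s documented implementation: the `SidecarSketch` class and the `x-gray-tag` header are hypothetical, while `x-b3-traceid` is the B3 trace header that a tracing-enabled application already forwards.

```python
class SidecarSketch:
    """Remembers the gray tag seen on inbound requests, keyed by trace ID."""

    def __init__(self):
        self._tag_by_trace = {}

    def on_inbound(self, headers):
        # Record which tag this trace carries when the request enters the workload.
        trace_id = headers.get("x-b3-traceid")
        tag = headers.get("x-gray-tag")
        if trace_id and tag:
            self._tag_by_trace[trace_id] = tag

    def on_outbound(self, headers):
        # The application forwarded only the trace ID; restore the tag from it.
        out = dict(headers)
        tag = self._tag_by_trace.get(headers.get("x-b3-traceid"))
        if tag:
            out.setdefault("x-gray-tag", tag)
        return out

sidecar = SidecarSketch()
sidecar.on_inbound({"x-b3-traceid": "abc123", "x-gray-tag": "gray"})
tagged = sidecar.on_outbound({"x-b3-traceid": "abc123"})
untagged = sidecar.on_outbound({"x-b3-traceid": "zzz999"})
```

The application code only has to propagate the trace ID, which tracing already requires; the routing tag rides along with it.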

Compared with custom header passthrough, Tracing also requires code intrusion. However, Tracing has open source standards, so implementing Tracing also brings full-link traffic management with it; that is the power of open standards. In addition, for Java applications, Alibaba Cloud ARMS provides Tracing without code intrusion, and together with ASM it enables full-link grayscale release without code changes.

Back to real-world adoption: among ASM users, small and medium-sized enterprises or applications can often establish complete Tracing, but many applications from large companies have broken trace links, which is a real headache. Fortunately, grayscale releases of related services usually happen within a local part of the call graph, and an intact local trace link is enough to meet the grayscale requirement.

North-south traffic management

What we have discussed above is mainly east-west traffic management; north-south traffic management is an area with a richer ecosystem. Solo’s Gloo Edge and Gloo Portal are exemplary here, and there are also several Envoy-based or Mesh-oriented API gateways in China. What are the differences and connections between istio-ingressgateway and a legacy API gateway? There has been a lot of discussion in the community. My personal view is that there is no significant difference in atomic capabilities; the differences lie in the user-facing interface and the current ecosystem.

Some users adopt ASM not for service governance but for the enhanced capabilities of istio-ingressgateway. Istio’s Gateway, VirtualService, and DestinationRule definitions are clearer and better layered than the Kubernetes Ingress model, and combined with Envoy’s powerful extensibility, Envoy and istio-ingressgateway are becoming increasingly popular choices for gateways. A simple example is gRPC load balancing, which is so easy to implement with Envoy/Istio that many users’ Istio adoption starts there. Another example is AI serving (inference) scenarios, where the service chain is short and the latency added by the Sidecar is negligible.
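Why gRPC load balancing is a pain point without an L7 proxy can be shown with a short Python sketch. gRPC multiplexes requests over long-lived HTTP/2 connections, so a balancer that only picks a backend per connection sends every request to the same Pod. The backend names and function names below are illustrative.

```python
from collections import Counter
from itertools import cycle

def connection_level(backends, requests):
    """L4 balancing: the backend is chosen once, when the long-lived connection opens."""
    chosen = backends[0]
    return Counter(chosen for _ in range(requests))

def request_level(backends, requests):
    """L7 balancing: each request (HTTP/2 stream) is assigned round-robin."""
    rr = cycle(backends)
    return Counter(next(rr) for _ in range(requests))

backends = ["pod-a", "pod-b", "pod-c"]
l4 = connection_level(backends, 9)
l7 = request_level(backends, 9)
```

Here `l4` puts all nine requests on `pod-a`, while `l7` spreads them three per Pod, which is the behaviour an L7 proxy such as Envoy provides.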

Istio/Envoy extensions on gateways are currently mostly based on Lua or WASM, and many custom capabilities are delivered through EnvoyFilter. The adoption challenge is simple and blunt. Users say: “I can’t write Lua, let alone WASM.” Cloud vendors reply: no problem, we will write it; once enough scenario-specific extensions have been written, they can be collected into a plug-in marketplace to choose from as needed. Judging from users this year, many have heard of WASM, but it is still complicated to use.

Take a common application scenario: labeling ingress traffic, also called traffic coloring. Incoming traffic is marked according to its characteristics, such as whether it comes from a private network or the Internet, or the logged-in user’s information. After marking, the traffic can be diverted or used for grayscale release.
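The coloring step might look roughly like the Python sketch below. In practice this logic would live in a gateway extension (Lua or WASM); the header names `x-traffic-source`, `x-user-id`, and `x-gray-tag` and the gray user list are hypothetical.

```python
import ipaddress

def color_request(source_ip, headers, gray_users):
    """Return a copy of the headers with labels derived from the request's traits."""
    labels = {}
    ip = ipaddress.ip_address(source_ip)
    # Label by origin: private network versus public Internet.
    labels["x-traffic-source"] = "private" if ip.is_private else "internet"
    # Label by logged-in user: selected users are steered to the gray release.
    if headers.get("x-user-id") in gray_users:
        labels["x-gray-tag"] = "gray"
    return {**headers, **labels}

internal = color_request("10.0.0.5", {"x-user-id": "alice"}, gray_users={"alice"})
external = color_request("8.8.8.8", {"x-user-id": "bob"}, gray_users={"alice"})
```

Downstream routing rules can then match on the added labels instead of re-deriving them at every hop.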

Egress traffic control. Many users with high application security requirements use the Istio Egress gateway to control which layer 7 destinations their applications may access. A layer 3/4 Network Policy is easy to set up; for a layer 7 network policy, a service mesh is worth considering.
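The difference from a layer 3/4 policy is that the decision is made on the requested host name rather than on an IP and port. A minimal Python sketch of such a check follows, with a hypothetical allowlist; it is not the Egress gateway’s actual configuration model.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist of external hosts that workloads may call.
ALLOWED_HOSTS = ["api.example.com", "*.internal.example.com"]

def egress_allowed(host, allowed=ALLOWED_HOSTS):
    """Allow the request only if its host (port stripped) matches an allowlist pattern."""
    hostname = host.lower().split(":")[0]
    return any(fnmatchcase(hostname, pattern) for pattern in allowed)
```

For example, `egress_allowed("api.example.com:443")` passes, while a host that matches no pattern is rejected.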

Multilingual service governance

We talked about Istio traffic management above, and it may seem that the problem is basically solved. However, there is a hidden premise that is often overlooked: traffic management only takes effect if services are discovered through Kubernetes Services, that is, if calls between services carry the service’s Host header or target its Kubernetes clusterIP. In the real world, many microservice applications running on ACK use a registry for service discovery and are written in multiple languages. We see multilingualism becoming more common, often as a result of rapid business growth: to meet requirements quickly, different teams on different projects chose different languages, and service governance requirements came later. These may be Dubbo or Spring Cloud microservices using ZooKeeper, Eureka, or Nacos, or Go, C++, Python, and PHP microservices using Consul, etcd, or Nacos. Such services fetch a list of instance Pod IPs from the registry and call them directly. Envoy handles that traffic as PassthroughCluster, so it bypasses the Envoy filter chain, and traffic management and other Istio capabilities no longer apply.
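The hidden premise can be sketched as a lookup, shown below in Python. This is a deliberate simplification of how a Sidecar matches outbound traffic: the service table, the addresses, and the shortened cluster names are illustrative, and only the name `PassthroughCluster` comes from Istio.

```python
# Hypothetical view of the services the mesh knows about.
SERVICES = {
    "reviews": {"cluster_ip": "172.16.0.10", "hosts": {"reviews", "reviews.default.svc.cluster.local"}},
}

def select_cluster(dest_ip, host_header, services=SERVICES):
    """Match by clusterIP or Host header; anything unknown is passed straight through."""
    for name, svc in services.items():
        if dest_ip == svc["cluster_ip"] or host_header in svc["hosts"]:
            return f"outbound|{name}"   # routing rules and policies apply here
    return "PassthroughCluster"         # forwarded as-is, no mesh features

via_service = select_cluster("172.16.0.10", "reviews")
via_pod_ip = select_cluster("10.1.2.3", "10.1.2.3:8080")   # Pod IP from a registry
```

A call to a Pod IP obtained from Nacos or Consul matches nothing in the table, so it falls through to `PassthroughCluster`.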

As a result, the multilingual, non-intrusive microservice governance that Istio has promised since its inception has had a bumpy road to reality. How can microservices that use different registries coexist with microservices registered to Kubernetes and Pilot?

A simple solution is to remove the registry, use Kubernetes CoreDNS service discovery, and adopt the Service Mesh. ASM users usually choose this for newly developed applications, or for refactored applications with short service chains. However, when many applications are involved, this scheme raises challenges such as intrusive code changes and smooth migration, and it meets more resistance in practice.

Keep the original application architecture, or redesign for Kubernetes? Turn left or turn right? That is the question.

For scenarios that need to keep the registry, Alibaba Cloud has designed two schemes: service discovery synchronization and service discovery interception.

What is service discovery synchronization? Since the root of the problem is that there are two registries, Nacos/Consul and Pilot, the answer is to synchronize them with each other. Nacos/Consul is synchronized to Pilot via MCP over xDS, so that Mesh-side services can find the services registered in Nacos/Consul. If the registry-side services want to access Mesh services in the same way as before, a synchronization component is added to sync Pilot’s registration information back to Nacos. This solution is slightly more complex, but its advantage is that it preserves the existing architecture and development approach as much as possible. Alibaba Cloud MSE can provide service governance for the Java-side microservices.

Let’s look at the other solution: service discovery interception, also called the full Mesh solution.

The principle is very simple: the Sidecar intercepts the instance IP information returned by a registry such as Nacos and rewrites it to the Kubernetes clusterIP, so that the Envoy filter chain takes effect again. It can be summed up as “keep it, but ignore it.” The benefits of the full Mesh solution are also obvious: all microservices are managed through one unified Mesh stack (both data plane and control plane), which keeps the architecture clear and leaves application development unaffected.
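The rewrite step can be sketched in a few lines of Python. The response shape (a list of `ip`/`port` dictionaries) and the function name are assumptions for illustration; they are not the Nacos wire format or ASM’s implementation.

```python
def rewrite_instances(service, instances, cluster_ips):
    """Replace the Pod IPs returned by the registry with the service's clusterIP."""
    cluster_ip = cluster_ips.get(service)
    if cluster_ip is None or not instances:
        return instances   # unknown to Kubernetes: leave the response untouched
    # Collapse all instances into one entry that points at the Kubernetes Service.
    return [{"ip": cluster_ip, "port": instances[0]["port"]}]

registry_response = [{"ip": "10.1.2.3", "port": 8080}, {"ip": "10.1.2.4", "port": 8080}]
rewritten = rewrite_instances("orders", registry_response, {"orders": "172.16.0.20"})
unchanged = rewrite_instances("legacy", registry_response, {"orders": "172.16.0.20"})
```

The application still queries its registry as before, but the address it receives now leads through the Sidecar’s normal routing path.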

Both solutions are already in production use among Alibaba Cloud users, so it is up to you to analyze which one suits your application better.

Service security

When it comes to the security capabilities Istio provides, the first is mTLS encryption. Under Istio’s default Permissive policy, all services in the same mesh automatically obtain mTLS capability (some users seem not to realize that it is enabled by default). Users in China pay little attention to this capability, but overseas ASM users have strong demand for mTLS. One overseas user’s reasoning is that the nodes of an Istio or Kubernetes cluster may be spread across multiple AZs on the cloud, and the AZs of a public cloud are located in different data centers. Cross-data-center traffic may therefore be exposed to and sniffed by parties outside the cloud provider’s control, and may also be at risk of theft by data center administrators and O&M personnel. Overseas users generally pay more attention to security, which is also a “cultural” difference between China and other countries.

Another topic many users raise is custom external authorization. By default, Istio supports simple authentication and authorization, such as rules based on source IP and JWT. More complex authorization is handled through Istio’s custom external authorization extension. Authentication and authorization at the gateway are easy to understand; the more complex engineering problem of authorizing arbitrary services inside the Mesh is also greatly simplified by Istio.

Multi-cluster service mesh

Istio natively supports multiple Kubernetes clusters, and ASM simplifies onboarding multiple clusters. Why multiple clusters? At present, ASM users fall into two types. The first is unified Mesh management across multiple business middle platforms: the platforms are deployed in separate, relatively independent Kubernetes clusters and call each other directly, and the Mesh provides service governance across them. The second is disaster recovery (DR) with two clusters across AZs or Regions: Istio’s Locality Load Balancing enables DR switchover across AZs or Regions.
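The DR behaviour can be pictured with the following Python sketch of locality-aware endpoint selection. It is an all-or-nothing simplification under assumed region and zone names; Istio’s actual Locality Load Balancing is richer than this fallback.

```python
def pick_endpoints(endpoints, client_region, client_zone):
    """Prefer healthy endpoints in the caller's zone, then its region, then anywhere."""
    healthy = [e for e in endpoints if e["healthy"]]
    same_zone = [e for e in healthy
                 if e["region"] == client_region and e["zone"] == client_zone]
    if same_zone:
        return same_zone
    same_region = [e for e in healthy if e["region"] == client_region]
    return same_region or healthy

# Hypothetical endpoints spread over two regions.
endpoints = [
    {"name": "a1", "region": "cn-hangzhou", "zone": "h", "healthy": False},
    {"name": "a2", "region": "cn-hangzhou", "zone": "i", "healthy": False},
    {"name": "b1", "region": "cn-shanghai", "zone": "b", "healthy": True},
]
chosen = pick_endpoints(endpoints, "cn-hangzhou", "h")
```

With both local endpoints unhealthy, traffic from `cn-hangzhou` switches to `b1` in the other region without any change to the calling application.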

Observability

Observability is, roughly, monitoring. Of course, in the cloud native context it has a richer meaning, with more emphasis on early detection and active intervention. Istio’s enhancements to observability mainly lie in rich protocol-level metrics and Sidecar logs. Take circuit breaking as an example: a user might think the Sidecar in the mesh is a black box with little visibility into what happens inside, but in fact Envoy reports the circuit breaking process very clearly. However, we have found that a large number of users are unaware of the rich metrics Istio and ASM provide and often do not take advantage of them. This may be due to the user’s stage of Istio adoption or feature priorities, but it is more likely because we, as a cloud vendor, have not made the product capability visible enough. We still have work to do.

In addition, the Mesh integrates Metrics, Logging, and Tracing at the application layer. More and more users use Grafana to connect these three data sources for unified observability analysis.

Policy enhancement

For policy enhancement, we mainly look at rate limiting. Community Istio provides Global Rate Limit and Local Rate Limit. Global Rate Limit requires a central Rate Limit Server. Local Rate Limit is delivered to each Sidecar’s local configuration through EnvoyFilter. The problem with Global Rate Limit is that it depends on a centralized rate limiting service and adds latency every time a service’s Sidecar intercepts a request, while Local Rate Limit lacks a global view and is inconvenient to configure through EnvoyFilter. We feel EnvoyFilter should be introduced with great care in production environments, because Envoy version updates can easily cause incompatibilities.
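To illustrate why a local limit lacks a global view, here is a minimal token bucket sketch in Python. The parameter names mirror the token bucket settings of Envoy’s local rate limit filter (`max_tokens`, `tokens_per_fill`, `fill_interval`); the refill logic and the explicit `now` argument are simplifying assumptions.

```python
class TokenBucket:
    """Per-Sidecar token bucket; `now` is passed in so the sketch is deterministic."""

    def __init__(self, max_tokens, tokens_per_fill, fill_interval, now=0.0):
        self.max_tokens = max_tokens
        self.tokens_per_fill = tokens_per_fill
        self.fill_interval = fill_interval
        self.tokens = max_tokens
        self.last_fill = now

    def allow(self, now):
        # Add tokens for every whole fill interval that has elapsed.
        intervals = int((now - self.last_fill) // self.fill_interval)
        if intervals > 0:
            self.tokens = min(self.max_tokens,
                              self.tokens + intervals * self.tokens_per_fill)
            self.last_fill += intervals * self.fill_interval
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Three replicas, each with its own bucket of 2 requests per second.
sidecars = [TokenBucket(max_tokens=2, tokens_per_fill=2, fill_interval=1.0) for _ in range(3)]
admitted = sum(bucket.allow(0.5) for bucket in sidecars for _ in range(5))
```

Each Sidecar admits 2 of its 5 requests, so the service as a whole admits 6. The effective limit scales with the number of replicas, which is the missing global judgment described above.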

ASM integrates a Sentinel Filter based on the Envoy filter mechanism. Sentinel is Alibaba’s open source flow control and circuit breaking project, and it is now integrated into Envoy to provide richer, production-oriented performance and policy enhancements. It supports flow control, queuing, circuit breaking, degradation, adaptive overload protection, hotspot flow control, and observability.

Service mesh production practices

In production, both Envoy and Pilot perform well in small to medium scale scenarios (fewer than 1000 pods). At large scale (more than 1000 pods), the relevant configuration details need to be tuned. For observability, Envoy Sidecar logging has little impact on performance, enabling metrics at large scale may increase Sidecar memory usage, and the Tracing sampling rate needs to be controlled in a production environment.
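Controlling the sampling rate means recording only a fixed share of traces. The Python sketch below shows one common approach, a deterministic decision derived from the trace ID; it is an illustration under assumptions, not how Envoy or ASM implements sampling.

```python
import zlib

def is_sampled(trace_id, sampling_percent):
    """Deterministically sample: the same trace ID always gets the same decision."""
    # Map the trace ID to one of 10,000 buckets, giving 0.01% granularity.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < round(sampling_percent * 100)

trace_ids = [f"trace-{i}" for i in range(1000)]   # made-up trace IDs
at_one_percent = [t for t in trace_ids if is_sampled(t, 1.0)]
```

Because the decision depends only on the trace ID, every service along a call chain makes the same choice, so a sampled trace is not left with missing spans.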

Open source solutions exist for trimming the configuration pushed to each Sidecar. ASM also provides automatic Sidecar configuration recommendations based on access log analysis, so that the Sidecar on a workload only tracks the services it actually calls.

Some Istio feature configurations also need care, such as the relationship between the timeout and the retry policy’s backoff delay, and avoiding the stampede (thundering herd) effect.
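The relationship between timeout and retry backoff can be worked through with a short Python sketch. The numbers (3 retries, a 2 second per-try timeout, a 25 ms base interval capped at 250 ms) and the function names are illustrative assumptions, not recommended settings.

```python
import random

def backoff_delays(retries, base, cap, rng):
    """Full-jitter exponential backoff: delay i is uniform in [0, min(cap, base * 2**i)]."""
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(retries)]

def worst_case_duration(retries, per_try_timeout, base, cap):
    """Upper bound: the first try plus every retry times out after a maximal backoff."""
    backoff = sum(min(cap, base * 2 ** i) for i in range(retries))
    return (retries + 1) * per_try_timeout + backoff

delays = backoff_delays(3, base=0.025, cap=0.25, rng=random.Random(0))
bound = worst_case_duration(3, per_try_timeout=2.0, base=0.025, cap=0.25)
```

Here the bound is about 8.2 seconds. If the overall request timeout is shorter than that, the later retries can never complete. The random jitter spreads retries from many clients over time, so that they do not all hit a recovering service at the same moment.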

Other things to note include dedicated deployment and configuration tuning of istio-ingressgateway in high-concurrency scenarios, and graceful startup and shutdown of the Sidecar. I will not go into details here.

The future of the service mesh?

As a pioneer of managed service mesh, ASM has gained a large number of users, who have strengthened our confidence in building this product. The service mesh is no longer a pile of buzzwords; it runs in real production applications and solves one small technical problem after another. At its essence, the service mesh is still about solving business problems.

The service mesh community is thriving, and ASM still has work to do, but it has gained momentum through market validation. The epic story of the service mesh is just beginning.