Author: Wang Xining, Senior Technical Expert at Alibaba Cloud


This article is excerpted from the book Istio Service Mesh Technology Analysis and Practice, written by Wang Xining, a senior technical expert at Alibaba Cloud. Starting from basic concepts, it introduces what a service mesh is and what Istio is, and systematically covers the Istio service mesh in light of three development trends for service meshes in 2020.

Overseas analyses point out that Service Mesh technology will see the following three major developments in 2020:

  • Rapidly growing demand for service meshes;
  • Istio is hard to beat and is likely to become the de facto standard for service mesh technology;
  • As more service mesh use cases emerge, WebAssembly will open up new possibilities.

What is a service mesh

Gartner’s 2018 analysis of service mesh technology trends surveys a range of service mesh technologies, classifying them by whether an application’s service code must be aware of the service mesh and by the degree of lock-in involved.

Mesh technologies based on programming frameworks can help developers build well-architected services, but they tightly couple application code to the framework and the runtime environment. Sidecar-proxy-based service mesh technology removes these barriers for developers, is easier to manage and maintain, and offers a more flexible way to configure runtime policies.

In a microservice environment, a single application can be decomposed into independent components deployed as distributed services, which are typically stateless, transient, and dynamically scalable, running in container orchestration systems such as Kubernetes.

A service mesh generally consists of a control plane and a data plane. Specifically, the control plane is a set of services running in a dedicated namespace. These services perform control and management functions, including aggregating telemetry data, providing user-facing APIs, and supplying control data to the data-plane proxies. The data plane consists of transparent proxies running alongside each service instance. These proxies automatically handle all traffic in and out of the service; because they are transparent, they act as an out-of-process network stack, sending telemetry data to the control plane and receiving control signals from it.

Service instances can be started, stopped, destroyed, rebuilt, or replaced as needed. These services therefore need communication middleware that supports dynamic service discovery and self-healing connectivity, so that they can communicate with one another securely, dynamically, and reliably. This is exactly what the service mesh provides.

The service mesh is a dedicated infrastructure layer that makes service-to-service communication more secure, fast, and reliable. If you are building cloud native applications, you need a service mesh. Over the past year, the service mesh has emerged as a key component of the cloud native stack, reliably delivering requests through the complex service topologies that make up modern cloud native applications. In practice, a service mesh is usually implemented as a set of lightweight network proxies deployed alongside application code, without the application needing to be aware of them.

The concept of the service mesh as a separate layer is tied to the rise of cloud native applications. In the cloud native model, a single application may contain hundreds of services, each service may have thousands of instances, and each instance may be in a constantly changing state. This is why orchestrators like Kubernetes are increasingly popular and necessary. Communication between these services is not only becoming more complex; it is also a pervasive part of the runtime environment, so managing it is critical to end-to-end performance and reliability.

A service mesh is a networking model: an abstraction layer on top of TCP/IP. It assumes that the underlying layer-3/layer-4 network exists and can deliver bytes from one point to another. It also assumes that this network is as unreliable as every other part of the environment, so the service mesh must be able to handle network failures as well. In some ways, the service mesh is similar to TCP/IP: just as the TCP stack abstracts the mechanics of reliably delivering bytes between network endpoints, the service mesh abstracts the mechanics of reliably delivering requests between services. Like TCP, the service mesh does not care about the actual payload or how it is encoded; it only cares about getting a request from service A to service B while handling any failures along the way. Unlike TCP, however, the service mesh does more than just “make it work”: it also provides a unified application-level control point for introducing visibility and control into the application runtime. The explicit goal of the service mesh is to move service communication out of the invisible realm of infrastructure and into an ecosystem that can be monitored, managed, and controlled.

In a cloud native application, ensuring that requests are delivered reliably is far from easy. Service meshes manage this complexity through a variety of powerful techniques, supporting mechanisms such as circuit breakers, latency-aware load balancing, eventually consistent service discovery, retries, and timeouts to maximize reliability. These functions must all work together, and their interactions with the complex environment they run in matter as well.
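To make one of these mechanisms concrete, the following is a minimal circuit-breaker sketch in Python. It is purely illustrative: the class name, thresholds, and state handling are invented for this example, and Istio’s actual circuit breaking is implemented inside the Envoy proxy, not in application code.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after N consecutive failures,
    half-open after a cooldown, and closed again on a successful trial call.
    Illustrative only; not Istio/Envoy's implementation."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0        # consecutive failure count
        self.opened_at = None    # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # open: fail fast instead of loading a failing backend
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, allow one trial request through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0
            self.opened_at = None  # success closes the circuit
            return result
```

The same protective behavior, configured declaratively in Istio, spares every service from carrying logic like this in its own code.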

For example, when a request passes through a service mesh, the interaction can be roughly simplified as follows:

  • The service mesh component determines the target service by applying dynamic routing rules. Should the request be routed to the production service or a pre-release one? To a local data center or a service in the cloud? To the latest version under canary testing, or to an older version already validated in production? All of these routing rules are dynamically configurable and can be applied globally or to any slice of traffic.

  • Once the correct destination is found, the service mesh component retrieves the corresponding pool of instances, possibly many, from the relevant service discovery endpoint. If this information differs from what the mesh component observes in practice, it decides which source of information to trust.

  • The service mesh component selects the instance most likely to return a fast response, based on factors including the observed latency of recent requests.

  • The service mesh component attempts to send the request to the selected instance, recording the latency and response type of the result.

  • If the instance is down, unresponsive, or otherwise unable to process the request, the mesh component retries the request on another instance, but only if it knows the request is idempotent.

  • If an instance consistently returns errors, the mesh component evicts it from the load-balancing pool and retries it periodically later. Such transient failures are very common in distributed Internet applications.

  • If the request’s deadline has passed, the mesh component proactively fails the request rather than adding load through further retries. This is critical for distributed Internet applications, where a small glitch can otherwise snowball into a cascading failure.

  • Throughout, the service mesh component captures every aspect of the above behavior as metrics and distributed traces, and sends this data to a centralized metrics or tracing system.
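The steps above can be sketched as a toy data-plane proxy. This is not Istio or Envoy code; the class and its policies (lowest-observed-latency selection, a fixed retry budget for idempotent requests, an eviction threshold) are simplified assumptions for illustration only.

```python
class MeshProxy:
    """Toy model of the data-plane steps above: pick the instance with the
    lowest observed latency, retry idempotent requests elsewhere on failure,
    and evict instances that keep erroring. Purely illustrative."""

    def __init__(self, instances, max_retries=2, evict_after=3):
        self.latency = {i: 0.0 for i in instances}  # observed latency per instance
        self.errors = {i: 0 for i in instances}     # consecutive error count
        self.max_retries = max_retries
        self.evict_after = evict_after

    def pick(self):
        # step: choose the instance most likely to answer quickly
        if not self.latency:
            raise RuntimeError("no healthy instances")
        return min(self.latency, key=self.latency.get)

    def send(self, request, transport, idempotent=True):
        # transport(instance, request) -> (elapsed_seconds, response), or raises
        attempts = self.max_retries + 1 if idempotent else 1
        for _ in range(attempts):
            inst = self.pick()
            try:
                elapsed, response = transport(inst, request)
            except Exception:
                self.errors[inst] += 1
                if self.errors[inst] >= self.evict_after:
                    # step: evict a consistently failing instance from the pool
                    del self.latency[inst]
                continue  # step: retry on another instance
            self.errors[inst] = 0
            self.latency[inst] = elapsed  # step: record observed latency
            return response
        raise RuntimeError("request failed after retries")
```

A real mesh layers deadlines, circuit breaking, and telemetry export on top of this skeleton, but the control flow is essentially the same.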

It is important to note that these capabilities give distributed applications both point-wise and application-wide resilience. Large-scale distributed systems, however they are built, share one defining characteristic: any small localized failure can escalate into a system-wide catastrophic failure. The service mesh must be designed to prevent such escalation by shedding load and failing fast as the underlying system approaches its limits.

Why is a service mesh necessary

The service mesh is not itself new functionality; rather, it is a shift in where that functionality lives. Web applications have always had to manage the complexity of service communication, and the origins of the service mesh model can be traced through the evolution of these applications over the past fifteen years.

At the beginning of this century, the typical mid-sized Web application used a three-tier architecture: an application logic layer, a Web-service logic layer, and a storage logic layer. Communication between layers was complex but limited in scope. The architecture at this point had no mesh, but each layer’s processing code contained its own communication logic.

As systems grew to very large scale, this approach became overstretched. Large Internet companies in particular, facing enormous traffic, implemented a precursor of the cloud native approach: the application layer was broken into many services, now commonly known as “microservices,” with communication forming a topology between them. In these systems, the communication layer often took the form of “fat client” libraries, exemplified by Netflix OSS libraries such as Hystrix with its circuit-breaking capability. Although context-specific and tied to particular languages and frameworks, these libraries managed the form and capability of inter-service communication; they were a good choice at the time and were adopted by many companies.

In the cloud native era, the cloud native model rests on two important factors:

  • Containers (such as Docker) provide resource isolation and dependency management;
  • An orchestration layer, such as Kubernetes, abstracts the underlying hardware into a homogeneous pool of resources.

Although these library components to some extent let applications scale and cope with the failures that are ever-present in cloud environments, once there are hundreds or thousands of service instances, with an orchestration layer rescheduling instances from time to time, the path a single request follows through the service topology can become extremely complex. Meanwhile, as container technology spread and containers made it easy to write and run each service in a different language, the library-based approach became more and more limiting.

This combination of complexity and criticality increasingly demands a dedicated inter-service communication layer, separate from the application code and able to capture the highly dynamic nature of the underlying environment. That layer is the service mesh.

Service proxies let us add important functionality to services in a cloud environment. Each application can have its own requirements or configuration for how its proxy should behave given its workload. As the number of applications and services grows, configuring and managing a large fleet of proxies can be difficult. Moreover, having a proxy next to every application instance opens the door to rich, advanced capabilities that would otherwise have to be implemented in the application itself.

Together, the service proxies form a meshed data plane through which all inter-service traffic is processed and observed. The data plane is responsible for establishing, securing, and controlling traffic through the mesh. The management component that governs how the data plane behaves is called the control plane. The control plane is the brain of the mesh, providing a public API for mesh users to manipulate network behavior.

The Istio service mesh

Istio is an open platform for connecting, managing, and securing microservices. It provides an easy way to create a mesh of microservices with load balancing, inter-service authentication, and monitoring, and, crucially, it requires few changes to the services themselves. Istio is an open source project, originally created by Google, IBM, and Lyft, that offers a consistent way to connect, secure, manage, and monitor microservices. Istio helps you add resilience and observability to your service architecture transparently: applications do not have to know they are part of the service mesh, because Istio handles network traffic on their behalf whenever they interact with the outside world. If you are doing microservices, Istio can bring substantial benefits.

Istio provides the following functions:

  • Traffic management, which controls the flow of calls between services and of API calls, making calls more reliable and the network more robust under adverse conditions;
  • Observability, which captures the dependencies between services and the direction of traffic in service invocations, providing the ability to identify problems quickly;
  • Policy enforcement, which controls access policies for services without changing the services themselves;

  • Service identity and security, which provides verifiable identities for services in the mesh and the ability to protect service traffic so that it can flow across networks with varying levels of trust.

Istio’s first production-ready version, 1.0, was officially released on July 31, 2018, and version 1.1 followed in March 2019. The community then moved to rapid iteration, releasing ten point releases in three months. As of this book’s writing, the community has released version 1.4.

Istio’s data plane uses the Envoy proxy by default, which works out of the box and is deployed as a service proxy instance alongside each application. Istio’s control plane consists of components that provide operations APIs, proxy configuration APIs, security settings, policy declarations, and more for end users and operators. We will describe these control plane components later in the book.

Istio was originally built to run on Kubernetes, but its code was written to be deployment-platform-neutral. This means you can run an Istio-based service mesh on platforms such as Kubernetes, OpenShift, Mesos, and Cloud Foundry, and even deploy Istio on virtual machines or bare-metal servers. In later chapters, we will show how powerful Istio is for hybrid deployments spanning clouds, including private data centers. In this book we focus on deployment on Kubernetes, with more advanced chapters covering virtual machines and other environments.

Istio is Greek for “sail,” while Kubernetes is Greek for “helmsman” or “pilot.” From the very beginning, Istio was intended to work hand in hand with Kubernetes to run distributed microservice architectures efficiently and to provide a unified way to secure, connect, and monitor microservices.

With a service proxy next to each application instance, applications no longer need language-specific resilience libraries for circuit breaking, timeouts, retries, service discovery, load balancing, and so on. In addition, the service proxy handles metrics collection, distributed tracing, and log collection.

Because traffic in the service mesh flows through Istio’s service proxies, Istio has a control point at each application through which it can influence and direct network behavior. This lets service operators control traffic routing and achieve fine-grained rollouts through canary deployment, dark launch, staged rollout, and A/B testing. We will explore these features in a later section.

Core functions

Istio provides many key functions across the service mesh, including traffic management, security, observability, platform support, integration, and customization.

1. Traffic management

Through simple rule configuration and traffic routing, Istio can control traffic and API calls between services. Istio simplifies the configuration of service-level properties such as circuit breakers, timeouts, and retries, and makes it easy to set up important tasks such as A/B testing, canary deployment, and staged rollout based on percentage traffic splits.

Istio also has out-of-the-box failure recovery features that help you catch problems before they escalate, making calls between services more reliable.
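As an illustration of percentage-based traffic splitting, here is a minimal sketch of weighted routing in Python. In Istio this is configured declaratively (for example, via route weights in a VirtualService) rather than coded by hand; the function below only models the underlying idea, and its name and signature are invented for this example.

```python
import random


def route(weights, rng=random):
    """Pick a service version by percentage weights, e.g. a 90/10 canary
    split. 'weights' maps version name -> percentage; values must sum to 100.
    Illustrative model of weighted routing, not Istio code."""
    assert sum(weights.values()) == 100, "weights must sum to 100"
    roll = rng.uniform(0, 100)
    cumulative = 0
    for version, pct in weights.items():
        cumulative += pct
        if roll < cumulative:
            return version
    return version  # guard against the floating-point edge at exactly 100
```

With `route({"v1": 90, "v2": 10})`, roughly one request in ten reaches the canary version, which is the essence of a percentage-based rollout.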

2. Security

Istio has powerful security features that free developers to focus on application-level security. Istio provides the underlying secure communication channel and manages authentication, authorization, and encryption of service traffic at scale. With Istio, service communication is secure by default, and policies can be enforced consistently across protocols and runtimes, all with little or no change to applications.

While Istio is platform-independent, it is even more powerful when combined with Kubernetes network policy, including the ability to secure pod-to-pod and service-to-service communication at both the network and application layers. Later sections describe how to combine Kubernetes network policy with Istio to protect services jointly.

3. Observability

Istio’s powerful tracing, monitoring, and logging capabilities give you deep insight into your service mesh deployment. Its monitoring features let you truly understand how service performance affects upstream and downstream functionality, while its custom dashboards provide visibility into the performance of all services and how that performance affects other processes.

Istio’s Mixer component is responsible for policy control and telemetry collection. It provides back-end abstraction and mediation, isolating the rest of Istio from the implementation details of individual back-end infrastructures, and gives operators fine-grained control over all interactions between the mesh and those back ends.

All of these capabilities let you set, monitor, and enforce service-level objectives (SLOs) more effectively. Most importantly, problems can be detected and fixed quickly and efficiently.

4. Platform support

Istio is platform-independent and designed to run in a variety of environments, including across clouds, on-premises, on Kubernetes, on Mesos, and more. You can deploy Istio on Kubernetes, or on Nomad with Consul. Istio currently supports:

  • Services deployed on Kubernetes;
  • Services registered with Consul;
  • Services running on individual virtual machines.

5. Integration and customization

Istio’s policy enforcement components can be extended and customized to integrate with existing solutions for ACLs, logging, monitoring, quotas, auditing, and more.

In addition, starting with version 1.0, Istio supports MCP (Mesh Configuration Protocol) for configuration distribution. With MCP, it is easy to integrate external systems; for example, you can implement your own MCP server and plug it into Istio. Such an MCP server provides two main functions:

  • Connect to and monitor external service registration systems (such as Eureka or ZooKeeper) to obtain up-to-date service information;
  • Convert that external service information into Istio ServiceEntry resources and publish them via MCP.
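To illustrate the second function, here is a sketch of the translation step such an MCP server might perform, producing a dict shaped like an Istio ServiceEntry resource. The helper name and signature are assumptions for this example, and a real MCP server serves resources over the MCP gRPC protocol rather than returning Python dicts.

```python
def to_service_entry(name, hosts, endpoints, port=80):
    """Convert a record from an external registry (e.g. ZooKeeper or Eureka)
    into a dict shaped like an Istio ServiceEntry resource. A sketch of what
    an MCP server's translation step might produce; not a complete resource."""
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "ServiceEntry",
        "metadata": {"name": name},
        "spec": {
            "hosts": hosts,                       # names callers use in the mesh
            "location": "MESH_INTERNAL",          # treat as part of the mesh
            "resolution": "STATIC",               # endpoints listed explicitly
            "ports": [{"number": port, "name": "http", "protocol": "HTTP"}],
            "endpoints": [{"address": addr} for addr in endpoints],
        },
    }
```

An MCP server would run this translation whenever the external registry reports a change, keeping the mesh’s view of those services current.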

Why Istio

Istio addresses the challenges developers and operators face in the transition from monolithic applications to distributed microservice architectures. As service meshes grow in size and complexity, they become ever harder to understand and manage, with requirements ranging from service discovery, load balancing, failover, and metrics collection and monitoring to more complex operations such as A/B testing, canary releases, rate limiting, access control, and end-to-end authentication. Istio provides a complete solution to the diverse needs of microservice applications by offering behavioral insight into, and operational control over, the entire service mesh.

Istio provides a simple way to add load balancing, service-to-service authentication, monitoring, and more to a network of deployed services, with little or no change to service code. To bring a service under Istio, you simply deploy a special sidecar proxy alongside it; the Istio control plane configures and manages the proxies, which intercept all network traffic between microservices.

The enterprise service bus (ESB) of service-oriented architecture (SOA) bears some resemblance to a service mesh: an ESB is transparent to application services in an SOA architecture, meaning applications are unaware of it. A service mesh achieves similar behavior; it too should be transparent to applications and, like an ESB, simplify calls between services. The ESB, however, also covers concerns such as interaction-protocol mediation, message transformation, and content-based routing, and a service mesh is not responsible for everything an ESB does. A service mesh does provide resilience for service requests through retries, timeouts, and circuit breakers, along with service discovery and load balancing, but complex business transformation, business process orchestration, business process exception handling, and service orchestration are not its domain. And in contrast to the centralized ESB, the service mesh’s data plane is highly distributed, with its proxies co-located with the applications, eliminating the single-point-of-failure bottlenecks common in ESB architectures.

It is equally important to be aware of the problems a service mesh does not solve. Service mesh technologies like Istio provide powerful infrastructure capabilities that touch many areas of a distributed architecture, but they certainly will not solve every problem you encounter. An ideal cloud architecture separates concerns across its layers of implementation.

At the lower infrastructure level, the focus is on how to provide automated deployment capabilities, helping you deploy code to platforms such as containers, Kubernetes, or virtual machines. Istio does not dictate which deployment automation tools you should use.

At the higher, business-application level, application business logic is the differentiating asset with which an enterprise maintains its core competitiveness. This code determines which single-purpose business services are invoked, in what order they execute, how they interact, how their results are aggregated, and what to do when failures occur. Istio does not implement or replace any business logic: it does not perform service orchestration, does not transform or enrich the content of business payloads, and does not split or aggregate workloads. Those functions are best left to the libraries and frameworks in your application.

The following figure shows the separation of concerns in a cloud native application, where Istio supports the application layer and sits above the lower-level deployment layers. Istio acts as the link between the deployment platform and the application code. Its role is to lift complex network logic out of applications: it can perform content-based routing using external metadata carried with the request, such as HTTP headers, and it enables fine-grained traffic control and routing based on metadata matched against services and requests. You can also secure transport and offload security-token validation, and enforce quotas and usage policies defined by service operators.

Understanding Istio’s capabilities, its similarities to other systems, and its place in the architecture is critical to avoiding the mistakes we may have made with other promising technologies in the past.

Maturity and level of support

The Istio community assigns each component feature a phase based on its relative maturity and support level, using Alpha, Beta, and Stable to describe the respective states, as shown in Table 1-1.

Table 1-2 excerpts the Istio 1.4 features that have reached the Beta and Stable stages. This information is updated with each release; refer to the official website for the latest status.

Some Istio features are still Alpha, such as the Istio CNI plug-in, which can replace the istio-init container to perform the same network setup without requiring Istio users to request Kubernetes RBAC authorization, and the ability to use custom filters in Envoy. These features are expected to mature and become production-ready in subsequent releases.

Conclusion

Istio is currently the most popular implementation in the service mesh space. Its capabilities simplify the operation and maintenance of cloud native service architectures in hybrid environments. Istio lets developers focus on building service functionality in their preferred programming languages, which effectively increases developer productivity and keeps the code for solving distributed-system problems out of business code.

Istio is a fully open development project with a vibrant, open, and diverse community. Its goal is to empower developers and operators to release and maintain microservices agilely in any environment, with complete visibility into the underlying network and consistent control and security. In the rest of the book, we will show how to leverage Istio’s capabilities to run microservices in the cloud native world.

This article is excerpted from Istio Service Mesh Technology Analysis and Practice, written by Wang Xining, a senior technical expert at Alibaba Cloud. The book explains Istio’s underlying principles and hands-on development in detail, with many selected examples and downloadable reference code to help you get started with Istio quickly. Gartner believes that by 2020 the service mesh will be standard technology in all leading container management systems. The book is suitable for all readers interested in microservices and cloud native.

Recommended reading:

[1] Alibaba Cloud Service Mesh (ASM) public beta series, part 1: a quick look at what ASM is

Article link: yq.aliyun.com/articles/74…

[2] ASM extension capabilities, part 1: adding HTTP request headers in ASM with EnvoyFilter

Article link: yq.aliyun.com/articles/74…
