A while ago I went to QCon Shenzhen 2020 to see and hear what the big companies are doing these days. The backend mainstream is all discussing cloud native, and Meituan shared their practice with Service Mesh architecture, which I found very valuable, so I wrote this blog post to record it
Evolution of Service Mesh
A service usually starts out as a simple monolithic application: one application provides all the necessary services, with every module bundled inside it. As requirements iterate, the number of modules grows, more and more functionality is packed in, and the modules become strongly coupled to each other. As a result, a small change made later on can affect the entire application and cause a production incident
At this stage of development, the application is usually split into services: from a business perspective, related functionality is grouped together into separate services. These services typically still need to call each other, so what used to be direct calls between modules inside a single application now become RPC calls between services. On this premise, several higher-dimensional problems arise at the network level that the existing OSI 7-layer network model does not solve
Service registration and service discovery
For services to call each other, they must first be able to find each other, much like a DNS lookup: resolving a name to concrete endpoints (IP + port). So the first thing needed is a DNS-like service acting as a registry. When a service instance comes up or is destroyed, the registry must be able to perceive the event, so that in the second step, service discovery, a caller can accurately obtain the currently available endpoints. The usual process is as follows:
- The callee registers its own address with the registry
- The caller asks the registry for the callee's address list
- The caller picks an endpoint from that list and initiates the request

The most direct approach is to use a DNS server as the registry, but DNS responses are usually cached and not updated in real time. Moreover, besides the "register" action, a registry also needs a "deregister" action: if a node goes down but DNS still keeps its record, callers will keep failing when they try to reach it. In general, DNS is not well suited to this job, so another solution is needed
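The registry's three core actions (register, deregister, lookup) can be sketched in a few lines of Python. This is only an illustrative in-memory model with made-up names, not the API of any real registry; systems like ZooKeeper, etcd, or Consul add leases, health checks, watches, and consistency guarantees on top of this idea:

```python
class Registry:
    """Minimal in-memory service registry (illustrative only)."""

    def __init__(self):
        self._services = {}  # service name -> set of "ip:port" endpoints

    def register(self, name, endpoint):
        # Called by a service instance when it starts up.
        self._services.setdefault(name, set()).add(endpoint)

    def deregister(self, name, endpoint):
        # Called (or triggered by a failed health check) when an instance
        # goes away, so callers never receive a stale address.
        # This is exactly the step that plain DNS lacks.
        self._services.get(name, set()).discard(endpoint)

    def lookup(self, name):
        # Called by a client to discover the currently live endpoints.
        return sorted(self._services.get(name, set()))
```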
Load balancing
As you can see above, the callee may have multiple instances distributed across different hosts. In the real world, to avoid a single point of failure, multiple instances are almost always deployed to serve traffic at the same time. This means the address a caller obtains is not a single endpoint but a list of them. How should the caller select one endpoint from this list to send the request to? The caller needs load-balancing capabilities, such as round-robin (RR) polling or random selection
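The two strategies just mentioned, round-robin and random, can be sketched as follows (the class and function names are my own, purely for illustration):

```python
import itertools
import random

class RoundRobinBalancer:
    """Round-robin (RR polling): cycle through the endpoints in order."""

    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def pick(self):
        return next(self._cycle)

def random_pick(endpoints):
    """Random selection: pick any endpoint with equal probability."""
    return random.choice(endpoints)
```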
Rate limiting and circuit breaking
In a microservice cluster, each service is responsible for a different function, and a single request may pass through multiple services. If somewhere along the way one microservice hits a performance bottleneck, its service slows down, or it simply crashes once the volume of concurrent requests rises, and that service sits on a critical path, then all upstream requests will fail. We therefore need a protection mechanism that can degrade gracefully when appropriate: for example, tripping a circuit breaker for a service once it reaches its performance bottleneck, rejecting traffic at its entrance, and adding the service back once it is available again
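A minimal circuit-breaker sketch of the behavior described above, assuming a simple model: open the circuit after N consecutive failures, then allow one probe call after a cooldown. The threshold, cooldown, and injectable clock are illustrative choices, not any particular library's API:

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: after `threshold` consecutive failures the
    circuit opens and calls are rejected immediately; once `cooldown`
    seconds have passed, one trial call is let through to probe recovery."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Open: reject until the cooldown has elapsed, then allow a probe.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        # A successful call closes the circuit again.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```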
Security
In an untrusted cluster, encrypted communication between services may also be required, which implies identity authentication and key exchange between the two parties. Identity authentication requires an authoritative security center to issue certificates; otherwise a man-in-the-middle could eavesdrop. How far to take this depends on the security level the services require
These problems are collectively referred to as service governance, and service governance has basically gone through three stages
Stage one: governance logic inside the application
At this stage, besides the business code, the application itself also contains the code for the governance logic: the application solves service discovery, load balancing, and so on by itself. This approach has the following problems:
- Business code is strongly coupled to governance logic: while implementing business features, developers must also worry about governance concerns instead of writing pure business code
- There is a lot of duplicated code: every service in the cluster has to implement the same set of governance logic, which is hard to maintain; when a bug is found, every service has to be changed
Stage two: governance logic split into an SDK
To solve the coupling problem, teams abstract this logic into an SDK library that the business code calls; to the business, the SDK is like a black box. At the code level this does achieve separation of business and governance, but introducing an SDK brings new problems
- Strong language binding: the SDK must be adapted to each language, so every language needs its own implementation of an SDK with the same functionality
- SDK versions are hard to manage: the SDK itself goes through upgrade iterations, and business teams are usually inert, so you cannot expect them to upgrade the SDK proactively. As a result, many SDK versions with inconsistent functionality coexist in the cluster, which is difficult for the SDK maintainers to manage
- There is still code-level intrusion: although the SDK is nominally independent, the business code still has to adapt to it, so service governance cannot be made completely transparent to the business
Stage three: governance logic split into a separate process
The experience of the first two stages leads to a conclusion: if we want business development to stay carefree, the service governance logic must keep sinking downward, ideally becoming completely transparent to the business. The ideal is that the business needs no special logic at all: it simply sends an HTTP request to a service and gets back the expected response
Take the development of TCP as an example. In the earliest days, before TCP/IP was worked out, machines might be connected directly (perhaps not even over a network, just via an RS232 serial cable), and a data stream sent by machine A was received directly by machine B. Later, as communication reached further, network problems began to emerge: data could be delayed, packets could be lost or arrive out of order, and no matter how capable both machines were, one end could send desperately fast while the other struggled to keep up. So a strategy was needed to solve at least the following problems:
- Both sides need acknowledgement logic so the sender can determine whether the peer actually received the data
- There needs to be a timeout-retransmission mechanism to cope with packet loss in the network, so that neither side waits indefinitely
- There needs to be a congestion control mechanism to determine how fast the sender may push data to the other side
- The ordering of packets must be guaranteed
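The first two points, acknowledgements plus timeout-retransmission, can be sketched as a stop-and-wait loop. This is a toy model to show the idea, not TCP's actual algorithm (TCP uses sliding windows, adaptive retransmission timeouts, and much more); the function name and `transmit` callback are made up for the example:

```python
def send_reliably(packet, transmit, max_retries=5):
    """Stop-and-wait sketch: transmit the packet, wait for an ACK,
    and retransmit on timeout.

    `transmit` stands in for one send-and-wait round trip: it returns
    True when an ACK arrives in time, False on a (simulated) timeout
    or packet loss."""
    for attempt in range(1, max_retries + 1):
        if transmit(packet):
            return attempt  # number of tries it took to get an ACK
    # Give up after max_retries rather than waiting forever.
    raise TimeoutError("no ACK after %d attempts" % max_retries)
```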
Doesn't this look a lot like the problems TCP solves? Yes. Because communication between machines is such a universal need, TCP/IP was worked out and made part of the network communication infrastructure, and the business layer no longer has to solve this class of network problems itself. This is exactly a sinking of logic by layers: the business keeps developing upward, while the infrastructure sinks downward and shields the upper layers from the details
Is there a chance that the service governance logic of the microservices era can likewise sink into the infrastructure, so that the application layer no longer needs to care about these things? The OSI 7-layer model has no higher layer to offer, so we have to sink the governance logic artificially, masking the underlying details from the business layer as much as possible
This is where stage three is headed: an independent governance process
By separating the governance logic into its own process, which hijacks the inbound and outbound traffic of the service process and implements the governance logic there, the service process stays completely unaware of the governance process's existence. This bypass process is the sidecar, also known as the proxy
If we then manage all the proxies in the cluster, let the proxies take over inter-service communication entirely, and have a control center direct these proxies, we obtain a service governance network. This network is called a Service Mesh
The Service Mesh solves the following problems:
- It is completely transparent to the business layer; the business does not perceive the mesh
- The control center of the mesh and the proxy on each application node jointly implement service governance, including service registration, service discovery, circuit breaking, load balancing and other functions, and even advanced features such as dynamic routing, monitoring, and call-chain tracing
Istio architecture
Istio is a popular Service Mesh implementation. It is backed by Google, so Istio integrates deeply with K8s; coming from the same camp, Istio reuses K8s functionality wherever it can rely on it, building the Service Mesh on top of the K8s ecosystem. Istio can of course also be integrated with other base platforms, but it works best with K8s. The following is the diagram from Istio's official website
Istio's capabilities can be summarized into the following four areas
Traffic management
- Implements load balancing and dynamic routing, and supports advanced governance capabilities such as grayscale (canary) releases and fault injection
Observability
- Basic access logs and metrics across various traffic dimensions are the baseline. Beyond that, call-chain tracing is supported: Istio can attach span and trace IDs to requests flowing through the proxy and then connect to a tracing backend such as Zipkin, so call-chain tracing works out of the box without the business layer having to do any development
Policy enforcement
- Access policies such as circuit breaking and rate limiting can be enforced
Security
- Communications security
Istio covers almost all of the microservice governance requirements discussed above
Like K8s, Istio has two layers: a control plane and a data plane
On the data plane, Istio automatically injects its proxy container (built on the Envoy component) into the Pod when a service is created. The injection is imperceptible to the business layer, and the proxy container hijacks the inbound and outbound traffic of the Pod's business container via iptables redirection
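Conceptually, that redirection works roughly like the following iptables rules. This is a simplified sketch of the idea only, not the exact rules the istio-init container installs: real Istio sets up a more elaborate set of custom ISTIO_* chains, and newer versions capture inbound traffic on port 15006 while outbound goes to 15001:

```shell
# Redirect the Pod's inbound TCP traffic to the proxy's listener port
iptables -t nat -A PREROUTING -p tcp -j REDIRECT --to-ports 15001
# Redirect outbound traffic from the business container the same way
iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-ports 15001
```

Because these rules live in the Pod's own network namespace, the business container's code needs no change at all, which is exactly the transparency described above.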
On the control plane, Istio's control center is composed of Pilot (the controller), Citadel (security-related functions), and Galley (configuration management). Older versions also had a Mixer component, mainly responsible for telemetry collection (the proxies reported various metrics to Mixer) and for policy control. Due to performance bottlenecks, this component has been abandoned in newer versions of Istio, and its capabilities have been pushed down to the proxies on the data plane
From K8s's perspective, Istio complements K8s in the form of a plug-in: K8s itself only implements the base layer, such as deployment operations and basic service discovery, while Istio sits as an intermediate platform layer between K8s and the business layer, implementing the service governance capabilities described above
Istio service governance rules
Istio provides four logical models
- Gateway: serves as the gateway of the cluster; external traffic accesses internal services through this gateway
- VirtualService: routing rules, such as routing by path or by HTTP cookie. It is analogous to the server configuration in Nginx, where you can define all kinds of location match rules
- DestinationRule: policies for the traffic after routing, such as filtering target nodes and setting the load-balancing mode. It can be understood as the upstream configuration in Nginx: server/location matches a rule and sends the request to a specified upstream, and the DestinationRule corresponds to the various control knobs inside upstream, such as the backend server list, the maximum number of connections, etc
- ServiceEntry: the egress through which services in the cluster request external services. (You can also let the proxy pass outbound connections straight through without it)
Based on the VirtualService + DestinationRule rules above, various release forms can be achieved, such as blue-green deployment, canary deployment, A/B testing, etc
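As an illustration, a hypothetical canary release might combine the two resources like this. The service name reviews, the version labels, and the 90/10 weights are all made-up values for the example:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:        # 90% of traffic stays on the stable subset
            host: reviews
            subset: v1
          weight: 90
        - destination:        # 10% goes to the canary subset
            host: reviews
            subset: v2
          weight: 10
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN    # load-balancing mode for this destination
  subsets:                   # subsets map to Pod labels
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
```

Shifting the weights step by step (90/10, then 50/50, then 0/100) is exactly the canary flow, with no change to the business code.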
Istio also provides monitoring capabilities. The backend can be connected to various monitoring systems, such as Prometheus, to implement monitoring and alerting without any involvement from the business layer
Istio's service discovery relies on K8s: Istio listens for events from kube-apiserver via list/watch, so Istio does not need to implement service registration itself
Finally, I have recently been reading a book about Istio published by Huawei Cloud, and I will follow up with more detailed content once I finish it. I was interrupted by work while writing tonight and have lost my train of thought, so I will pause here and leave this as a first introduction to Istio
During the learning process, I referred to the following materials: Pattern: Service Mesh; Istio entry-level training (Huawei Cloud CloudNative series training course): CloudNative Service Mesh Istio
My original post is on medium.com/heshaobo20…