OCTO is Meituan's internal service governance platform, comprising the service communication framework, naming service, service data center, user management platform, and other components. It provides a complete set of service governance solutions and a unified service governance experience for all services in the company. In previous articles, we covered the construction of the OCTO platform from different perspectives, including:

  • “Challenges and Evolution of Meituan Naming Service” introduces the motivation, design, and results of evolving MNS from 1.0 to 2.0, as well as the value MNS delivers to the business as a platform component and some achievements in driving business upgrades.
  • “Technical Analysis of Meituan's Trillion-level OCTO Data Center Computing Engine” introduces the evolution and architecture of the OCTO data center computing engine developed by Meituan, and details the technical solutions used to improve computing capacity and throughput while reducing operation and maintenance costs.
  • “Open-sourcing the Core Components of OCTO, a Large-scale Microservice Communication Framework and Governance System” introduces the open-sourcing of OCTO's core components, such as OCTO-RPC (Java communication framework), OCTO-NS (naming service), and OCTO-Portal (user management platform).
  • “Challenges and Practices of Implementing Service Mesh in Complex Environments” introduces, from a higher-level perspective, the development history, current difficulties, and optimization ideas of service governance systems, and analyzes in depth the implementation of Service Mesh in large-scale, complex service scenarios.

This article continues the story of the OCTO system's evolution toward Service Mesh, introducing in detail the design ideas behind each technical scheme from the data plane's perspective.

1 Overall Architecture

Let’s take a look at the overall architecture of OCTO 2.0, as shown below:

Infrastructure refers to the existing service governance system, OCTO 1.0, including MNS, KMS (authentication management service), MCC (configuration management center), Rhino (circuit breaking and rate limiting service), and so on. These systems are integrated with the OCTO 2.0 control plane to avoid the unnecessary cost of excessive refactoring.

The OCTO 2.0 control plane does away with community Istio and is completely self-developed, while the data plane is a modified version of open-source Envoy. The O&M system is responsible for upgrading and releasing the data plane components. For the overall technology selection and its rationale, please refer to the article “Exploration and Practice of OCTO 2.0, Meituan's Next-generation Service Governance System.”

OCTO 2.0 functions are divided into two dimensions: Mesh and service governance. Here, we focus on several common problems, such as traffic hijacking, lossless hot restart, and service routing, and explain how Meituan solves them by leveraging its existing, relatively complete service governance system.

2 Mesh Functions

2.1 Traffic Hijacking

OCTO 2.0 does not adopt Istio's native solution of using iptables to hijack traffic in and out of the Pod. Two factors were considered:

  1. iptables itself suffers from high performance overhead and poor manageability:
  • iptables defines five hook points in the kernel's packet-processing path, each corresponding to a chain of rules. Outbound traffic traverses the protocol stack twice and is matched against the five rule chains on each pass.
  • iptables rules take effect globally, modification of the relevant rules cannot be explicitly prohibited, and there is no ACL mechanism, so controllability is poor.
  2. Using iptables in Meituan's environment introduces further problems:
  • The HULK container is a "rich container": the business process and all other basic components run in the same container. These components use a wide variety of ports, so iptables-based interception is error-prone.
  • Meituan runs services on physical machines, VMs, and containers, and an iptables-based traffic hijacking scheme is highly complex to operate across all of these environments.

In view of the above problems, a direct Unix Domain Socket (UDS) connection was finally adopted to forward traffic between the service process and OCTO-Proxy.

On the service consumer side, the service process uses the lightweight Mesh SDK to establish a connection to the UDS address that OCTO-Proxy listens on. On the service provider side, OCTO-Proxy listens on the TCP port on behalf of the service process, while the service process listens on a specified UDS address.

The advantage of this scheme is that a Unix Domain Socket offers better performance and lower O&M cost than iptables hijacking; the disadvantage is that a lightweight Mesh SDK is required in the business process.
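As a minimal sketch (not OCTO's actual SDK), the consumer side of this UDS direct connection might look like the following in Go; the socket path is a hypothetical placeholder:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

// proxyUDSPath is a hypothetical socket path; the real address would be
// agreed between the Mesh SDK and the local OCTO-Proxy instance.
const proxyUDSPath = "/var/run/octo-proxy/outbound.sock"

func main() {
	// The Mesh SDK side: dial the UDS address that the local proxy listens on,
	// instead of relying on iptables to redirect an ordinary TCP connection.
	conn, err := net.DialTimeout("unix", proxyUDSPath, 2*time.Second)
	if err != nil {
		log.Fatalf("dial octo-proxy over UDS: %v", err)
	}
	defer conn.Close()

	// From here the SDK would frame its RPC protocol as usual; the proxy
	// forwards the request to the remote provider over TCP.
	if _, err := conn.Write([]byte("ping")); err != nil {
		log.Fatalf("write request: %v", err)
	}
	buf := make([]byte, 128)
	n, _ := conn.Read(buf)
	fmt.Printf("response: %q\n", buf[:n])
}
```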

2.2 Service Subscription

The native CDS/EDS request follows a full-service-discovery pattern: the data plane requests the entire list of services in the system. Because a large cluster contains far more services than any single proxy actually needs, the pattern has to be transformed into on-demand service discovery, requesting only the node lists of the back-end services that will actually be accessed.

After the service process starts, it sends a subscription request to OCTO-Proxy over HTTP. OCTO-Proxy adds the requested back-end service's AppKey to its xDS subscription, and xDS then requests the corresponding service resources from the control plane.

To increase the robustness of the whole process and reduce subsequent O&M costs, we made several adaptations. For example, because OCTO-Proxy may start more slowly than the service process, retry logic was added to the Mesh SDK; the HTTP requests between the Mesh SDK and OCTO-Proxy were made synchronous to prevent problems caused by delayed resource delivery from Pilot; and the Mesh SDK persists its subscription information to a local file so that it can be replayed after an OCTO-Proxy hot restart or failure restart.
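A minimal sketch of the subscription call with retry, assuming a hypothetical local endpoint and payload format (the real protocol between the Mesh SDK and OCTO-Proxy is not published):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// subscribeURL and the request body are hypothetical; they only illustrate
// the on-demand subscription flow described above.
const subscribeURL = "http://127.0.0.1:15010/subscribe"

type subscribeRequest struct {
	AppKeys []string `json:"appkeys"` // back-end services this process will call
}

// subscribe sends the AppKey list to the local OCTO-Proxy, retrying because
// the proxy may start more slowly than the business process.
func subscribe(appKeys []string) error {
	body, _ := json.Marshal(subscribeRequest{AppKeys: appKeys})
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		resp, err := http.Post(subscribeURL, "application/json", bytes.NewReader(body))
		if err == nil && resp.StatusCode == http.StatusOK {
			resp.Body.Close()
			return nil
		}
		if err == nil {
			resp.Body.Close()
			lastErr = fmt.Errorf("unexpected status %d", resp.StatusCode)
		} else {
			lastErr = err
		}
		time.Sleep(time.Duration(attempt+1) * time.Second) // simple backoff
	}
	return lastErr
}

func main() {
	// The AppKey below is a placeholder for illustration only.
	if err := subscribe([]string{"com.sankuai.demo.provider"}); err != nil {
		fmt.Fprintln(os.Stderr, "subscribe failed:", err)
		os.Exit(1)
	}
	// Persisting the AppKeys to a local file (not shown) lets the subscription
	// be replayed after an OCTO-Proxy hot restart or crash recovery.
}
```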

2.3 Lossless Hot Restart

2.3.1 Traffic Loss Scenario

Providing continuous, uninterrupted service during upgrades, without losing traffic and without the business even noticing, is a major challenge for every basic component. It matters even more for OCTO-Proxy, which sits on the critical path of traffic. Community Envoy already supports hot restart, but the support is incomplete: in some scenarios traffic can still be lost.

The following figure shows how traffic can be lost during an OCTO-Proxy hot restart in the short-connection and long-connection scenarios, respectively.

For short connections, all new connections are created on the new OCTO-Proxy, and existing connections on the old OCTO-Proxy are closed once their responses arrive. The old OCTO-Proxy's short connections are thus gradually drained; when no connections remain, the old OCTO-Proxy exits and the new OCTO-Proxy takes over. Throughout this process, traffic is completely lossless.

In the long-connection mode, the SDK maintains a long-lived connection to the old OCTO-Proxy and keeps sending requests over it. When the old OCTO-Proxy process finally exits, that connection is closed from the proxy side, so some responses may never be returned and the client's requests time out. Envoy's hot restart support for long-connection scenarios is therefore imperfect.

To keep services uninterrupted while underlying components are upgraded, the industry's dominant approach is the rolling update: instances are taken out of service in batches, upgraded, and then put back, until every instance in the cluster runs the new version. Traffic is drained from each instance before its upgrade, so no service traffic is lost.

Because we must remain compatible with physical machines, virtual machines, and containers, the K8s rolling upgrade mechanism cannot be applied uniformly. In this environment, ensuring the high availability of the service governance system and the high reliability of services is therefore quite difficult.

2.3.2 Adaptation Scheme

The current scheme divides a business service into two roles: the Server role, which provides services to others, and the Client role, which initiates requests to others. Hot-update support is handled differently for each role.

Hot update of the Client-side OCTO-Proxy: once the old OCTO-Proxy enters the hot-restart state, it answers any subsequent new request directly with a response carrying a "hot restart" flag. When the Client SDK receives such a response, it should proactively switch to a new connection and retry the request, which avoids the problem of the old connection held by the Client SDK being forcibly closed when the old OCTO-Proxy finishes its hot restart. Note that the Client SDK still needs to properly handle any responses that remain in flight on the old connection.

This interaction between the Client SDK and OCTO-Proxy keeps traffic safe during an OCTO-Proxy upgrade.

Hot update of the Server-side OCTO-Proxy: after the hot restart starts, the Server-side OCTO-Proxy sends a ProxyRestart message to the Client-side OCTO-Proxy, telling it to establish a new connection. This prevents the old connection held by the Client-side OCTO-Proxy from being forcibly closed once the Server-side OCTO-Proxy completes its hot restart. As soon as the Client-side OCTO-Proxy receives this "switch to a new connection" notification, it opens a new connection and properly handles any responses still outstanding on the old one (to be conservative, it marks the old connection as unavailable and keeps it around for a period of time, by default OCTO-Proxy's drainTime).
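A minimal sketch of the client-side retry behavior described above, with a hypothetical response flag and framing (the real OCTO protocol and SDK internals are not shown here):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// response is a simplified stand-in for an OCTO RPC response frame; the
// hotRestart flag name is hypothetical.
type response struct {
	hotRestart bool
	payload    []byte
}

// proxySocket is a hypothetical UDS path of the local OCTO-Proxy.
const proxySocket = "/var/run/octo-proxy/outbound.sock"

func dialProxy() (net.Conn, error) {
	return net.DialTimeout("unix", proxySocket, 2*time.Second)
}

// roundTrip stands in for the SDK's real request/response framing; here it
// just echoes the request so the sketch stays self-contained.
func roundTrip(c net.Conn, req []byte) (response, error) {
	// A real SDK would write the framed request and decode the reply,
	// including the "hot restart" flag set by a draining proxy.
	return response{hotRestart: false, payload: req}, nil
}

// call sends one request; if the old proxy answers with the "hot restart"
// flag, the SDK proactively switches to a new connection and retries once,
// instead of waiting for the old connection to be closed underneath it.
func call(c *net.Conn, req []byte) (response, error) {
	resp, err := roundTrip(*c, req)
	if err != nil {
		return response{}, err
	}
	if resp.hotRestart {
		(*c).Close()
		nc, err := dialProxy()
		if err != nil {
			return response{}, err
		}
		*c = nc
		return roundTrip(nc, req)
	}
	return resp, nil
}

func main() {
	c, err := dialProxy()
	if err != nil {
		fmt.Println("no local proxy in this sketch:", err)
		return
	}
	defer c.Close()
	resp, _ := call(&c, []byte("ping"))
	fmt.Printf("payload: %s\n", resp.payload)
}
```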

2.4 Data Plane Operation and Maintenance

2.4.1 LEGO Operation and Maintenance Scheme

In cloud-native environments, Envoy runs inside a standard K8s Pod, typically as a separate sidecar container. The capabilities provided by K8s can then be used to manage the Envoy sidecar container: container injection, health checks, rolling upgrades, resource limits, and so on.

The container runtime mode used internally at Meituan is the "single-container mode": a Pod contains only one container (not counting the pause container). Because the business process and all basic components run inside that one container, only process-granularity management is possible, not container-granularity management. We therefore solved OCTO-Proxy's O&M problem with our self-developed LEGO platform.

We customized the lego-agent to support the OCTO-Proxy hot restart process. The lego-agent is also responsible for OCTO-Proxy's health checks, restarts on failure, monitoring data reporting, and version releases. Compared with a native K8s container restart, a process-granularity restart is much faster.
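As an illustration only (not LEGO's actual implementation), a process-granularity health-check and restart loop might look like the sketch below; the admin endpoint, binary path, and flags are assumptions:

```go
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

// healthURL is a hypothetical local admin endpoint exposed by OCTO-Proxy.
const healthURL = "http://127.0.0.1:15000/healthz"

func healthy() bool {
	client := &http.Client{Timeout: time.Second}
	resp, err := client.Get(healthURL)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// restartProxy relaunches the proxy process; the binary path and flag here
// are placeholders for however the real agent would do this.
func restartProxy() error {
	// Restart at process granularity: no container or Pod is recreated,
	// which is why this is much faster than a K8s container restart.
	return exec.Command("/usr/local/bin/octo-proxy", "--restart-epoch", "0").Start()
}

func main() {
	failures := 0
	for range time.Tick(5 * time.Second) {
		if healthy() {
			failures = 0
			continue
		}
		failures++
		if failures >= 3 { // tolerate transient failures before acting
			log.Println("octo-proxy unhealthy, restarting process")
			if err := restartProxy(); err != nil {
				log.Println("restart failed:", err)
			}
			failures = 0
		}
	}
}
```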

Another advantage of the LEGO system is that it can support multiple runtime environments, including containers, virtual machines, and physical machines.

2.4.2 Cloud-native O&M Solution

We are also exploring a cloud-native O&M scheme for OCTO-Proxy. The biggest difference in this scenario is that operations shift from process granularity to container granularity, which decouples the proxy from the business application more cleanly and lets us benefit from concepts such as immutable infrastructure. The problem it introduces, however, is how to manage hot restarts at container granularity.

Therefore, a self-developed Operator is used to manage the full life cycle of the OCTO-Proxy container, and hot restarts of OCTO-Proxy are expected to be performed through this Operator. The intended process is as follows:

However, this scheme ran into problems during implementation, because K8s by design does not support adding or removing containers in a running Pod. Implementing it would require customizing K8s's underlying components, which introduces risk and diverges from the community. To stay compatible with the community, we revised the hot restart scheme and ultimately adopted a dual-resident-container hot restart scheme:

  1. When the Pod starts, OCTO-Proxy is given two containers instead of one: one in the normal working state and the other in a standby state.
  2. When OCTO-Proxy needs a hot restart, the standby container's image is changed to the latest OCTO-Proxy image, and the hot restart process begins.
  3. When the hot restart process ends, the new container enters the normal working state and the old container enters the standby state. Finally, the old container's image is changed back to the standby image, ending the hot restart process.

Because both containers reside in the Pod from the start, this scheme sidesteps the K8s restriction that containers cannot be added to or removed from a running Pod, and achieves container-granularity hot restarts of OCTO-Proxy inside the Pod.
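A minimal sketch of how an Operator might trigger step 2, updating the standby container's image with client-go; the namespace, Pod, container, and image names are hypothetical:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// updateStandbyImage switches the standby container to the new OCTO-Proxy
// image, which is what kicks off the dual-container hot restart flow.
func updateStandbyImage(ctx context.Context, cs kubernetes.Interface, ns, pod, container, image string) error {
	p, err := cs.CoreV1().Pods(ns).Get(ctx, pod, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for i := range p.Spec.Containers {
		if p.Spec.Containers[i].Name == container {
			p.Spec.Containers[i].Image = image // container image is a mutable Pod field
			_, err = cs.CoreV1().Pods(ns).Update(ctx, p, metav1.UpdateOptions{})
			return err
		}
	}
	return fmt.Errorf("container %q not found in pod %s/%s", container, ns, pod)
}

func main() {
	cfg, err := rest.InClusterConfig() // the Operator runs inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// All names below are placeholders for illustration only.
	err = updateStandbyImage(context.Background(), cs,
		"demo-ns", "demo-pod", "octo-proxy-standby", "octo-proxy:v2")
	if err != nil {
		log.Fatal(err)
	}
}
```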

3 Future Planning

Currently, thousands of services are connected to the OCTO 2.0 system, carrying billions of requests per day. As more services come on board, the requirements on stability and O&M capabilities grow. Understanding the detailed health of online OCTO-Proxy instances and the business systems behind them, and monitoring and recovering quickly when failures occur, are the focus of our recent OCTO 2.0 work. Beyond that, future plans for OCTO 2.0 include:

  • Exploring operation and maintenance, release, and stability construction in the cloud-native direction.
  • Supporting the meshing of internal HTTP services, reducing dependence on centralized gateways.
  • Supporting full-link mTLS.