This article is the operations chapter, the third installment of the Ant Financial Service Mesh Large-scale Implementation series. The series analyzes in detail the large-scale implementation practice of Service Mesh for Double Eleven from the perspectives of core, RPC, messaging, wireless gateway, control plane, security, operations, and testing. Previous articles in the series are listed at the end.

Introduction

Service Mesh is the core of Ant Financial's next-generation architecture and an important part of Ant Financial's internal evolution toward cloud native. This installment of the Service Mesh series covers operations and is written by Huang Jiaqi (alias: Jiaqi), an Ant Financial operations specialist and Service Mesh SRE who focuses on cloud-native infrastructure, middleware, and Service Mesh stability; he is also a Pythoner and the author of SOFA-Bolt-Python.

From an operations perspective, this article shares the challenges and evolution involved in rolling Service Mesh out at Ant Financial's current scale to support the Double Eleven promotion. It covers the cloud-native choices made and the problems they raised, the challenges to the resource model, the evolution of large-scale operations facilities, and the construction of the surrounding technical risk capabilities.

In 2019, Service Mesh was applied and implemented at Ant Financial on a large scale. To date, Ant Financial's Service Mesh data plane, MOSN, has been rolled out to hundreds of applications and hundreds of thousands of containers, making it the largest known Service Mesh cluster in the world. Its performance in the just-concluded Double Eleven was also impressive: peak RPC QPS reached tens of millions and peak message TPS reached millions, while the average RT increase introduced by Service Mesh was kept within 0.2 ms.

Embracing cloud native

In terms of software form, Service Mesh pulls middleware capabilities out of the application framework into independent software. In terms of deployment, the traditional approach is to run it as a separate process alongside the business process in the business container. At Ant Financial, we chose to embrace cloud native from the very beginning.

Sidecar mode

Running the mesh as an independent process inside the service container has the advantage of compatibility with the traditional deployment model, so it can be brought online quickly. However, an independent process intrudes into the business container and is hard to manage through images. The cloud-native Sidecar approach, by contrast, decouples Service Mesh operations from the business container and sinks middleware operational capability into the infrastructure: only the long-term stable JVM parameters related to the Service Mesh are kept in the business image, so onboarding requires nothing more than a few environment variables. At the same time, with operations evolving toward image- and container-oriented management, onboarding the Service Mesh also requires applications to be fully imaged, laying the groundwork for further cloud-native evolution.

  • Independent process. Pros: compatible with traditional deployment modes, low transformation cost, fast rollout. Cons: intrudes into the business container, hard to operate and maintain through images.
  • Sidecar. Pros: oriented toward the end state, operations decoupled from the business. Cons: depends on K8s infrastructure, high cost of transforming the operations environment, applications must be imaged.

After onboarding the Service Mesh, a typical Pod may contain multiple Sidecars:

  • MOSN: RPC Mesh, MSG Mesh… (expanding);
  • Other sidecars;

MOSN:github.com/sofastack/s…

These Sidecar containers share the same network namespace as the business container, so the business process can reach the services provided by the Service Mesh through local ports, preserving the same experience as the traditional approach.
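
To make the layout concrete, here is a minimal sketch of such a Pod expressed with the Kubernetes Go types. The image names, environment variables, and port are illustrative assumptions, not Ant Financial's actual configuration; the point is that the containers share one network namespace, so the business process reaches MOSN on 127.0.0.1.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// examplePod sketches a Pod with a business container and a MOSN Sidecar.
func examplePod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "app-with-mosn"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{
					// RPC Mesh / MSG Mesh Sidecar; it listens on local ports that
					// the business container reaches via 127.0.0.1 because all
					// containers in a Pod share one network namespace.
					Name:  "mosn",
					Image: "example.registry/mosn:latest",
					Ports: []corev1.ContainerPort{{ContainerPort: 12200}},
				},
				{
					Name:  "app",
					Image: "example.registry/business-app:latest",
					// Only a few environment variables are needed to point the
					// in-process framework at the Sidecar.
					Env: []corev1.EnvVar{
						{Name: "MOSN_ENABLE", Value: "true"},
						{Name: "RPC_PROXY_PORT", Value: "12200"},
					},
				},
			},
		},
	}
}
```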

Cloud-native support at the infrastructure level

At the infrastructure level, we advanced a cloud-native transformation in parallel to support the landing of Service Mesh.

Full imaging of applications

First, we pushed a comprehensive image transformation inside Ant Financial and completed the imaging of all containers of internal core applications. The transformation covered:

  • Added support for Service Mesh environment variables at the base image level;
  • Adapted Dockerfiles to the Service Mesh;
  • Promoted image transformation for legacy applications whose front-end static files and back end were managed separately;
  • Promoted push/pull distribution transformation for applications that make heavy use of front-end block releases;
  • Upgraded and replaced a large number of VM-mode containers;

Moving containers onto Pods

Beyond the image-level transformation, Sidecar mode also requires all business containers to run inside Pods so that multiple containers can share the network. Because upgrading the existing containers directly would have been costly to develop and error-prone, we instead replaced all of the tens of thousands of non-K8s containers belonging to the hundreds of onboarded applications with K8s Pods through large-scale scaling out and in.

After these two rounds of transformation, the cloud-native groundwork at the infrastructure level was in place.

Resource Evolution

The Sidecar pattern raises an important question: how should resources be allocated?

The ideal-ratio assumption

The original resource design was constrained by the reality that memory could not be oversold, so we made an assumption:

  • MOSN's baseline resource usage is proportional to the specification the business has selected.

CPU and memory for MOSN are requested as additional resources in proportion to the business container's; the ratio was eventually set at 1/4 of the CPU and 1/16 of the memory (for example, a business container requesting 4 CPU cores and 16 GiB of memory would add a Sidecar request of 1 core and 1 GiB).

Under this scheme, a typical Pod's quota becomes the business container's resources plus the proportional Sidecar share.

This approach poses two problems:

  1. Ant Financial already enforces quota control on business resources, but the Sidecar sits outside the business container, so the Service Mesh container became a point of resource leakage;
  2. The Service Mesh containers of some high-traffic applications ran short of memory and hit OOM.

The imperfection of a perfect partition

To quickly support the rollout of Service Mesh in non-cloud environments, we adopted in-place onboarding. In-place onboarding, however, cannot add resources to the Pod, so with memory impossible to oversell, a secondary partition of the existing quota was adopted: 1/16 of the Pod's memory goes to the Sidecar and 15/16 to the business container. Besides the two problems above, this created a new one:

  • The memory visible to the business changes, business monitoring deviates, and the business process risks OOM.

After discussion, we added another assumption:

  • The resources the Service Mesh container occupies are resources the business itself consumed before the mesh was introduced; onboarding the Service Mesh is therefore a resource substitution rather than an addition.

Resource sharing

Based on this assumption, in-Pod resource oversell was pushed through at the scheduling level. Under the new allocation scheme, the Service Mesh container's CPU and memory are oversold out of the Pod's existing quota, and the business container can still see all of the original resources.

To account for the Pod OOM risk introduced by memory oversell, the OOM score of the Sidecar container is adjusted so that when memory runs short, the Service Mesh process, which restarts far faster than the Java business process, is the one reclaimed first.
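
To illustrate the mechanism behind this adjustment (not Ant Financial's actual scheduling code): on Linux, a process's oom_score_adj biases the kernel OOM killer's choice of victim, so giving the Sidecar a higher value than the Java process means the Sidecar is reclaimed first when the Pod exhausts its memory. A minimal sketch with illustrative values:

```go
package main

import (
	"fmt"
	"os"
)

// setOOMScoreAdj biases the kernel OOM killer for the given process.
// Values range from -1000 (never kill) to 1000 (kill first).
func setOOMScoreAdj(pid, score int) error {
	path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
	return os.WriteFile(path, []byte(fmt.Sprintf("%d", score)), 0644)
}

func main() {
	// Illustrative only: in practice the adjustment is applied through the
	// container runtime / kubelet rather than hand-written code.
	if err := setOOMScoreAdj(os.Getpid(), 800); err != nil {
		fmt.Println("failed to adjust oom_score_adj:", err)
	}
}
```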

The new allocation scheme solved both problems and smoothly supported the multiple rounds of stress testing before the big promotion.

The reconstruction

However, by the time the new allocation scheme went live, Service Mesh had already been rolled out during elastic site building. We also found that in some scenarios the Service Mesh container could not preempt CPU, causing jitter in business RT; the cause was that in CPU Share mode the Pod did not, by default, allocate a corresponding CPU quota for the Sidecar.

Two risks therefore remained:

  • Sidecars allocated under the old scheme still carried OOM risk;
  • The Sidecar could not preempt CPU;

Replacing all the Pods again was no longer affordable. In the end, with support from the scheduling team, we reallocated the resources inside each Pod by manually recalculating them and patching the Pod annotations, fixing both risks. About 250,000 containers were repaired in total.

Operations challenges of change and scale

Service Mesh changes fall into two categories: access (onboarding) and upgrade. At the bottom of every change is an Operator component that picks up the markers the upper layer writes into Pod annotations and modifies the corresponding Pod Spec accordingly, a typical cloud-native approach. Given Ant Financial's current resource situation and operational needs, in-place access and smooth upgrade were also developed. Operator details will be covered in the Operator article of this series; please stay tuned to this account.
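
The annotation-driven flow can be sketched as follows. This is a generic illustration, not Ant Financial's Operator: the annotation key and image naming scheme are hypothetical, and the point is simply that the upper layer declares intent in Pod metadata and the controller converges the Pod Spec toward it.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// Hypothetical annotation written by the upper layer to request a Sidecar version.
const sidecarVersionAnnotation = "example.mesh/mosn-version"

// reconcileSidecar mutates the Pod spec so that it matches the requested
// Sidecar version: injecting MOSN for access, or updating its image for upgrade.
func reconcileSidecar(pod *corev1.Pod) {
	version, ok := pod.Annotations[sidecarVersionAnnotation]
	if !ok {
		return // Service Mesh not requested for this Pod
	}
	image := fmt.Sprintf("example.registry/mosn:%s", version)
	for i := range pod.Spec.Containers {
		if pod.Spec.Containers[i].Name == "mosn" {
			pod.Spec.Containers[i].Image = image // upgrade path
			return
		}
	}
	// Access path: no Sidecar present yet, inject one.
	pod.Spec.Containers = append(pod.Spec.Containers, corev1.Container{
		Name:  "mosn",
		Image: image,
	})
}
```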

Access

Originally, Service Mesh access only supported injecting the Sidecar at Pod creation time. In-place access was introduced later to support fast access and rollback at scale.

  • Creation-time access:
    • The resource-replacement process requires a large amount of buffer capacity;
    • Rollback is difficult;
  • In-place access:
    • No resource reallocation is needed;
    • Rollback can be done in place;

In-place access and rollback require fine-grained modification of the Pod Spec, and practice has exposed quite a few problems, so the current capability has only undergone limited, small-scale testing.

Upgrade

Because Service Mesh sits deep in the business traffic path, the original Sidecar upgrade required restarting the business. Even this seemingly simple process ran into a serious problem:

  • The container startup sequence within a Pod is not guaranteed, so the business process could come up before the Sidecar and fail to start.

This was ultimately solved by the scheduling layer changing the startup logic so that the Pod waits for all Sidecars to finish starting, which in turn led to a second problem:

  • When a Sidecar started slowly, the upper-layer platform timed out.

The problem is still being solved.

Within the Sidecar, MOSN itself provides a flexible smooth-upgrade mechanism: the Operator triggers the running MOSN to start a second MOSN process, migrate its connections to it, and then exit. Small-scale tests show the whole process can complete without interrupting traffic and is almost imperceptible. Smooth upgrade, however, also involves heavy manipulation of the Pod Spec, so for stability ahead of the big promotion it has not yet been used at scale.
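
The listener hand-off at the heart of such a smooth upgrade can be sketched in Go as passing a listening socket between the old and new process over a Unix domain socket via SCM_RIGHTS. This is a generic illustration under assumed message formats, not MOSN's actual implementation, which also migrates the state of live connections:

```go
package upgrade

import (
	"net"
	"os"
	"syscall"
)

// sendListener lets the old process hand its listening socket to the new one.
func sendListener(ln *net.TCPListener, conn *net.UnixConn) error {
	f, err := ln.File() // duplicate the listener's file descriptor
	if err != nil {
		return err
	}
	defer f.Close()
	// Encode the descriptor as SCM_RIGHTS ancillary data.
	oob := syscall.UnixRights(int(f.Fd()))
	_, _, err = conn.WriteMsgUnix([]byte("listener"), oob, nil)
	return err
}

// recvListener rebuilds the listener inside the new process, so it can keep
// accepting connections while the old process drains and exits.
func recvListener(conn *net.UnixConn) (net.Listener, error) {
	buf := make([]byte, 32)
	oob := make([]byte, syscall.CmsgSpace(4))
	_, oobn, _, _, err := conn.ReadMsgUnix(buf, oob)
	if err != nil {
		return nil, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return nil, err
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return nil, err
	}
	f := os.NewFile(uintptr(fds[0]), "inherited-listener")
	defer f.Close()
	return net.FileListener(f)
}
```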

The problem of scale

As the rollout accelerated, the number of containers connected to Service Mesh exploded: it grew rapidly from roughly a thousand to more than 100,000, eventually reaching hundreds of thousands of containers across the whole site, and went through several version changes after the expansion.

While moving fast, gaps in platform capability also posed great challenges to operating Sidecars at this scale:

  • Messy version management:
    • The mapping between Sidecar versions and applications/zones is maintained as configuration on the internal metadata platform. Once large numbers of applications were onboarded, global versions, experimental versions, and special bugfix versions were mixed across multiple configuration items; the unified baseline was broken and hard to maintain.
  • Inconsistent metadata:
    • The metadata platform maintains Sidecar version information at Pod granularity, but because the Operator drives toward a desired end state, the metadata can drift from the underlying reality, so inspections are still needed to detect the drift.
  • Lack of a complete Sidecar operations platform:
    • No multi-dimensional global view;
    • No solidified grayscale release process;
    • Change configuration and management rely on manual experience;
  • Heavy monitoring noise;

Of course, both the Service Mesh and PaaS development teams are already building capabilities, and these issues are being mitigated.

Building technical risk capabilities

Monitoring

Our monitoring platform provides basic monitoring capabilities for the Service Mesh, as well as Sidecar monitoring at the application dimension, including:

  • System monitoring:
    • CPU;
    • MEM;
    • LOAD;
  • Service monitoring:
    • RT;
    • RPC traffic;
    • MSG traffic;
    • Error log monitoring;

The Service Mesh process itself also exposes a Metrics interface, used to collect and compute data at business granularity.
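
As an illustration only (the concrete interface MOSN exposes may differ), a Sidecar-style process can publish per-service counters on a local port for the monitoring platform to scrape, for example in a Prometheus-compatible form; the metric name and port below are assumptions:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// rpcRequests counts RPC requests proxied by the Sidecar, labelled by service.
var rpcRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "mesh_rpc_requests_total",
		Help: "RPC requests proxied by the Sidecar.",
	},
	[]string{"service"},
)

func main() {
	prometheus.MustRegister(rpcRequests)
	rpcRequests.WithLabelValues("demo-service").Inc()
	// The monitoring platform scrapes this local endpoint on every Pod.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe("127.0.0.1:34901", nil)
}
```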

Inspection

After Service Mesh went live, the following inspections were added over time:

  • Log volume check;
  • Version consistency;
  • Time-sharing scheduling state consistency;

Contingency plans and emergency response

Service Mesh itself can disable certain functions on demand; this is currently implemented through the configuration center (a sketch of the switch mechanism follows the list):

  • Log level degradation;
  • Tracelog per-category degradation;
  • Pilot dependency degradation;
  • Soft load balancing long-polling degradation;
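
A minimal sketch of how such switches might be consumed at runtime; the key names and the idea of a config-center push callback are assumptions for illustration, not MOSN's actual configuration model:

```go
package main

import (
	"log"
	"sync/atomic"
)

// Degradation switches pushed from the config center (names are illustrative).
var (
	pilotDegraded       atomic.Bool // stop depending on Pilot, fall back to local config
	longPollingDegraded atomic.Bool // stop soft load balancing long polling
)

// onConfigUpdate would be registered as the callback for config-center pushes.
func onConfigUpdate(key, value string) {
	switch key {
	case "mesh.log.level":
		log.Printf("degrading log level to %s", value) // e.g. INFO -> ERROR
	case "mesh.tracelog.enabled":
		log.Printf("tracelog categories enabled: %s", value)
	case "mesh.pilot.degraded":
		pilotDegraded.Store(value == "true")
	case "mesh.slb.longpolling.degraded":
		longPollingDegraded.Store(value == "true")
	default:
		log.Printf("ignoring unknown switch %q", key)
	}
}

func main() {
	// Simulate the config center pushing an emergency degradation.
	onConfigUpdate("mesh.pilot.degraded", "true")
	log.Println("pilot degraded:", pilotDegraded.Load())
}
```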

For the services that Service Mesh itself depends on, corresponding contingency plans were also prepared to guard against potential jitter:

  • Freeze the soft load balancing list (stop pushing changes);
  • Stop service registry pushes during peak hours;

Service Mesh is a very low-level component, and the current emergency measures are mainly restarts:

  • Independent restart of the Sidecar;
  • Restart of the whole Pod;

Change risk prevention and control

In addition to the traditional "three axes" of change management (changes that can be grayscaled, monitored, and rolled back), we also introduced unattended change prevention and control for Service Mesh changes: automatic detection, automatic analysis, and a change fuse.

Unattended change prevention and control focuses on the impact of a change on the system and the business, chaining multiple levels of detection in series, mainly:

  • System metrics: machine memory, disk, and CPU;
  • Service metrics: RT and QPS of both the business and the Service Mesh;
  • Service links: upstream and downstream business exceptions;
  • Global business metrics;

Through this series of prevention and control facilities, site-wide Service Mesh change risks can be discovered and blocked within a single batch of changes, avoiding risk amplification.
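
A minimal sketch of the fuse idea, with hypothetical check names and no real data sources: each batch must pass the detection levels in sequence, and any failure halts the remaining batches of the change.

```go
package main

import (
	"errors"
	"fmt"
)

// detection returns an error when the just-changed batch looks unhealthy.
type detection func(batch string) error

// runBatch runs the detections in series; the first failure fuses the change.
func runBatch(batch string, checks []detection) error {
	for _, check := range checks {
		if err := check(batch); err != nil {
			return fmt.Errorf("change fused at %s: %w", batch, err)
		}
	}
	return nil
}

func main() {
	checks := []detection{
		func(string) error { return nil },                    // system metrics: CPU, MEM, disk
		func(string) error { return nil },                    // service metrics: RT, QPS
		func(string) error { return nil },                    // upstream/downstream exceptions
		func(string) error { return errors.New("biz drop") }, // global business metrics
	}
	for _, batch := range []string{"batch-1", "batch-2"} {
		if err := runBatch(batch, checks); err != nil {
			fmt.Println(err) // stop releasing further batches
			break
		}
	}
}
```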

Looking ahead

During the rapid rollout of Service Mesh we encountered and solved a series of problems, but more remain. As a core component of the next generation of cloud-native middleware, Service Mesh's technical risk capabilities need continuous improvement and refinement. Going forward we need to keep building in the following areas:

  • Support for large-scale, efficient access and rollback;
  • More flexible change capabilities, covering both smooth and non-smooth changes that are imperceptible to the business;
  • More precise change prevention and control;
  • More efficient, lower-noise monitoring;
  • More complete control-plane support;
  • Application-dimension parameter customization;

We welcome anyone interested in middleware, Service Mesh, and cloud-native stability to join us and build the future of Service Mesh together.

Previous articles in this series

  • Ant Financial Service Mesh Large-scale Landing Series – Messaging
  • Ant Financial Service Mesh Large-scale Landing Series – Core
  • Head of Service Mesh implementation: four questions about Ant Financial's Double Eleven

Financial-Grade Distributed Architecture (Antfin_SOFA)