This article is compiled from a talk of the same name at an iQiyi Technology Salon. It introduces the business scenarios of ByteDance's stateful applications, the challenges and solutions in state management, the enhancement of basic capabilities and automated operation and maintenance during the move of stateful applications to cloud native, and the benefits that cloud native brings to stateful applications.

Author | Peng Zhao, senior R&D engineer, ByteDance Infrastructure Team

Background

When it comes to stateful applications, let us start with stateless services. Stateless means that application instances can be migrated smoothly and scaled horizontally, with no significant differences between instances. Such services map well onto cloud native objects such as the K8s Deployment and were therefore among the first beneficiaries of cloud native. A stateful application, in contrast, holds specific data and relies on it to provide its service. Large-scale stateful applications usually feature sharding, replicas, and data persistence. Stateful applications can be divided into data-stateful and network-stateful.

  • Data-stateful applications have the following characteristics:
    • Data dependency: they depend on local data at runtime.
    • Data persistence: data must not be lost across upgrades.
    • Instance dependency: service instances have primary/secondary or similar inter-instance dependencies, so each instance needs a unique ID.
  • Network-stateful applications: services in a container must maintain long-lived network sessions.

Network state is a distinct category from data state. This article focuses mainly on how data-stateful applications landed inside ByteDance.

Stateful application scenarios

There are many stateful applications inside ByteDance. Some common scenarios are:

  • Search and recall: instances need to load large models, which takes a long time. If the data had to be reloaded on every upgrade, network and storage resources would be wasted and service iteration would slow down considerably, so these services rely on local storage.
  • Push: some service instances have strong inter-instance dependencies or need unique instance IDs. In the push service, for example, each instance is responsible for pushing to one shard of users and therefore requires a unique ID.
  • Storage services: including a self-developed KV store (a Redis-like storage service), Druid, and ES. These combine both stateful characteristics above: they depend on local storage and also have inter-instance dependencies, i.e. the unique-ID requirement.

Before cloud native adoption, services were mostly deployed on physical machines. In the physical machine era, problems such as complex architecture, inflexible operation and maintenance, inconsistent machine environments, and resource fragmentation were never well solved. These are exactly the pain points that cloud native addresses; ByteDance's understanding of cloud native can be summarized in terms of efficiency and cost.

Efficiency

  • Infrastructure standardization: the cloud masks the complexity of the underlying systems (compute, storage, networking) and exposes a unified API through which users express their needs for the underlying infrastructure.
  • Business framework abstraction: the orchestration of business workloads can be managed uniformly.
  • Standardized process automation: updating and maintaining applications becomes easier.
  • Consistent delivery: a unified business runtime based on image and container technology.

Cost

  • Application iteration and release cost: containers can be pulled up in seconds, giving the business more room to iterate and grow.
  • Resource cost optimization: resources are allocated to the business on demand.

Of course, the road to cloud native has not been smooth: there were challenges in state management, basic capability enhancement, and automated operation and maintenance of stateful applications, and we solved many related technical problems along the way. Overall, on the internal K8s base, through orchestration optimization (CRDs, Controllers, Webhooks, and other capabilities) and basic capability enhancement (performance optimization and stronger storage capabilities), we have onboarded thousands of internal stateful services, covering 20,000+ nodes, 1,000,000+ CPU cores, and 50,000+ Pods.

State management for stateful applications

State management for stateful applications can be broken down into three problems:

  • Version management: management capabilities similar to K8s Deployment or StatefulSet, such as how to upgrade and roll back versions.
  • Data management: the external data a service depends on needs to be updated even while the set of service replicas stays unchanged.
  • Service discovery and routing: how requests are routed to the corresponding instances.

Let me give an example. Suppose we have a self-developed KV service holding a massive data set. Because the data is too large for a single instance to bear, we first split it into multiple shards: each key is assigned to a shard by taking its hash modulo the number of shards, and the Pods belonging to that shard serve that portion of the data externally. At the same time, to ensure high availability, each shard has multiple Pod replicas, possibly with a master/slave relationship among them. For such a stateful application, all instances can be laid out as a matrix, where each column is the set of Pod replicas serving the same shard. In addition, stateful applications are sensitive to external data, and the data may be updated even while the replica set stays unchanged; for example, the KV service needs to load the latest data version every hour and serve that version.
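As a concrete illustration of the hash-modulo sharding just described, here is a minimal Go sketch. The choice of hash function (FNV-1a) is an assumption for the example; the internal KV service may use a different one.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardOf maps a key to a shard index by hashing the key and taking the
// result modulo the number of shards. The Pods of that shard (one column
// of the instance matrix) are then responsible for serving the key.
func shardOf(key string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % numShards
}

func main() {
	// With 4 shards, each key lands on exactly one shard; all replicas of
	// that shard can answer reads for it.
	for _, key := range []string{"user:1001", "user:1002", "item:42"} {
		fmt.Printf("%s -> shard %d\n", key, shardOf(key, 4))
	}
}
```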

Corresponding to the example above, the whole instance matrix maps to the abstraction of a stateful service, SolarService. One column of the matrix, i.e. the Pods of one shard, corresponds to the StatefulSet Extension shown above. With this CRD we enhance StatefulSet with in-place upgrade (image version and environment variable updates), customizable upgrade ordering, and small-traffic/full-traffic release. In addition, the data must be updated in rotation while the service replicas stay unchanged; data management is handled by another CRD, Budset. Budset and StatefulSet Extension correspond one to one. The Budset Operator generates a number of Bud CRs according to the Budset definition (Bud and Pod correspond one to one), each expressing the expected data state of its Pod. We also inject a DataSync sidecar container into each Pod, which watches the Bud of its own Pod, downloads the data, and updates the Bud status. SolarService is the combination of StatefulSet Extension and Budset. Below are two examples of how the SolarService Controller works.
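The relationships between these CRDs can be sketched as simplified Go type definitions. The field names below are illustrative assumptions for readability, not the actual internal schemas.

```go
package solarservice

// SolarService is the top-level abstraction of a stateful service:
// a combination of StatefulSet Extensions (compute) and Budsets (data).
type SolarService struct {
	Shards       int                    // number of data shards (columns of the matrix)
	StatefulSets []StatefulSetExtension // one per shard
	Budsets      []Budset               // one per StatefulSet Extension (1:1)
}

// StatefulSetExtension enhances StatefulSet with in-place upgrade
// (image / env update), custom upgrade ordering, and small-traffic /
// full-traffic release.
type StatefulSetExtension struct {
	Replicas       int
	MaxUnavailable int // per-shard upgrade concurrency
	Image          string
}

// Budset declares the expected data state for every Pod of one shard.
// The Budset Operator expands it into one Bud per Pod.
type Budset struct {
	ShardID     int
	DataVersion string // e.g. the hourly data version the Pods should load
}

// Bud is the per-Pod expected data state. The DataSync sidecar in the Pod
// watches its own Bud, downloads the data, and reports status back.
type Bud struct {
	PodName     string
	DataVersion string
	Ready       bool
}
```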

Rolling upgrade

First, shards are rolled horizontally: multiple shards are upgraded concurrently, and the rolling granularity across shards is configurable. Within a shard, we upgrade multiple replicas concurrently according to the MaxUnavailable configuration of the StatefulSet Extension.
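The two-level concurrency (shard-level plus per-shard MaxUnavailable) can be sketched as below. This is a minimal illustration; `upgradeReplica` is a hypothetical stand-in for the in-place upgrade performed by the StatefulSet Extension controller.

```go
package upgrade

import "sync"

// rollingUpgrade sketches the two-level rolling strategy described above:
// up to shardConcurrency shards are upgraded in parallel, and inside each
// shard up to maxUnavailable replicas are upgraded at a time.
func rollingUpgrade(shards [][]string, shardConcurrency, maxUnavailable int,
	upgradeReplica func(pod string)) {

	sem := make(chan struct{}, shardConcurrency) // limits concurrent shards
	var wg sync.WaitGroup
	for _, pods := range shards {
		wg.Add(1)
		sem <- struct{}{}
		go func(pods []string) {
			defer wg.Done()
			defer func() { <-sem }()
			// Within one shard, upgrade replicas in batches of maxUnavailable.
			for i := 0; i < len(pods); i += maxUnavailable {
				end := i + maxUnavailable
				if end > len(pods) {
					end = len(pods)
				}
				var batch sync.WaitGroup
				for _, pod := range pods[i:end] {
					batch.Add(1)
					go func(p string) {
						defer batch.Done()
						upgradeReplica(p)
					}(pod)
				}
				batch.Wait()
			}
		}(pods)
	}
	wg.Wait()
}
```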

Capacity expansion

Capacity expansion can be divided into two situations:

  • Increasing the number of replicas of a shard: this is the simple case, similar to scaling a normal StatefulSet; just adjust the Replicas field.
  • Expanding the number of data shards: the number of data shards needs to grow. Currently this only supports scaling by an integer multiple, and it is broken down into several steps, as shown below:

Suppose that initially the full data set lives in Budset 1, and the Pods of StatefulSet Extension 1 all load Budset 1's data. When scaling out, the first step is to scale the StatefulSet Extensions; at this point the Pods of StatefulSet Extension 1 and 2 both hold the full data. Then the Budset is updated, cutting the original full data set into two shards. Once these steps complete, service discovery is updated and the Pods of StatefulSet Extension 2 officially start to accept traffic. One problem remains: after the Budset changes, the Pods of both StatefulSet Extensions still hold the full data. Here we cooperate with the business framework: some businesses define their own data-expiry TTL logic, in which case we simply wait for the stale data to cool down; others customize an operation that triggers a data compaction to evict the unwanted data.
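The sequence can be summarized in a short controller-side sketch. The helper functions are hypothetical stand-ins for the SolarService Controller's real reconciliation steps; the stubs exist only so the sketch compiles.

```go
package expand

// Hypothetical helpers standing in for the controller's real steps.
func scaleStatefulSetExtensions(shards int) error { return nil }
func updateBudsets(shards int) error              { return nil }
func updateServiceDiscovery() error               { return nil }

// ExpandDataShards sketches the shard-doubling flow described above.
func ExpandDataShards(currentShards int) error {
	newShards := currentShards * 2 // expansion is by an integer multiple

	// 1. Scale out first: the new StatefulSet Extension's Pods initially
	//    load the same full data as the old one.
	if err := scaleStatefulSetExtensions(newShards); err != nil {
		return err
	}
	// 2. Update the Budsets: the full data is now split across newShards
	//    shards, so each column is only expected to hold its own slice.
	if err := updateBudsets(newShards); err != nil {
		return err
	}
	// 3. Finally update service discovery so the new Pods start taking
	//    traffic; stale keys in old Pods age out via TTL or compaction.
	return updateServiceDiscovery()
}
```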

Service discovery and routing

Service discovery and routing involves two main points. As in the earlier example, each Pod column in the instance matrix of a stateful service serves a different data shard, so when a request arrives we need to know which shard's instances it should be routed to.

The figure shows a Proxy component in the business layer that uniformly dispatches requests to the Pods of the corresponding StatefulSet Extension. At the same time, a shard has multiple Pod replicas whose error rates are not exactly equal, due to slight differences in host performance or other factors. Here a second layer of routing logic can further refine routing and circuit breaking based on per-Pod error rates within a StatefulSet Extension. Because of this complex custom logic, we do not rely on the K8s Service to route requests. Instead, we register host IP and port through our own service discovery infrastructure, writing them into the service discovery component via a KV store. For stateful services, ShardID, ReplicaID, the total number of shards, and other information are injected into the service discovery component, so the upper-layer framework can read them from the KV store and implement its own circuit-breaking and routing strategy. The Proxy component in the figure is a common service form: it encapsulates the stateful service and performs routing and forwarding. Request forwarding can also be combined with a service mesh: with a fat client, the upstream service routes each request directly to the corresponding Pod, reducing the overhead of the Proxy layer.
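A minimal sketch of what the per-Pod registration into the KV-backed service discovery might look like. The key layout and field names are assumptions; only the information mentioned above (host IP/port plus shard metadata) is modeled.

```go
package discovery

import (
	"encoding/json"
	"fmt"
)

// Endpoint is the per-Pod record written into the KV-backed service
// discovery component.
type Endpoint struct {
	HostIP      string `json:"host_ip"`
	Port        int    `json:"port"`
	ShardID     int    `json:"shard_id"`
	ReplicaID   int    `json:"replica_id"`
	TotalShards int    `json:"total_shards"`
}

// KV abstracts the service discovery store; hypothetical interface.
type KV interface {
	Put(key, value string) error
}

// Register writes one Pod's endpoint under a per-service, per-shard key so
// that the Proxy or a fat client can list all replicas of a shard and apply
// its own routing / circuit-breaking policy.
func Register(kv KV, service string, ep Endpoint) error {
	key := fmt.Sprintf("/discovery/%s/shard-%d/replica-%d", service, ep.ShardID, ep.ReplicaID)
	val, err := json.Marshal(ep)
	if err != nil {
		return err
	}
	return kv.Put(key, string(val))
}
```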

Enhanced basic capabilities

Our basic capability enhancements include scheduling and storage.

Scheduling

In terms of scheduling, to pursue extreme performance we enhanced the K8s Scheduler and Kubelet around the NUMA architecture of modern servers. NUMA (non-uniform memory access) means that in a multi-core server the latency of CPU access to different regions of memory differs. This is not limited to memory: GPUs and network cards exhibit the same microtopology affinity. By binding service Pods to NUMA nodes, i.e. to memory adjacent to the CPUs they run on, server performance can be pushed to the limit at the system level. The specific practices are as follows (a simplified allocation sketch follows the list):

  • Kubelet reports the available microtopology resources and totals of its node through a CRD.
  • When a Pod enters scheduling, a self-developed predicate filters, in the predicate phase, the nodes whose microtopology can satisfy the Pod.
  • In the priority phase, a self-developed priority packs NUMA resource allocations as tightly as possible to reduce fragmentation.
  • In Kubelet, i.e. at the single-node level, a self-developed CPU Manager Policy computes the CPU set and NUMA node for the Pod at Pod admit time; when Kubelet syncs the Pod, the CPU cores the Pod may use and the corresponding NUMA node are set through the native CRI interface.
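The sketch below illustrates the bin-packing idea: pick a NUMA node that can satisfy the CPU request, preferring the tightest fit so allocations stack up and fragmentation is reduced. This is an illustrative policy only, not the actual self-developed CPU Manager Policy.

```go
package numa

import "errors"

// Node describes one NUMA node's CPU capacity as reported by Kubelet's
// microtopology CRD (simplified; the real CRD also covers memory, GPUs, NICs).
type Node struct {
	ID       int
	FreeCPUs []int // free logical CPU IDs on this NUMA node
}

// Pick chooses a NUMA node for a Pod requesting `request` CPUs: infeasible
// nodes are filtered out (predicate), and among feasible ones the node with
// the fewest free CPUs wins (priority), which packs allocations tightly.
func Pick(nodes []Node, request int) (nodeID int, cpus []int, err error) {
	best := -1
	for i, n := range nodes {
		if len(n.FreeCPUs) < request {
			continue // predicate: this NUMA node cannot satisfy the request
		}
		if best == -1 || len(n.FreeCPUs) < len(nodes[best].FreeCPUs) {
			best = i // priority: tightest fit wins
		}
	}
	if best == -1 {
		return 0, nil, errors.New("no NUMA node can satisfy the CPU request")
	}
	// The chosen CPU set and NUMA node would then be applied at Pod admit
	// time and passed down through the CRI when Kubelet syncs the Pod.
	return nodes[best].ID, nodes[best].FreeCPUs[:request], nil
}
```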

Storage

In terms of storage, we support multiple storage media through Dynamic Provisioning, as well as remote block storage and online expansion of local storage. The remote block storage solution implements the standard CSI interface on top of NBD (the attach and detach steps are removed in the internal implementation). This NBD-backed network device currently supports two modes: single-writer/single-reader and multi-reader (shared read-only). Of the components in the figure, the External Provisioner and the node-level CSI Plugin are self-developed; the others are native.

When the StatefulSet Extension creates the corresponding PVC and Pod, the External Provisioner watches the PVC create event and provisions a PV to bind to the PVC. Once the Pod reaches the node, the CSI driver executes the corresponding CSI standard calls in its NodeServer, including NodeStageVolume, NodePublishVolume, and so on.
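As a rough illustration of the node-side staging for the NBD-backed block volume, the sketch below connects the NBD device and mounts it at the staging path. The request type is a simplified stand-in for the CSI NodeStageVolumeRequest, and the command name and flags are assumptions about an nbd-client-style tool, not the internal implementation.

```go
package nbdcsi

import (
	"fmt"
	"os/exec"
	"syscall"
)

// stageRequest is a simplified stand-in for the CSI NodeStageVolumeRequest;
// only the fields needed for this sketch are kept.
type stageRequest struct {
	VolumeID    string
	Server      string // remote block storage endpoint, e.g. "10.0.0.1:10809"
	Device      string // local NBD device, e.g. "/dev/nbd0"
	StagingPath string // global mount point managed by Kubelet
	FSType      string // e.g. "ext4"
}

// nodeStageVolume sketches staging: attach the remote volume as a local NBD
// device, then mount it at the staging path. NodePublishVolume would later
// bind-mount this path into the Pod's own directory.
func nodeStageVolume(req stageRequest) error {
	// 1. Attach the remote volume as a local NBD block device
	//    (command and flags are illustrative assumptions).
	if out, err := exec.Command("nbd-client", req.Server, req.Device,
		"-name", req.VolumeID).CombinedOutput(); err != nil {
		return fmt.Errorf("nbd connect failed: %v: %s", err, out)
	}
	// 2. Mount the device at the global staging path.
	if err := syscall.Mount(req.Device, req.StagingPath, req.FSType, 0, ""); err != nil {
		return fmt.Errorf("mount %s on %s failed: %v", req.Device, req.StagingPath, err)
	}
	return nil
}
```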

Local storage

First, a little background on Volume Scheduling in the community. Volume Scheduling means that when the scheduler places a Pod, it manages and allocates the Pod's storage resources together with its compute resources (CPU, memory, etc.). Volume Scheduling has three stages:

  • Predicate: similar to the Pod resource predicates, the PV's node affinity determines which nodes the Pod can be scheduled to.
  • Assume Volume: the scheduler pre-decides that a PVC should be bound on a given node; the native scheduler annotates the PVC to indicate the node it expects the PVC to be bound to.
  • Bind: the scheduler checks whether the PVC and PV have been bound successfully, and if so, continues to bind the Pod to the node.

This is also the key to our Dynamic Provisioning of local volumes. The community Local Persistent Volume (LPV) solution only supports Static Provisioning, whereas our internal LPV Dynamic Provisioning creates volumes dynamically by listening for the scheduler's Assume Volume step. To be specific (a sketch follows the steps):

  1. Unlike the External Provisioner case, the LPV Provisioner and Driver are packaged into one binary and deployed as a daemon; each daemon Pod reports its node's storage resources through a CRD. When a Pod enters scheduling, the self-developed predicate filters nodes by their remaining available storage resources to select feasible nodes.
  2. The LPV Provisioner watches for PVCs that the scheduler has pre-assigned to its node. Once the scheduler assumes the volume (updating the PVC annotation), it tries to create a PV and bind it to the PVC. If PV creation fails, the annotation the scheduler put on the PVC is cleaned up, which triggers the scheduler to re-schedule.
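The second step can be sketched as below. The annotation key is the upstream "selected node" annotation used by the scheduler; `createLocalPV` and `clearAnnotation` are hypothetical helpers standing in for the real provisioning and API calls.

```go
package lpv

import "fmt"

// pvc is a simplified stand-in for a PersistentVolumeClaim; only the pieces
// relevant to this sketch are modeled.
type pvc struct {
	Name        string
	Annotations map[string]string
	SizeGiB     int
}

const selectedNodeAnnotation = "volume.kubernetes.io/selected-node"

// reconcilePVC sketches what the node-local LPV Provisioner does when it
// observes a PVC: if the scheduler has assumed the PVC onto this node, try
// to carve out a local volume and create a PV for binding; on failure,
// clear the scheduler's annotation so the Pod is rescheduled.
func reconcilePVC(nodeName string, claim *pvc,
	createLocalPV func(claim *pvc) error,
	clearAnnotation func(claim *pvc) error) error {

	if claim.Annotations[selectedNodeAnnotation] != nodeName {
		return nil // not assumed onto this node; nothing to do
	}
	if err := createLocalPV(claim); err != nil {
		// Provisioning failed (e.g. not enough free space); removing the
		// annotation triggers the scheduler to re-schedule the Pod.
		if cerr := clearAnnotation(claim); cerr != nil {
			return fmt.Errorf("provision failed (%v) and annotation cleanup failed: %v", err, cerr)
		}
		return fmt.Errorf("provision failed, PVC %s released for rescheduling: %v", claim.Name, err)
	}
	return nil
}
```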

Internal local storage supports several storage media:

  • tmpfs backed by memory
  • Logical volumes based on LVM
  • Whole-disk allocation, which isolates the I/O of different services
  • AEP

AEP is a non-volatile storage device launched by Intel whose performance is far better than SSD. AEP can be used in two ways: as memory or as a disk. In our scenario we use it as a disk: we create a filesystem on the AEP device, mount it in fsdax mode (since the latency of the AEP device itself is already comparable to memory, there is no need for the extra overhead of the operating system page cache), and write to the filesystem directly. An AEP device can be divided into several namespaces, which can be treated as several disks. From the volume perspective, allocating AEP as a disk is similar to LPV allocation of local disks. However, AEP devices also require NUMA-affine allocation; that is, CPU, memory, and device allocation must be managed together. K8s v1.16 introduced the Topology Manager feature, which takes into account the proximity between devices and CPUs. We extended the Topology Manager policy to allocate CPUs, memory, and devices together so as to maximize device performance.
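A minimal sketch of the fsdax mount described above, assuming the AEP namespace is exposed as a pmem block device and the filesystem (e.g. ext4) has already been created; the device path and mount point are illustrative.

```go
package aep

import (
	"fmt"
	"syscall"
)

// mountFsdax mounts one AEP namespace (e.g. /dev/pmem0) with the "dax"
// option so reads and writes go directly to persistent memory, bypassing
// the operating system page cache.
func mountFsdax(device, mountPoint string) error {
	if err := syscall.Mount(device, mountPoint, "ext4", 0, "dax"); err != nil {
		return fmt.Errorf("dax mount of %s at %s failed: %v", device, mountPoint, err)
	}
	return nil
}
```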

Monitoring and automated operation and maintenance

Monitoring system

We developed SysProbe, an eBPF-based container-level system monitoring component that collects 100+ metrics at both the host and container level. In addition, a self-developed, highly available Metrics Aggregation Server (MAS) continuously pulls SysProbe metrics and connects to multiple downstream sinks, such as MQ and TSDB, to provide users with rich monitoring dashboards.

Automated operation and maintenance

In terms of automated operation and maintenance, let us look at what we did around the PDB (PodDisruptionBudget). Compared with stateless applications, stateful applications place higher demands on automated operation and maintenance:

  • The cost of stateful Pod state recovery is high.
  • K8s does not know the priority of migration and lacks meta information for automated operation and maintenance.

To address these issues, we perform Pod Eviction for host maintenance: before a host goes offline, its Pods are evicted through the K8s API. The main reason we do not use the DELETE Pod interface is that Eviction checks the K8s PDB resources, which lets us customize the eviction policy by extending the PDB (blocking Eviction requests through a Webhook).

The figure above shows an example of cross-datacenter eviction. Multiple AZs in a region may each run their own K8s cluster, in which equivalent SolarServices are deployed and belong to the same service. During eviction, the instance ratio between the two AZs in the figure must be considered together; otherwise we could end up in a situation where the total count still meets the requirement but all Pods in one AZ have been evicted and that AZ's error rate spikes. We implement this by storing our custom policy, PDB Extension, as a CRD in a cross-AZ Meta K8s cluster and checking there whether each eviction is allowed.
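The per-AZ check the PDB Extension webhook performs could look roughly like the sketch below. The data layout and the threshold parameter are assumptions; only the cross-AZ ratio idea from the text is modeled.

```go
package pdbext

// azState captures, for one SolarService, how many replicas each AZ should
// have and how many are still serving. These numbers would come from the
// PDB Extension CRD stored in the cross-AZ Meta K8s cluster.
type azState struct {
	Desired int
	Healthy int
}

// allowEviction sketches the check applied to an Eviction request: the
// eviction is rejected if it would push the target AZ below
// minAvailablePerAZ of its desired replicas, even when the total across
// AZs would still be fine.
func allowEviction(states map[string]azState, targetAZ string, minAvailablePerAZ float64) bool {
	s, ok := states[targetAZ]
	if !ok || s.Desired == 0 {
		return false // unknown AZ: be conservative and block the eviction
	}
	healthyAfter := s.Healthy - 1
	return float64(healthyAfter) >= minAvailablePerAZ*float64(s.Desired)
}
```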

CSI Race Condition

In addition, our cloud native practice ran into quite a few CSI problems. When a Pod is removed, there are two corresponding functions in the native CSI interface:

  1. NodeUnpublishVolume: calls the CSI driver to remove the mount point corresponding to the Pod.
  2. NodeUnstageVolume: Unmounts the volume from the node.

However, when Kubelet deletes a Pod, it only checks that the first step has completed. Within this short time window, host maintenance or other operations can therefore trigger race conditions.

The global mount point remains

In this case, the Pod's mount point is cleared after the first function runs, but the volume's global mount remains on the host. If Kubelet restarts and finds that the Pod has already been deleted, it only reconstructs the volumes used by Pods still alive on the node, so volumes that never completed unstage are never removed. Our solution is to add a residual mount point scan to the Kubelet Volume Manager for FS-type volumes, which cleans up the leftover mount points.
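A minimal sketch of such a residual-mount scan, assuming the usual kubelet CSI global-mount directory layout; the exact prefix and the cleanup action (unmount) are assumptions.

```go
package cleanup

import (
	"bufio"
	"os"
	"strings"
)

// findResidualMounts walks /proc/mounts and returns mount points under the
// CSI global-mount directory that no longer belong to any volume Kubelet
// still tracks; these are candidates for unmounting.
func findResidualMounts(tracked map[string]bool) ([]string, error) {
	const globalMountPrefix = "/var/lib/kubelet/plugins/kubernetes.io/csi/"

	f, err := os.Open("/proc/mounts")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var residual []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// /proc/mounts format: "<device> <mountpoint> <fstype> <options> ..."
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 {
			continue
		}
		mountPoint := fields[1]
		if strings.HasPrefix(mountPoint, globalMountPrefix) && !tracked[mountPoint] {
			residual = append(residual, mountPoint)
		}
	}
	return residual, scanner.Err()
}
```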

Reopening a volume that is still being unmounted

This also happens in the window after Kubelet removes the Pod and before NodeUnstageVolume completes. If a Pod is deleted without NodeUnstageVolume having finished, while a replacement Pod has already been created, scheduled on another node, and needs to mount the same volume, the new node's Kubelet will try to attach the volume again on the storage side. For the NBD-based block device mentioned earlier, in single-writer/single-reader mode, if the new node attempts to establish an NBD connection while the old connection still exists, the storage server detects an exception and raises an alarm.

Case study

Finally, a problem encountered during onboarding: when NBD instability caused a host-level failure, all Pods on that host were affected. We are therefore also actively exploring lightweight virtualization solutions based on Kata to shrink the blast radius of such failures from host granularity to Pod granularity.

Conclusion

In ByteDance's cloud native journey, after stateless applications, stateful applications have gradually been onboarded onto cloud native as well. Stateful applications generally have the following characteristics:

  • Data has local dependencies;
  • Data persistence: data must not be lost across upgrades;
  • There are relationships between service replicas, which require unique IDs to distinguish them.

Cloud native adoption brings efficiency and cost benefits to stateful applications and solves several operation and maintenance pain points from the physical machine era:

  • State management: enhanced orchestration linked with service discovery;
  • Extreme performance: pursuing extreme server performance through scheduling enhancements and microtopology awareness;
  • Storage capability: improved storage capability through various storage media;
  • Operation and maintenance: a higher degree of automation, customized policies based on PDB Extension, and service maintenance by way of eviction.