Author | Zhou Pei, Container Group expert, New Oriental Architecture Department

Building stateful services on K8s has always been a challenging task. In moving its stateful services to the cloud, New Oriental combines customized Operators with a self-developed local storage service, which strengthens the native K8s local storage capabilities and steadily advances the company's containerization effort.

Current status of New Oriental's stateful services in K8s

As shown in the figure above, the Pods at the top are managed by custom Operators and the StatefulSet controller; each Pod is associated with a PVC, each PVC is bound to a PV, and the bottom layer is the storage service.

The storage layer includes both local and remote storage. For ordinary storage requirements, remote storage is preferred; for high-performance IO requirements, local storage is chosen. At present, local storage includes the native K8s local storage service and the self-developed XLSS storage service.

Native K8s support for stateful services

Native K8s support for stateful services is the foundation of stateful service construction. Its management model is: StatefulSet controller + storage service.

1. StatefulSet controller

StatefulSet controller:

A controller used to manage the workload API object for stateful applications. It manages the deployment and scaling of a set of Pods, and provides persistent storage and persistent identifiers for those Pods.

Features of StatefulSet resources:

  • Stable, unique network identifiers
  • Stable, persistent storage
  • Ordered, graceful deployment and scaling
  • Ordered, automated rolling updates

Limitations of StatefulSet resources:

  • Storage must be provisioned by a PV provisioner or pre-provisioned by an administrator; the StatefulSet controller does not provision storage itself.
  • Deleting or scaling down a StatefulSet does not delete the associated volumes; the controller is only responsible for the Pods.
  • A headless Service has to be created manually to provide a stable network identity for each Pod (a minimal sketch of this pairing follows the summary below).
  • Deleting a StatefulSet does not guarantee graceful Pod termination; it is recommended to scale down to 0 before deleting.
  • Ordering also introduces dependencies: a Pod with a higher ordinal depends on the Pods before it, so if an earlier Pod fails to start, later Pods will not start.

These five limitations can be summarized as follows: the StatefulSet controller manages the Pods and only part of the storage lifecycle (for example, creating PVCs when scaling out), and leaves everything else to the user. The dependencies introduced by ordering can also have negative effects, which have to be resolved by manual intervention.
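To make the headless Service limitation concrete, below is a minimal sketch of the Service/StatefulSet pairing that has to be created, expressed with the Kubernetes Go API types. The names (`demo`, `demo-headless`), port, and image are made up for illustration; this is not New Oriental's actual manifest.

```go
// Minimal sketch (illustrative names only) of the headless Service that has to be
// created by hand, plus the StatefulSet that references it via spec.serviceName.
package manifests

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

var labels = map[string]string{"app": "demo"}

// Headless Service: clusterIP None gives each Pod a stable DNS name
// (demo-0.demo-headless, demo-1.demo-headless, ...).
var headlessSvc = &corev1.Service{
	ObjectMeta: metav1.ObjectMeta{Name: "demo-headless"},
	Spec: corev1.ServiceSpec{
		ClusterIP: corev1.ClusterIPNone,
		Selector:  labels,
		Ports:     []corev1.ServicePort{{Name: "tcp", Port: 9092}},
	},
}

// StatefulSet: serviceName must point at the headless Service above.
var statefulSet = &appsv1.StatefulSet{
	ObjectMeta: metav1.ObjectMeta{Name: "demo"},
	Spec: appsv1.StatefulSetSpec{
		ServiceName: "demo-headless",
		Replicas:    int32Ptr(3),
		Selector:    &metav1.LabelSelector{MatchLabels: labels},
		Template: corev1.PodTemplateSpec{
			ObjectMeta: metav1.ObjectMeta{Labels: labels},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{Name: "app", Image: "demo:latest"}},
			},
		},
		// VolumeClaimTemplates would go here; the controller creates one PVC per Pod
		// from them, but never deletes those PVCs on scale-down (limitation above).
	},
}
```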

2. Storage

Cloud Native Storage

This is a screenshot of the cloud native storage section of the CNCF landscape. It lists more than 50 storage products, nearly half of which are commercial. Most of the open source products are remote storage, covering file systems, object storage, and block storage.

K8s PV types

The data comes from the official documentation: native K8s supports PV types such as RBD, HostPath, and Local.

How to choose?

On the controller side, only the StatefulSet controller is available; on the storage side, there are many storage products and PV types. How do we choose?

New Oriental considers the following factors when choosing a storage product:

  • Open source vs. commercial
  • Local vs. remote
  • Dynamic provisioning vs. static provisioning
  • Data high-availability solution

Making a choice can be a headache. Open source products are free, but their stability is hard to guarantee and their capabilities may be limited; commercial products offer guaranteed capability and stability, but at a cost. There is no universal answer; the final choice depends on the requirements.

Self-developed storage product XLSS

1. Key requirements

The key requirements of New Oriental's stateful service construction are: good performance, to support IO-intensive applications; data availability, with a degree of disaster recovery capability; and dynamic provisioning, to achieve fully automated management of stateful services.

2. Introduction to XLSS

XDF Local Storage Service (XLSS) is a high-performance, high-availability storage solution built on local storage. It addresses the shortcomings of the native local storage scheme in K8s: local PVs can only be provisioned statically; the Pod-to-node affinity binding that a local PV imposes reduces availability; and local storage carries a risk of data loss.
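For context, here is roughly what a statically provisioned local PV looks like, expressed with the Kubernetes Go API types. The disk path and node name are made up; the point is the required node affinity, which is exactly the binding that pins consuming Pods to a single node.

```go
// Sketch of a statically provisioned local PV: the required node affinity ties any
// Pod using this volume to node-1, which is the availability problem XLSS works
// around. The path and node name are illustrative.
package storage

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var localPV = &corev1.PersistentVolume{
	ObjectMeta: metav1.ObjectMeta{Name: "data-node-1"},
	Spec: corev1.PersistentVolumeSpec{
		Capacity:    corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("50Gi")},
		AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
		PersistentVolumeSource: corev1.PersistentVolumeSource{
			Local: &corev1.LocalVolumeSource{Path: "/mnt/disks/vol1"},
		},
		// The volume is only usable on node-1; a Pod bound to it can never drift away.
		NodeAffinity: &corev1.VolumeNodeAffinity{
			Required: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "kubernetes.io/hostname",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"node-1"},
					}},
				}},
			},
		},
	},
}
```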

Application scenarios

  • High-performance, IO-intensive applications such as Kafka
  • Dynamic management of local storage
  • Data security: regular backup of application data, with encryption of the backup data
  • Storage resource monitoring and alerting, for example on K8s PV resource usage

3. XLSS In K8s

As shown in the figure above, the three XLSS components run as containers in the K8s cluster, use local storage to provide storage services for stateful workloads, and perform periodic data backup jobs. XLSS also exposes metrics about the storage and the related jobs.

4. XLSS core components

The main XLSS components are:

  • xlss-scheduler
    • A custom scheduler based on kube-scheduler
    • Schedules stateful service Pods; it automatically identifies Pods that use XLSS local PVs, intervenes intelligently in their scheduling, and eliminates the availability loss caused by the Pod-to-node affinity binding
  • xlss-rescuer
    • Runs in the K8s cluster as a DaemonSet
    • Performs data backup jobs according to backup policies
    • Watches for data recovery requests and performs data recovery jobs
    • Exposes metrics
  • xlss-localpv-provisioner
    • Dynamically provisions local storage

5. Implementation of the key xlss-scheduler logic

The figure above shows the scheduling framework model of the K8s scheduler, which exposes many extension points along the scheduling process. xlss-scheduler builds on this framework and is implemented as custom plugins, which mainly enhance three extension points:

  • PreFilter: analyzes the health of the node the Pod is bound to by affinity; if that node is abnormal, sets a special flag for the Pod.
  • Filter: ignores the node affinity for Pods that carry the special flag.
  • PreBind: for Pods with the special flag, removes the flag and sends a data recovery request based on the scheduling result (sketched below).
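The sketch below shows how these three extension points could look as scheduler-framework plugins. The interface shapes follow the upstream k8s.io/kubernetes/pkg/scheduler/framework package (pre-1.24 signatures, which vary by Kubernetes version); the state key and the two helper functions are hypothetical stand-ins, not XLSS's real code.

```go
// Illustrative sketch of the three extension points as scheduler-framework plugins.
package scheduler

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const flagKey framework.StateKey = "xlss/node-unhealthy" // hypothetical state key

type flagState struct{}

func (f *flagState) Clone() framework.StateData { return f }

type XlssScheduler struct{ handle framework.Handle }

func (s *XlssScheduler) Name() string { return "XlssScheduler" }

// PreFilter: inspect the node the Pod's local-PV affinity points at; if that node
// is unhealthy, record a flag in the scheduling cycle state.
func (s *XlssScheduler) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) *framework.Status {
	if affinityNodeUnhealthy(s.handle, pod) {
		state.Write(flagKey, &flagState{})
	}
	return nil // nil status == success
}

func (s *XlssScheduler) PreFilterExtensions() framework.PreFilterExtensions { return nil }

// Filter: for flagged Pods, ignore the stale node affinity so healthy nodes pass.
func (s *XlssScheduler) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	if _, err := state.Read(flagKey); err == nil {
		return nil // flagged: accept this node regardless of the old affinity
	}
	return nil // unflagged Pods fall through to the default affinity plugins
}

// PreBind: for flagged Pods, drop the flag and issue a data recovery request
// (a CRD instance) targeting the node that was finally chosen.
func (s *XlssScheduler) PreBind(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) *framework.Status {
	if _, err := state.Read(flagKey); err == nil {
		state.Delete(flagKey)
		createRecoveryRequest(pod, nodeName)
	}
	return nil
}

// Hypothetical helpers, stubbed out for the sketch.
func affinityNodeUnhealthy(h framework.Handle, pod *v1.Pod) bool { return false }
func createRecoveryRequest(pod *v1.Pod, nodeName string)         {}
```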

6. xlss-rescuer data backup job logic

The figure has three parts: a loop on the left, a loop on the right, and a cache queue in the middle through which the two communicate. The loop on the left collects backup job policies and updates them into the cache queue, in three main steps:

  1. Watch Pod events
  2. Read the backup policy from the Pod's annotations, which carry the configuration for the backup job (a parsing sketch follows this list)
  3. Synchronize the backup policy into the cache queue
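The annotation format is not given in the talk, so the keys and the BackupPolicy type below are purely hypothetical; the sketch only illustrates the idea of deriving a backup job's configuration from Pod annotations.

```go
// Hypothetical sketch: derive a backup policy from Pod annotations.
// The annotation keys and the BackupPolicy shape are illustrative, not XLSS's real format.
package rescuer

import (
	"strconv"
	"time"

	v1 "k8s.io/api/core/v1"
)

type BackupPolicy struct {
	PodKey   string        // namespace/name of the Pod the policy belongs to
	Interval time.Duration // how often to take a snapshot
	Keep     int           // how many snapshots to retain
	NextRun  time.Time     // when the next backup job should fire
}

func policyFromPod(pod *v1.Pod) (*BackupPolicy, bool) {
	raw, ok := pod.Annotations["xlss.example.com/backup-interval"] // hypothetical key
	if !ok {
		return nil, false // Pod has not opted in to backups
	}
	interval, err := time.ParseDuration(raw)
	if err != nil {
		return nil, false
	}
	keep, _ := strconv.Atoi(pod.Annotations["xlss.example.com/backup-keep"]) // hypothetical key
	return &BackupPolicy{
		PodKey:   pod.Namespace + "/" + pod.Name,
		Interval: interval,
		Keep:     keep,
		NextRun:  time.Now().Add(interval),
	}, true
}
```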

The loop on the right executes backup jobs, also in three steps (sketched after the list):

  1. Sort the cache queue in ascending order by the time at which each element's next backup job should run
  2. Wait: if the current time has not yet reached the execution time of the nearest backup job, sleep until it is due
  3. Execute the backup job
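Continuing the hypothetical sketch above (same made-up rescuer package and BackupPolicy type), the executor loop could look roughly like this; the real xlss-rescuer shares the queue between two goroutines, which is simplified away here.

```go
// Executor loop: sort the queue, wait for the nearest due job, run it, reschedule it.
package rescuer

import (
	"sort"
	"time"
)

func runBackups(queue []*BackupPolicy, backup func(*BackupPolicy) error) {
	for len(queue) > 0 {
		// 1. Sort ascending by the time the next backup job should run.
		sort.Slice(queue, func(i, j int) bool { return queue[i].NextRun.Before(queue[j].NextRun) })
		next := queue[0]

		// 2. Wait: if the nearest job is not due yet, sleep until it is.
		if wait := time.Until(next.NextRun); wait > 0 {
			time.Sleep(wait)
		}

		// 3. Execute the backup job, then schedule its next run.
		if err := backup(next); err != nil {
			// A real implementation would log and retry; the sketch just reschedules.
		}
		next.NextRun = time.Now().Add(next.Interval)
	}
}
```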

7. xlss-rescuer data recovery logic

Data recovery follows a similar design to data backup, but the concrete logic differs.

The loop on the left watches for recovery job requests and updates them into the cache queue, in three main steps:

  1. Watch the CRD to receive the data recovery requests sent by xlss-scheduler (recovery requests are implemented as CRD instances)
  2. Check the CRD status to avoid processing the same request twice
  3. Synchronize the recovery requests into the cache queue

The loop on the right executes recovery jobs in four steps (a sketch follows this list):

  1. Update the CRD instance status
  2. Restore the snapshot data into the specified directory
  3. Update the PV and PVC
  4. Delete the CRD instance
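The following sketch mirrors those four steps with a hypothetical recovery-request type and client interface; none of the names are XLSS's real API, they only make the order of operations concrete.

```go
// Hypothetical sketch of the recovery executor. RecoveryRequest and the client
// interface are invented for illustration; only the four-step order mirrors the text.
package rescuer

import "context"

type RecoveryRequest struct {
	Name      string
	PVName    string
	PVCName   string
	TargetDir string // directory on the newly chosen node
	NodeName  string // node picked by xlss-scheduler
}

type recoveryClient interface {
	UpdateStatus(ctx context.Context, name, phase string) error
	Delete(ctx context.Context, name string) error
}

func handleRecovery(ctx context.Context, c recoveryClient, req *RecoveryRequest,
	restoreSnapshot func(pvName, targetDir string) error,
	rebindPV func(pvName, pvcName, nodeName string) error) error {

	// 1. Update the CRD instance status so the watcher will not enqueue it again.
	if err := c.UpdateStatus(ctx, req.Name, "Recovering"); err != nil {
		return err
	}
	// 2. Restore the latest snapshot into the target directory on the new node.
	if err := restoreSnapshot(req.PVName, req.TargetDir); err != nil {
		return err
	}
	// 3. Update the PV and PVC (e.g. node affinity / binding) to point at the new node.
	if err := rebindPV(req.PVName, req.PVCName, req.NodeName); err != nil {
		return err
	}
	// 4. Delete the CRD instance now that the request has been fulfilled.
	return c.Delete(ctx, req.Name)
}
```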

8. xlss-localpv-provisioner

The xlss-localpv-provisioner component is dedicated to the dynamic creation of local storage. When the provisioner Pod receives a storage creation request, it first creates a temporary helper Pod, which is scheduled onto the specified node to create the file directory that backs the local PV. Once this backing storage exists, the provisioner Pod removes the helper Pod, and the dynamic creation of a piece of local storage is complete.
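A minimal sketch of that helper-Pod trick follows, using client-go. The namespace, image, and directory layout are assumptions for illustration; a real provisioner would also wait for the helper Pod to finish before deleting it.

```go
// Sketch: pin a short-lived helper Pod to the target node so the directory backing
// the new local PV gets created there, then remove the Pod.
package provisioner

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createBackingDir(ctx context.Context, cs kubernetes.Interface, node, parentDir, pvName string) error {
	hostPathType := corev1.HostPathDirectoryOrCreate
	helper := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "helper-" + pvName, Namespace: "xlss-system"},
		Spec: corev1.PodSpec{
			NodeName:      node, // bypass scheduling: run exactly on the target node
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:         "mkdir",
				Image:        "busybox:1.36",
				Command:      []string{"mkdir", "-p", fmt.Sprintf("/data/%s", pvName)},
				VolumeMounts: []corev1.VolumeMount{{Name: "parent", MountPath: "/data"}},
			}},
			Volumes: []corev1.Volume{{
				Name: "parent",
				VolumeSource: corev1.VolumeSource{
					HostPath: &corev1.HostPathVolumeSource{Path: parentDir, Type: &hostPathType},
				},
			}},
		},
	}
	if _, err := cs.CoreV1().Pods(helper.Namespace).Create(ctx, helper, metav1.CreateOptions{}); err != nil {
		return err
	}
	// A real provisioner waits for the helper Pod to succeed before continuing.
	return cs.CoreV1().Pods(helper.Namespace).Delete(ctx, helper.Name, metav1.DeleteOptions{})
}
```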

9. XLSS automated disaster recovery workflow

The complete automated disaster recovery workflow goes through six phases:

  • Data backup: PV data is backed up at Pod granularity.
  • Node failure: a node in the cluster becomes abnormal (for example, the server is damaged), the Pods on it stop working, and the stateful service Pods end up stuck in the Terminating state.
  • Abnormal Pod handling: the Pods stuck in Terminating are cleaned up, either manually or with a tool, so that the stateful Pods get a chance to be recreated.
  • Intelligent scheduling: the stale affinity is ignored and the new Pods are scheduled onto healthy nodes.
  • Data recovery: the latest snapshot corresponding to each Pod is pulled and the data is restored.
  • Service recovery: the applications start up and resume serving traffic.

At this point a complete automated disaster recovery cycle is finished, and the system is back to the state it started from.

Large-scale storage middleware services

With the storage problem largely solved, how do we put it into practice? The answer is to build storage-backed middleware services.

1. Kafka Cluster In K8s

Take a Kafka cluster as an example: a customized Kafka Operator deploys the Kafka cluster and specifies XLSS as its storage service. This customized Operator + XLSS pattern is used to build storage middleware services.
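In practice, all the Operator needs to do on the storage side is request PVCs from the storage class served by xlss-localpv-provisioner. The sketch below assumes a class named xlss-localpv, which is an invented name.

```go
// Sketch: a PVC template pointing at an assumed XLSS-backed StorageClass.
package middleware

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func kafkaDataPVC(name string) *corev1.PersistentVolumeClaim {
	storageClass := "xlss-localpv" // assumed StorageClass backed by the XLSS provisioner
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.PersistentVolumeClaimSpec{
			StorageClassName: &storageClass,
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			// Field type is ResourceRequirements in older k8s.io/api releases;
			// newer releases use VolumeResourceRequirements.
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{corev1.ResourceStorage: resource.MustParse("100Gi")},
			},
		},
	}
}
```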

2. Stateful middleware service In K8s

The running state of stateful middleware services in K8s is shown in the figure above. These middleware service clusters are hosted by their corresponding Operators, and the underlying storage is adapted to different storage types according to business needs. As the middleware service clusters grew in scale, we built a PaaS control plane through which users manage the various middleware service clusters running in K8s. The control plane interacts directly with the apiserver: users add, delete, or modify CRD resources on the control plane, and the Operators adjust the state of the middleware service clusters according to the latest state of those CRD resources.

3. Example of a user applying for middleware services

This is an example of a user applying for a middleware service: the user applies for a service through the management console, fills in the relevant configuration information, and after the application is approved, the corresponding service can be created in the K8s cluster.

Deploy XLSS based on KubeSphere

If you want to use XLSS storage, how do you deploy it?

When XLSS is deployed for the first time, the local disks are planned and storage space is created for XLSS. The next step is to get the XLSS components running in the K8s cluster. For this we rely on KubeSphere's CI/CD pipeline: a custom pipeline of five steps turns the XLSS components from source code into containers running in K8s, with a high degree of automation.

The CI/CD pipeline is shown in the figure below:

Road Map

At present, New Oriental's stateful service containerization can be roughly divided into four stages.

Stage 1: “Pre-cloud Era”

This is the starting point of stateful service containerization and defines its goals. At this stage, stateful services are managed with a VM + PaaS combination. Main capabilities: resource management, GUI-based ("white-screen") operations and maintenance, simple scheduling policies, and runtime management.

Stage 2: “Going to the Cloud”

At this stage we start trying to move stateful services off VMs and onto the K8s platform. Stateful services are now managed with a K8s + Operator combination.

At this point, the runtime is hosted by K8s and the stateful services are taken over by Operators, bringing a significant degree of automation. Some shortcomings are also exposed: for example, the performance of remote storage is not good enough, and the availability of local storage is not guaranteed.

Stage 3: “The Road to Self-research”

This is mainly the practice stage of New Oriental's self-developed XLSS, described in the previous sections. The typical pattern at this stage is Scheduler + Logical Backup. It is basically what we want: local storage + dynamic provisioning + data availability guarantees. But nothing is ever perfect, so what are the flaws?

  • Data recovery time depends on the data volume: the more data, the longer the recovery. When a node fails, this lengthens the period during which the stateful service is unavailable.
  • PV data is not yet stored in isolation, so an application's use of storage cannot be restricted, which carries a certain risk.

Despite these flaws, XLSS works well for small-scale storage scenarios such as Redis and Kafka; for services with large data volumes, it is still weak.

Stage 4: “Striving for Excellence”

The typical pattern at this stage is Isolation + Physical Backup, and the focus is on addressing the flaws found in stage 3. The general approach is to use LVM to achieve storage isolation, and to use DRBD to add synchronous physical backup, so that application data is replicated in real time and the problem of recovery time growing with data volume is solved.

DRBD introduces a tradeoff in the replica count: more replicas increase storage resource usage, while fewer replicas reduce the number of nodes a stateful service Pod can drift to when a K8s cluster node fails. The choice ultimately has to be made according to the business scenario.
