Author | Wang Siyu (Jiuzhu)    Source | Alibaba Cloud Native public account
Thanks to Kubernetes’ end-state orientation, cloud native architectures are naturally highly automated. However, end-state-oriented automation is a double-edged sword: it brings declarative deployment capabilities to applications, but it can also amplify the blast radius of a misoperation. It is therefore important to fully understand the problems that can threaten application safety in a cloud native environment, and to put protection, interception, rate limiting, and circuit breaking in place ahead of time to keep cloud native applications stable at runtime.
This article is compiled from a live share given on January 19 at the Alibaba Cloud developer community’s “Open Source Tuesday” by Wang Siyu (Jiuzhu), technical expert at Alibaba Cloud Container Service, author and founding member of OpenKruise, and contributor to the Kubernetes and OAM communities. It introduces the “crises” lurking around application safety and availability in cloud native environments, shares Alibaba’s experience in keeping cloud native applications stable at runtime, and explains in detail how these capabilities will be opened up to the community through OpenKruise.
Click to replay the full video: developer.aliyun.com/live/246065
Application Security “Crises” in the Cloud Native Environment
1. Alibaba Cloud native application deployment structure
The cloud native application deployment structure here is a maximally simplified abstraction of Alibaba’s cloud native environment, as shown below.
Let’s start with a couple of CRDs. The CloneSet CRD can be understood as a Deployment-like workload: it is the template through which an application deploys its Pods. Each business or application creates its own CloneSet, which in turn creates the Pods underneath it and manages their deployment and release.

In addition to CloneSet, a SidecarSet CRD is provided. What this CRD does is inject the sidecar containers defined in the SidecarSet during the business Pod’s creation phase. That is, in the CloneSet the business only needs to define the app container (the business container) in the Pod; during Pod creation, the SidecarSet determines which sidecar containers are injected alongside it.
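To make this concrete, here is a minimal sketch of the two CRs; all names and images are hypothetical, and the fields follow the apps.kruise.io/v1alpha1 API:

apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: app-demo                  # hypothetical application name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-demo
  template:
    metadata:
      labels:
        app: app-demo
    spec:
      containers:
      - name: app                 # the business only defines its app container here
        image: app-image:v1       # hypothetical image
---
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
  name: sidecar-demo
spec:
  selector:
    matchLabels:
      app: app-demo               # inject into Pods matching this selector at creation time
  containers:
  - name: sidecar
    image: sidecar-image:v1       # hypothetical sidecar image

When the CloneSet creates a Pod, the SidecarSet matches the Pod by its labels and appends the sidecar container, so the final Pod runs both the app container and the injected sidecar.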
2. OpenKruise: Alibaba’s application deployment base
The open source project OpenKruise is the base of Alibaba’s application deployment. OpenKruise provides a variety of workloads, including CloneSet, Advanced StatefulSet, SidecarSet, and Advanced DaemonSet.
- CloneSet: a tool for deploying stateless applications, and the most widely used workload inside Alibaba. Most of the pan-e-commerce businesses, including UC, Ele.me, and other e-commerce units, are deployed and released through CloneSet.
- Advanced StatefulSet: an enhanced version compatible with native StatefulSet, used for deploying stateful applications; it is currently primarily used for middleware deployment in cloud native environments.
- SidecarSet: a tool for sidecar lifecycle management in the Alibaba environment. Alibaba’s operations containers and internal Mesh containers are defined, deployed, and injected into business Pods through SidecarSet.
- Advanced DaemonSet: an enhanced version compatible with native DaemonSet. It deploys host-level daemons to all nodes, including the basic components that configure networking and storage for business containers.
After introducing the base environment, we now have a basic picture of the cloud native deployment structure. Let’s look at what application security risks exist under it.
3. Cloud native application security crisis
1) Workload cascading deletion
Workload cascading deletion is a problem not only for Kruise’s CloneSet, but also for Deployment and native StatefulSet. It means that after a workload is deleted (assuming the default deletion semantics rather than orphan deletion), all the Pods underneath it are deleted as well, so a mistaken deletion is dangerous. If a Deployment is deleted by mistake, all the Pods under it are cascade-deleted and the entire application becomes unavailable; if several workloads are deleted, many businesses may go down at once. This is an availability risk. As shown below:
2) Namespace cascading deletion
If a Namespace is deleted, all the resources under it, including Deployment and CloneSet workloads, Pods, Services, and so on, are deleted with it. This makes a mistaken deletion even riskier.
3) CRD cascading deletion
If you deploy with Helm, you may have run into a similar situation: if your Helm chart contains CRDs defined in its templates, then on helm uninstall those CRDs are cascade-deleted along with the release. And if someone manually deletes a CRD by mistake, all the CRs under that CRD are deleted with it. This is a high risk.
If the CRD is a workload-level CRD such as CloneSet, deleting it causes all CloneSet CRs under it to be deleted, and with them all the business Pods. In other words, deleting a workload deletes the Pods under that workload, and deleting a Namespace deletes the Pods under that Namespace; but in a scenario like Alibaba’s, if someone deletes the CloneSet CRD or some other key CRD, it is very likely that the Pods across every Namespace in the whole cluster are cascade-deleted and all applications become unavailable. This is the gravest risk to application availability in cloud native environments. As shown below:
As we can see from the above, the benefit of the cloud native architecture is its end-state orientation: we define an end state, and the whole Kubernetes cluster converges toward it. But if a misoperation defines a wrong end state, Kubernetes will just as faithfully converge toward that wrong end state and produce the wrong result, hurting the availability of the entire application. That is why we say end-state orientation is a double-edged sword.
4) Concurrent Pod updates/evictions/deletions
Beyond mistaken deletions, there are more availability risks. As shown in the figure below, assume CloneSet A on the left deploys two Pods, each of which has the corresponding sidecar container injected by a SidecarSet. If we release the application through the CloneSet with maxUnavailable set to 50%, the two Pods are upgraded one at a time: the second starts only after the first completes. By default, this release strategy is fine.
However, a Pod can have multiple owners. Suppose the CloneSet starts an in-place upgrade of one Pod while, at the same time, the SidecarSet performs a sidecar upgrade on the other Pod. Now both Pods of the application are being upgraded at once. Since maxUnavailable is defined as 50% in the CloneSet, from its point of view it is entitled to pick one of the two Pods and upgrade it; it cannot perceive what other controllers, or even human operators, are doing to the other Pods. It lacks a global perspective. Each controller believes the Pods it is upgrading conform to its own upgrade and maximum-unavailability strategy, yet when several controllers act at the same time, 100% of the application may become unavailable.
As shown on the right, CloneSet C has three Pods. Suppose it upgrades just one of them by deleting the old-version Pod and creating a new one, and during that window one of the other two Pods happens to be evicted by the kubelet or by the node lifecycle controller in kube-controller-manager; the maximum-unavailable release policy defined in the workload is then already exceeded. Along the way, some Pods may also be deleted manually or by other controllers. All of these possibilities mean the number of unavailable Pods under a workload can easily exceed what the workload’s own release policy allows.
That is, if maxUnavailable is defined as 25% in a Deployment, the Deployment only guarantees that, from its own perspective, at most 25% of the Pods are being released at any moment. It does not guarantee that the other 75% are fully available: those Pods may be evicted by the kubelet, deleted manually, hot-upgraded by an external SidecarSet, and so on, so the Deployment’s actual unavailability can reach 50% or even higher, affecting the whole application.
Security protection practices for cloud native applications
Given the crises above, what measures can we take to keep applications safe and available in the cloud native environment? Here is some hands-on experience.
1. Protection against cascading deletion
Cascading deletion endangers application availability at every level: CRD, Namespace, and workload. Anti-cascading-deletion protection is therefore defined through labels on these resources, covering CRDs, Namespaces, and workloads including native Deployment.

The following are the labels for preventing cascading deletion at each level:
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  labels:
    policy.kruise.io/disable-cascading-deletion: "true"   # label values must be strings, hence the quotes
---
apiVersion: v1
kind: Namespace
metadata:
  labels:
    policy.kruise.io/disable-cascading-deletion: "true"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    policy.kruise.io/disable-cascading-deletion: "true"
---
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  labels:
    policy.kruise.io/disable-cascading-deletion: "true"
After the anti-cascading-deletion label is added to a CRD, Namespace, or workload, Kruise validates every deletion request against it. When a user deletes a CRD, Kruise checks whether there are still existing CRs under that CRD; if there are, it rejects the deletion.

Similarly, when a Namespace is deleted, Kruise checks whether there are still running Pods under the Namespace; if there are, the user is forbidden from deleting the Namespace directly.

The workload logic is simpler. For Deployment, CloneSet, and SidecarSet, when a workload carrying the anti-cascading-deletion label is deleted, Kruise checks whether its replicas is 0; if replicas is greater than 0, the deletion is rejected. In other words, for an existing Deployment with replicas greater than 0 and the anti-cascading-deletion label, Kruise forbids users from deleting it directly.
If you really do need to delete such a Deployment, there are two ways (see the sketch after this list):
- First, scale replicas down to 0; the Pods underneath are then deleted, after which deleting the Deployment is allowed.
- Second, remove the anti-cascading-deletion label from the Deployment.
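A minimal sketch of the two paths, assuming a protected Deployment with the hypothetical name deploy-foo:

# Option 1: scale replicas down to 0; once the Pods are gone,
# the deletion request passes validation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-foo
  labels:
    policy.kruise.io/disable-cascading-deletion: "true"
spec:
  replicas: 0
# Option 2: remove the protection label before deleting, e.g. with
#   kubectl label deployment deploy-foo policy.kruise.io/disable-cascading-deletion-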
That concludes the introduction to anti-cascading deletion. You should treat it as a baseline safety strategy, because cascading deletion is one of the most dangerous end-state capabilities in Kubernetes.
2. Protection practice - Pod deletion flow control & circuit breaking
Flow control and circuit breaking for Pod deletion address the following situation: when users or controllers delete or evict Pods through Kubernetes, a misoperation or a logic bug can trigger large-scale Pod deletion across the whole cluster. For this, a Pod deletion flow-control policy is provided as a CRD, with which the user can define the maximum number of Pods allowed to be deleted cluster-wide within different time windows.
apiVersion: policy.kruise.io/v1alpha1
kind: PodDeletionFlowControl
metadata:
  # ...
spec:
  limitRules:
  - interval: 10m
    limit: 100
  - interval: 1h
    limit: 500
  - interval: 24h
    limit: 5000
  whiteListSelector:
    matchExpressions:
    - key: xxx
      operator: In
      values:
      - foo
As in the example above, at most 100 Pods may be deleted within 10 minutes, at most 500 within 1 hour, and at most 5,000 within 24 hours. You can also define a whitelist: for example, some test applications are redeployed frequently every day, and their frequent deletions would eat into the flow-control quota, so they can be exempted through the whitelist.
Beyond the whitelist, the remaining, say, 90 percent of regular or core applications are protected by the deletion flow control: once a large-scale mistaken deletion occurs, it is stopped by the flow-control and circuit-breaking mechanism. It is also advisable to wire alarms and monitoring to the protection threshold, so that cluster administrators quickly notice an online circuit-breaking event and can judge whether it was triggered by normal operations or by an anomaly.
If the deletion requests in that window turn out to be legitimate, the administrator can raise the corresponding limits. If they really are mistaken deletions, then after the interception one can trace the source of the requests, cut them off at the source in time, and reject those requests.
3. Protection practice - application-level unavailable-count protection
For application-level protection of the number of unavailable Pods, native Kubernetes provides the PodDisruptionBudget (PDB) policy. However, a PDB can only intercept Pod eviction operations; that is all it was designed for.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: xxx
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: xxx
In the example above, suppose there are five Pods and minAvailable=2 guarantees that at least two remain available. When three Pods are already unavailable and only two are left, the PDB protects availability by rejecting any eviction aimed at the remaining two. But if those remaining two Pods are deleted, upgraded, or made unavailable in some other way, the PDB cannot block it. In particular, Pod delete requests, which are far more common than evictions, cannot be blocked by a PDB.
To address this, Alibaba built PodUnavailableBudget (PUB) interception. Almost every operation that can make a Pod unavailable falls within PUB’s protection scope: eviction requests, Pod delete requests, in-place application upgrades, sidecar in-place upgrades, container restarts, and so on.
Take the following example:
apiVersion: policy.kruise.io/v1alpha1
kind: PodUnavailableBudget
spec:
  #selector:
  #  app: xxx
  targetRef:
    apiVersion: apps.kruise.io
    kind: CloneSet
    name: app-xxx
  maxUnavailable: 25%
  # minAvailable: 15
status:
  deletedPods:
    pod-uid-xxx: "116894821"
  unavailablePods:
    pod-name-xxx: "116893007"
  unavailableAllowed: 2
  currentAvailable: 17
  desiredAvailable: 15
  totalReplicas: 20
A PUB can either use a selector, as a native PDB does, or be associated directly with a workload through targetRef, in which case the protection scope is all the Pods under that workload. You can define either the maximum unavailable number or the minimum available number.
Suppose there are 20 Pods under the CloneSet and maxUnavailable: 25% is defined; the PUB then guarantees that at least 15 Pods stay available, i.e. at most 5 of the 20 may be unavailable. Returning to the example from the “crisis” section: whether those 20 Pods are being released by the CloneSet, evicted by the kubelet, or deleted manually, once the number of unavailable Pods reaches 5, any further eviction or deletion of the remaining 15 Pods is intercepted by the PUB. This strategy fully guarantees the application’s availability during deployment. PUB protects a far wider range of operations than PDB, including unexpected delete and upgrade requests encountered in practice, and thus safeguards the stability and availability of the application at runtime.
4. Protection practice - automatic PUB/PDB generation
For real application developers and O&M personnel, the Deployment is usually generated by the platform from a workload template: the business side only fills in things such as the service version, environment variables, ports, and Services. It is hard to force every business team to also write a separate PUB or PDB protection CR when defining an application. So how do we protect every application automatically?
Within Alibaba, we provide the ability to automatically generate a PUB/PDB for each workload. For example, when a user creates a new Deployment, a controller automatically generates a matching PUB for it. First, this auto-generation supports both native Deployment/StatefulSet and Kruise’s CloneSet/Advanced StatefulSet/UnitedDeployment. Second, by default the PUB is derived from the maxUnavailable in the workload’s rollout strategy. Third, annotations can define a separate protection policy. The declaration looks like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy-foo
  annotations:
    policy.kruise.io/generate-pub: "true"
    policy.kruise.io/generate-pub-maxUnavailable: "20%"
    # policy.kruise.io/generate-pub-minAvailable: "80%"
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  # ...
---
# auto generated:
apiVersion: policy.kruise.io/v1alpha1
kind: PodUnavailableBudget
spec:
  targetRef:
    apiVersion: apps
    kind: Deployment
    name: deploy-foo
  maxUnavailable: 20%
The maxUnavailable filled into the auto-generated PUB/PDB comes from the user’s annotations: policy.kruise.io/generate-pub: "true" turns generation on, and policy.kruise.io/generate-pub-maxUnavailable: "20%" specifies how much of the application is allowed to be unavailable. This is the user-specified policy.
If the user does not specify one, the PUB is generated from the maxUnavailable in the workload’s release strategy, i.e. the number allowed to be unavailable during a release is also used as the maximum unavailable number at runtime. This way, protection policies can also be defined independently per application.
New territory for OpenKruise
1. Introduction to OpenKruise
Finally, let me introduce how the capabilities above will be opened up in OpenKruise’s new territory, and broaden the picture of what OpenKruise is. OpenKruise is a Kubernetes extended application-workload project open-sourced by Alibaba Cloud. In essence, it is an engine of automation capabilities built around cloud native applications on Kubernetes, and it is also the deployment base used across the Alibaba economy on the cloud.
OpenKruise is not positioned as a complete platform, but as an extension suite for Kubernetes. Deployed as an add-on, it provides a series of automation capabilities for deploying applications in Kubernetes, plus the protection and availability-safeguarding capabilities described above, all centered on cloud native applications. These extended and enhanced capabilities are not available in native Kubernetes, yet are urgently needed; they are the general-purpose capabilities Alibaba has accumulated during its gradual evolution toward cloud native.
Currently, Kruise provides the following workload controllers:
- CloneSet: provides more efficient and controllable application management and deployment capabilities, supporting graceful in-place upgrade, targeted deletion, configurable release order, parallel/grayscale release, and other rich strategies that satisfy increasingly diverse application scenarios.
- Advanced StatefulSet: an enhanced version built on top of native StatefulSet. Its default behavior is exactly the same as native, and it adds in-place upgrade, parallel release (maxUnavailable), release pausing, and more; a minimal sketch follows this list.
- SidecarSet: manages sidecar containers in a unified way and injects specified sidecar containers into Pods that match its selector.
- UnitedDeployment: deploys an application to multiple availability zones through multiple subset workloads.
- BroadcastJob: configures a job to run a Pod task on all eligible nodes in the cluster.
- Advanced DaemonSet: an enhanced version based on native DaemonSet. Its default behavior is the same as native, and it adds grayscale batching, node label selection, pausing, hot upgrade, and other release strategies.
- AdvancedCronJob: an extended CronJob controller; its template currently supports Job and BroadcastJob.
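To give a feel for these enhanced strategies, here is a minimal Advanced StatefulSet sketch; the name and image are hypothetical, and the fields follow the Kruise v1alpha1 API of the time. podUpdatePolicy: InPlaceIfPossible tells the controller to upgrade a Pod in place whenever the change allows it (e.g. an image-only change), instead of deleting and recreating the Pod:

apiVersion: apps.kruise.io/v1alpha1
kind: StatefulSet                        # Advanced StatefulSet keeps the native kind name
metadata:
  name: sts-demo                         # hypothetical name
spec:
  replicas: 3
  serviceName: sts-demo
  podManagementPolicy: Parallel          # required for maxUnavailable to take effect
  selector:
    matchLabels:
      app: sts-demo
  template:
    metadata:
      labels:
        app: sts-demo
    spec:
      containers:
      - name: app
        image: app-image:v1              # hypothetical image
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      podUpdatePolicy: InPlaceIfPossible # in-place upgrade when only the image changes
      maxUnavailable: 1                  # parallel release: at most 1 Pod unavailable
      # paused: true                     # uncomment to pause the release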
2. Capability gaps in native workloads
Why build CloneSet and Advanced StatefulSet on top of Deployment and StatefulSet? Because the native workloads have many capability gaps. As you can see, since roughly Kubernetes 1.10, the Pod’s fields have kept being enriched, with support for more Pod capabilities and more policies; but at the workload level, neither Deployment nor StatefulSet has shown any tendency to change. The community’s thinking behind this is that requirements at the application deployment and release level vary widely across companies and business scenarios.
The Deployment that native Kubernetes provides is aimed at the most general and basic scenarios, so it cannot satisfy every business case. Instead, the community encourages users with higher requirements and larger, more complex scenarios to extend through CRDs, building more powerful workloads that meet the needs of their own business scenarios.
3. Comparison between OpenKruise and native capabilities
(Figure legend: orange: to be open sourced; green: already open sourced)
Therefore, for these scenarios, Kruise has built relatively complete deployment capabilities for both stateless and stateful applications. The table in the figure above compares the workloads provided by Kruise with native Deployment, StatefulSet, and DaemonSet.
4. OpenKruise 2021 plan
As shown in the figure above, OpenKruise is a cloud native application automation engine. Its current capabilities center on application deployment, but it is not limited to that field.
1) Risk prevention and control
In the plan for the first half of 2021, we will bring the risk-prevention-and-control strategies for cloud native applications described above to the community through OpenKruise, including CRD deletion protection, cascading deletion protection, global Pod deletion flow control, Pod deletion/eviction/in-place-upgrade protection, and automatic PDB/PUB generation for workloads.
2) Kruise-daemon
In addition, OpenKruise has so far been deployed only as a central controller. In an upcoming version, a kruise-daemon will be provided, deployed on every node through a DaemonSet, to help users with strategies such as image pre-warming, release acceleration, container restart, and single-node scheduling optimization.
3) ControllerMesh
ControllerMesh is a capability OpenKruise will provide to manage the runtime of other controllers in a user’s cluster. It uses traffic control to solve the problems brought by the traditional single-active controller operating model.
Finally, on the community side of the OpenKruise project: on November 11, 2020, the CNCF Technical Oversight Committee voted unanimously to accept OpenKruise into the CNCF Sandbox, and the whole process received positive feedback from CNCF. This indicates that OpenKruise fits well with the ideas CNCF advocates, and that projects like OpenKruise, which generalize independent workload capabilities for more complex, larger-scale scenarios, are encouraged.
Many companies are already using OpenKruise’s capabilities, for example:
- Based on its needs for in-place upgrade and grayscale release, Ctrip uses CloneSet and Advanced StatefulSet in production to manage its stateless and stateful application services respectively; the number of Kruise workloads in a single cluster has reached the ten-thousand level.
- OPPO not only uses OpenKruise at large scale, but has further strengthened the in-place upgrade capability downstream in its customized Kubernetes; in-place updates cover 87% of the upgrade and deployment needs of the back-end services of multiple business lines.
- Other domestic users include Suning, Douyu TV, Youzan, Bixin, Boss Zhipin, Shentong, Xiaohongshu, VIPKID, Master Education, Hangyin Consumer Finance, Wanyi Technology, DuoduoDMall, Zujiang Technology, Enjoy Wisdom, Aijia Life, Yonghui Technology Center, Genshuixue, and more; international users include Lyft, Bringg, Arkane Systems, and others.
- Maintainers: five members, from Alibaba, Tencent, and Lyft.
- 51 contributors; domestic: Aliyun, Ant Group, Ctrip, Tencent, Pinduoduo, and more; international: Microsoft, Lyft, Spectro Cloud, Discord, and more.
- 2000+ GitHub stars, 300+ forks.
If you are interested in the OpenKruise project and have topics you would like to discuss, you are welcome to visit the OpenKruise website and GitHub.