Authors: Yang Ning (Lin Tong), Senior Security Expert, Alibaba Cloud Infrastructure Product Division; Liu Zixi (Mo Bai), Basic Security Expert, Ant Financial Security; Li Tingting (Hong Shan), Senior Security Expert, Ant Financial Security

Introduction

Zero-trust security was first proposed in 2010 by John Kindervag, then a principal analyst at Forrester. Zero-trust security re-examines the traditional perimeter security architecture and offers new ideas for how a security architecture should be built.

The core idea is that no person, device, or system, whether inside or outside the network, should be trusted by default; the basis of trust for access control must be rebuilt on authentication and authorization. Attributes such as IP address, host, geographic location, or network segment cannot serve as trusted credentials. Zero trust overturns the access control paradigm and leads the security architecture from “network centric” to “identity centric”; in essence it demands identity-centric access control.

Current implementations of the zero-trust concept include Google BeyondCorp, Google ALTS, the Azure Zero Trust Framework, and others. A zero-trust system on the cloud is still an emerging technology trend, and the same zero-trust model applies to Kubernetes as well. This article focuses on a technical analysis of the zero-trust security architecture under Kubernetes.

The traditional concept of zero trust and its current implementation

Microsoft Azure

Azure’s zero trust is relatively complete, covering end-to-end scenarios across cloud, on-premises, SaaS, and other applications from an architecture perspective. Here we analyze the relevant components:

  • User Identity: an Identity Provider (the component that creates, maintains, and manages user identities) authenticates users. During authentication, an account password or Multi-Factor Authentication (MFA) can be used; multi-factor authentication includes soft and hard tokens, SMS, biometric characteristics, and so on;
  • Device Identity: device information, including IP address, MAC address, installed software, operating system version, and patch status, is stored in a Device Inventory. In addition, each device has a corresponding identity to prove what it is, and the device's status and risk are used in the decision;
  • Security Policy Enforcement: after collecting the user's identity and status and the device's identity and status, the SPE can make a comprehensive policy decision. Threat Intelligence can also be combined to enhance the scope and readiness of the SPE's policy decisions. The resulting policies govern access to the subsequent Data, Apps, Infrastructure, and Network;
  • Data: policies for classifying, labeling, and encrypting data (documents);
  • Apps: adaptive access to corresponding SaaS applications and on-premises applications;
  • Infrastructure: includes IaaS, PaaS, containers, serverless, JIT (just-in-time access), and version control software such as Git;
  • Network: policy enforcement during network delivery and internal micro-segmentation.

In a more detailed illustration from Microsoft, users (employees, partners, customers, etc.) include Azure AD, ADFS, MSA, Google ID, and so on; devices (trusted, compliant devices) include Android, iOS, macOS, Windows, and Windows Defender ATP; clients (client apps and authentication methods) include browsers and client applications; and locations (physical and virtual addresses) include geolocation information, the corporate network, and so on. Using Microsoft machine learning (ML), a real-time evaluation engine, policies, and other inputs, the user, client, location, and device are evaluated together to grant continuous adaptive access to on-premises resources, cloud SaaS apps, and the Microsoft Cloud. The policies include allow and deny, restrict access, require MFA, force password reset, and block or lock illegal authentication. Azure has thus been able to connect on-premises, cloud, SaaS, and more to build a large and complete zero-trust system.

Google BeyondCorp

Google BeyondCorp is a network security solution for dealing with new network threats. In fact, BeyondCorp itself does not introduce many new technologies; rather, it applies the idea of continuous verification, eliminating the VPN and no longer dividing networks into internal and external. Before 2014, Google judged that treating only the Internet as dangerous and the intranet as safe was itself dangerous: once the network boundary is breached, an attacker can easily reach some of the enterprise's internal applications, and because companies assume "my inside is safe," they give internal applications a low security priority, which results in a large number of internal security problems. Today's enterprises increasingly use mobile and cloud technologies, making perimeter protection ever more difficult. So Google simply stopped distinguishing internal from external, defending both with the same security measures.

From the perspective of attack and defense, in the Google BeyondCorp model, accessing an internal Google application such as blackberry.corp.google.com redirects to login.corp.google.com/, also known as Google's Moma system. First you need to enter your account and password to log in; during the login process, a comprehensive judgment is made based on device information and user information. It then jumps to a login interface that requires a YubiKey. Every Google employee has a YubiKey and uses it for secondary verification. Google believes the value of the YubiKey is that it can essentially eliminate phishing attacks. Amazon's Midway Auth (midway-auth.amazon.com/login?next=…) takes a similar approach.

Kubernetes container zero trust model

Network zero trust under containers

Let's start by introducing Calico, the network zero-trust component for containers. Calico is an open source networking and network security solution for containers, virtual machines, and native host-based workloads. Calico supports a broad range of platforms, including Kubernetes, OpenShift, Docker EE, OpenStack, and bare-metal services. The greatest value of a zero-trust network is that it remains resilient even if an attacker compromises an application or part of the infrastructure by other means: the zero-trust architecture makes lateral movement difficult for attackers, and reconnaissance activity is easier to detect.

Calico + Istio is a popular combination for a zero-trust container network. Istio works at the Pod/workload layer and Calico at the Node layer:

                        Istio        Calico
Layer                   L3-L7        L3-L4
Implementation          User mode    Kernel mode
Policy execution point  Pod          Node

To highlight some technical details of the Calico and Istio components: Calico builds a layer-3 routable network using Felix, a daemon that runs on every Node. Felix is responsible for compiling routing and ACL rules and anything else needed on the Node host so that the resources running on that host have working network connectivity; in other words, it sets up the network and enforces routing and ACL policies on each Node. Fine-grained access control is enforced through iptables running on the Node. A default Deny policy can be set through Calico, and a least-privilege access policy can then be layered on through adaptive access control, building up the zero-trust system under containers. Dikastes/Envoy: optional Kubernetes sidecars that can protect workload-to-workload traffic with mutual TLS authentication and add related control policies.
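For illustration, here is a minimal sketch of this pattern in Calico policy form: a cluster-wide default deny, plus a narrowly scoped allow rule. The namespace, labels, and port are hypothetical, and a real deployment would normally exempt system namespaces from the default deny.

# Sketch only; selectors and names are made up.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny
spec:
  # Selects every workload; with no rules listed, all ingress and egress is denied.
  # In practice, kube-system and Calico's own components are usually excluded.
  selector: all()
  types:
  - Ingress
  - Egress
---
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: demo
spec:
  # Re-open only the minimum needed path: frontend pods may reach backend pods on TCP 8080.
  selector: app == 'backend'
  types:
  - Ingress
  ingress:
  - action: Allow
    protocol: TCP
    source:
      selector: app == 'frontend'
    destination:
      ports:
      - 8080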

Istio

Before moving on to Istio, let’s talk about some security requirements and risk analysis for microservices:

1. After a microservice is compromised, its traffic can be sniffed and a man-in-the-middle attack carried out; to address this risk, traffic needs to be encrypted;
2. To control access between microservices, mutual TLS and fine-grained access policies are needed;
3. Audit tools are needed to record who did what and when.

After analyzing the risks, let's explain how Istio implements a zero-trust architecture. First, and most obviously, the whole link is encrypted with mutual TLS (mTLS). Second, access between microservices is authenticated and authorized, and access that is granted is also audited. Istio separates the data plane from the control plane: the control plane delivers authorization policies and secure naming information to the Envoys via Pilot, while on the data plane microservices communicate with each other through Envoy. An Envoy is deployed alongside each microservice workload, and each Envoy proxy runs an authorization engine that authorizes requests at run time. When a request arrives at the proxy, the authorization engine evaluates the request context against the current authorization policy and returns ALLOW or DENY.
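As a hedged sketch of what this looks like in Istio configuration (the namespace, service account, and paths below are assumptions for illustration): a mesh-wide strict mTLS requirement, plus an authorization policy that only allows requests whose peer certificate identity belongs to a specific caller.

# Sketch only; names are hypothetical. Istio's root namespace is assumed to be istio-system.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  # Require mutual TLS for all workloads in the mesh.
  mtls:
    mode: STRICT
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: backend-allow-frontend
  namespace: demo
spec:
  selector:
    matchLabels:
      app: backend
  action: ALLOW
  rules:
  - from:
    - source:
        # Identity taken from the caller's mTLS certificate (service account 'frontend' in 'demo').
        principals: ["cluster.local/ns/demo/sa/frontend"]
    to:
    - operation:
        methods: ["GET"]
        paths: ["/api/*"]
# Any request not matched by an ALLOW rule on this workload is denied by Envoy's authorization engine.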

Zero Trust API security under microservices

42Crunch (42crunch.com/) extends API security from the edge of the enterprise to each individual microservice, protecting them with ultra-low-latency micro API firewalls that can be deployed at scale. The 42Crunch API firewall is deployed in a Kubernetes Pod in sidecar proxy mode with millisecond response times. This removes the need to write and maintain per-API security policies and implements a zero-trust security architecture, improving API security under microservices. 42Crunch's API security capabilities include:

  • Audit: runs 200+ security audit tests against the OpenAPI definition and produces a detailed security score, helping developers define and enforce API security;
  • Scan: scans live API endpoints for potential vulnerabilities;
  • Protect: protects the API by deploying a lightweight, low-latency micro API firewall alongside the application.

Best practices of Ant's zero-trust architecture

With the evolution of the Service Mesh architecture, Ant has begun to implement workload-level service authentication internally. Building a service authentication capability that fits Ant's workload architecture can be broken down into the following three sub-questions:

1. How to define a workload's identity and implement a common identity system;
2. How to implement the workload access authorization model;
3. How to choose the enforcement points for access control.

Workload identity definition & Authentication mode

Internally, Ant uses the identity format given in the SPIFFE project to describe a workload's identity, namely:

spiffe://<domain>/cluster/<cluster>/ns/<namespace>

However, during rollout we found that this identity format is not fine-grained enough and is strongly coupled to the Kubernetes namespace partitioning rules. Ant is large and has many scenarios, and the rules for dividing namespaces are not entirely consistent across scenarios. Therefore, we adjusted the format: we sorted out a set of essential attributes (such as application name and environment information) needed to identify a workload instance in each scenario, and carried these attributes in Pod Labels. The adjusted format is as follows:

spiffe://<domain>/cluster/<cluster>/<required_attr_1_name>/<required_attr_1_value>/<required_attr_2_name>/<required_attr_2_value>
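For illustration only (the domain, cluster name, and label keys below are hypothetical, not Ant's actual attribute names), a Pod carrying the required attributes in its Labels would map to an identity like the one in the comment:

apiVersion: v1
kind: Pod
metadata:
  name: appa-0
  labels:
    # Hypothetical required attributes carried in Pod Labels
    app: appA
    env: prod
  # Mapped identity (sketch):
  #   spiffe://example.org/cluster/prod-cluster/app/appA/env/prod
spec:
  containers:
  - name: app
    image: example/appa:1.0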

In conjunction with this identity format standard, we added a Validating Webhook component to the Kubernetes API Server to validate that the required attributes described above are carried in the Pod's Labels. If any required attribute is missing, the Pod cannot be created.
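A minimal sketch of such a webhook registration is below. The service name, namespace, and path are assumptions; the backing validation service (not shown) would reject any Pod whose Labels are missing a required attribute.

# Sketch only; names are hypothetical.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: workload-identity-label-check
webhooks:
- name: identity-labels.example.org
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail           # if required labels are missing, Pod creation is rejected
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  clientConfig:
    service:
      name: identity-label-validator    # hypothetical validating service
      namespace: security-system
      path: /validate
    caBundle: <base64-encoded CA>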

Once the workload identity definition is settled, what remains is to convert the identity into a verifiable format that can be passed along the service invocation links between workloads. To support different usage scenarios, the X.509 certificate and JWT formats were chosen.

For the Service Mesh architecture scenario, we store the identity information in the Subject field of the X.509 certificate to carry the workload's identity.

For other scenarios, we store the identity information in the Claims of a JWT, which the Secure Sidecar issues and verifies.
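For illustration (the claim names and values are hypothetical, not Ant's actual claim layout), a token payload issued by the Secure Sidecar might carry the identity roughly like this:

{
  "sub": "spiffe://example.org/cluster/prod-cluster/app/appA/env/prod",
  "iss": "secure-sidecar",
  "aud": "appB",
  "exp": 1700000000
}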

Authorization model

In the initial stage of the project, the RBAC model was used to describe the authorization policy for inter-workload service invocation; for example, a service of application A may be invoked only by application B. This authorization strategy worked well in most scenarios, but as the project progressed we found that it did not work well in some specific scenarios.

Consider a scenario in which an application A inside the production network is responsible for providing a centralized service for dynamic configuration that every application in the production network needs at run time. The service is defined as follows (application A's RPC service for fetching dynamic configuration):

syntax = "proto3";

// Request for fetching a dynamic configuration resource.
message FetchResourceRequest {
  // The appname of the invoker
  string appname = 1;
  // The ID of the resource
  string resource_id = 2;
}

// Response carrying the configuration data.
message FetchResourceResponse {
  string data = 1;
}

service DynamicResourceService {
  rpc FetchResource (FetchResourceRequest) returns (FetchResourceResponse) {}
}

In this scenario, if the RBAC model is still used, the access control policy for application A cannot be meaningfully described, because every application needs to access application A's service. But that leads to obvious security issues: a caller, application B, could use this service to obtain resources belonging to other applications. Therefore, we upgraded the RBAC model to an ABAC model to solve this problem. We use a DSL to describe the ABAC logic and integrate it into the Secure Sidecar.
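As an illustration of the upgraded check (this is not Ant's internal DSL; the syntax below is a made-up pseudo-policy), the rule combines the caller's authenticated identity attributes with the request content, so that application B can only fetch resources registered under its own appname:

# Hypothetical ABAC rule (pseudo-policy, not the actual DSL)
policy:
  service: DynamicResourceService/FetchResource
  rules:
  - effect: ALLOW
    # caller.app comes from the verified X.509 / JWT identity;
    # request.appname comes from the FetchResourceRequest message
    condition: caller.app == request.appname
  - effect: DENY     # default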

Access control execution point selection

In terms of enforcement point selection, considering that rolling out the Service Mesh architecture will take some time, we provide two different approaches, compatible with both the Service Mesh architecture and the current (non-mesh) scenario.

In the Service Mesh architecture scenario, the RBAC Filter and the Access Control Filter (ABAC) are integrated into the Mesh Sidecar.

In the current scenario, we provide a Java SDK, and the application needs to integrate the SDK to perform all authentication and authorization related logic. As in the Service Mesh architecture scenario, identity issuance, verification, and authorization all interact with the Secure Sidecar.

Conclusion

The core of zero trust is "Never Trust, Always Verify." Going forward, we will continue to deepen the practice of zero trust across Alibaba, giving different roles such as enterprise employees, applications, and machines their own identities, and pushing access control points down to every point of the cloud-native infrastructure to achieve global fine-grained control and create a new boundary for security protection. This article has given a brief overview, from the industry's best practices in implementing zero-trust systems to a Kubernetes-based approach to zero trust. It is only intended to spark discussion; we hope it triggers more conversation about zero-trust architecture under cloud native and that more excellent solutions and products appear in the industry.
