Content source: yq.aliyun.com/articles/46…
Read the word count: 3994 | 10 minutes to read
Abstract
This content is shared by Li Xuefeng, a senior technical expert from Alibaba computing platform, with the theme of “Multi-tenant Isolation Practice of Financial Level Big Data Platform”. In the sharing, Li Xuefeng first introduced the problems of isolation based on traditional IaaS single-tenant architecture. He then focuses on the multi-tenant architecture at the MaxCompute PaaS level and MaxCompute’s specific practices in security isolation.
IaaS single-tenant Big data product architecture
The iaAS-based single-tenant big data product architecture is shown in the figure above, and the underlying architecture is usually implemented by HDFS2. Build resource management and control platforms such as Hadoop Yarn or MESOS based on HDFS2. Then implement specific computing models, such as MR, Hive, HBASE, and Spark. In such an ecosystem, IaaS is usually used as the same tenant. When users have new requirements, IaaS applies for clusters (virtual machines), and corresponding open source products are deployed on these clusters. From the perspective of isolation, this ecology faces the following problems:
First, the IaaS single-tenant big data product architecture has certain logical problems in practical use. When analyzing data, you need to understand the logic of each product. For example, you need to understand the logic of Hive when running SQL and Learn Spark when using Spark. When the number of products used is small, the use cost can be effectively controlled; When multiple products are needed to assist each other, the cost of learning increases exponentially. In addition, the logical models of two different open source products cannot be identified with each other, and the logical problems are more prominent when authentication and other problems are encountered.
Second, each open source product has its own priority definition at the runtime level. When using the same open source product, the priority of the task will be run according to the priority system of the open source product itself. The task with higher priority will get more resources than the task with lower priority, and the running time will be better guaranteed. When multiple open source products are used at the same time, IaaS single-tenant big data product architectures cannot achieve global optimization of operation priority.
Finally, these open source products often provide user-defined logic, such as MR or Hive provided UDFs. When user – defined code runs in big data products, it may cause security risks. For example, Hadoop Yarn uses the simple Linux Container mechanism for isolation when running user-defined code. When using this mechanism to run isolation, the user’s code logic and Hadoop’s own process run under the same kernel. That is to say, if the attack program contained in this part of user code logic can affect the machine kernel, the process of big data products running under the same kernel will also be affected. Typically, a Job in a big data product will run on most or all machines in the cluster at the same time, depending on the size of the data shard. In this case, the security risk is magnified to the entire cluster. At one extreme, a successful attack on a machine using a vulnerability in the kernel can bring down the entire computing cluster if the submitted Job fragments are large enough.
After recognizing the above issues, MaxCompute provides multi-tenant capabilities at the PaaS level by developing its own overall system architecture.
MaxCompute PaaS multi-tenant architecture
Figure 1 shows the MaxCompute PaaS multi-tenant architecture. As can be seen from the figure, MaxCompute runs on feitian operating system and relies on Feitian Fuxi module to provide unified resource control. Rely on apsaras Pangu module to provide unified storage; Rely on the Feitian Nuwa module to provide consistent services. MaxCompute provides multiple computing modes for the upper layer by providing the same computing engine, including SQL, MR, graph computing, PAI, and quasi-real-time computing.
At present, this computing engine has provided corresponding computing power for financial users on the public cloud.
In order to solve the MaxCompute multi-tenant problem, there are three main perspectives:
One is logical isolation. From the perspective of tenants, each tenant has its own independent logical model, its own independent resources and a unified authorization model based on the same logical model.
Second, resource isolation. For tasks of different tenants, the MaxCompute runtime implements unified and globally optimal task scheduling and resource isolation capabilities.
The third is to run the isolation mechanism. Currently, MaxCompute provides functionality for user-defined logic (such as the Python UDF), which provides a complete run isolation mechanism for user-defined logic running on MaxCompute.
Let’s take a look at the three isolation mechanisms provided by MaxCompute.
MaxCompute logical isolation
Currently, a unified tenant system can be provided at the logical level for the same MaxCompute instance, regardless of how many physical clusters it runs on. For this tenant system, the data resource view and permission management model of the same tenant are unique and bound to the tenant model. In practice, the tenant of MaxCompute corresponds to the Project of MaxCompute, which contains the tenant’s resources, attributes, and permissions.
As shown in the figure above, Project consists of Properties, Subject and Object. Properties include Quota, Owner, Payment Account, Region and other information. All authorization access within a Project is subject to the User ID. Based on these topics, MaxCompute provides a role model to implement authorization aggregation. The resources to be operated by the computing models mentioned above (MR, Hive, etc.) eventually reside in a certain entity in the Project, for example, SQL model corresponds to Table entity, UDF corresponds to Function entity.
Based on the preceding logical model, MaxCompute provides a complete set of authentication and authorization mechanisms for controlling permissions. First of all, all permissions come from Project Ower. As the Owner of the Project, it has all permissions within the Project. When any user uses the Project for calculation, the Project Owner shall first authorize it (specifically by using GTANT statement). When the User accesses Project, it will read and write tables, create functions, add and delete resources as User ID. Before these operations are performed, the system checks whether the current User ID has the corresponding permission through unified ACL logic.
The figure above shows the operations supported by MaxCompute for different types of objects. For details, see the official documents.
MaxCompute Resource isolation
The computing engine of MaxCompute relies on the Flying OS to provide resource running and isolation capabilities.
As shown in the figure above, when different tasks job-0 and Job-n are submitted to Fetian Fuxi module, the Fuxi scheduling system will assign the running level of tasks according to the running level of different users, which corresponds to the attributes in the Project mentioned above. The Fuxi module transforms job-0 and Job-N tasks into Fuxi tasks. It is then scheduled to nodes in the computing cluster. Finally, on the computing cluster, the same server will run tasks of multiple tenants at the same time, and these tasks are run in the form of Fuxi Worker.
For one of the machines, when the Fuxi engine on the machine receives the Worker Plan, it will configure the Cgroup parameters on the current machine according to the quota value of the user corresponding to the Worker. In this way, jobs submitted by different users run on physical machines with different Cgroup configuration parameters. Currently, MaxCompute relies on the Cgroup capability provided by the Linux Kernel to plan the CPU and Memory resources of a specific process on the physical machine.
MaxCompute runs isolation
Finally, we examine the run isolation mechanism provided by MaxCompute to safely run user-defined logic. When Fuxi runs user-defined code logic, it pulls an isolated environment and runs the user’s code in an isolated process. For Fuxi, this process is no different from other processes, but its running environment is in an isolated system. In other words, for Fuxi, this process is a normal process, but for untrusted Code process is isolated.
Operational isolation can be divided into process isolation, device isolation and network isolation.
Process isolation
In terms of process isolation, for a single process, running a process with untrusted code (which may include malicious code) can cause damage to the computing platform. To address this problem, MaxCompute provides a multi-layer isolation nesting scheme to circumvent this potential security risk. Internally, MaxCompute provides language-level sandboxes, including the Java sandbox and the Python Sandbox. This language-level sandbox provides the innermost layer of isolation for user code. For example, the Java UDF can currently restrict the loading of specific classes. Python UDFs can do function-level restrictions; In addition, MaxCompute provides process isolation, which relies on the current Linux Kernel mechanism to implement process isolation, including namespace, cgroup, secomp-bPF, etc. On the outermost side, MaxCompute implements a layer of lightweight virtualization by deeply customizing the Linux Kernel and a minimal Hypervisor to provide very lightweight virtual machines (set up in a few hundred milliseconds). As a result, untrusted code ends up running on the physical machine as a hypervisor; That is, to Fuxi, it sees only hypervisor processes, but to untrusted Code, it sees an isolated environment.
Equipment isolation
In addition, MaxCompute also provides hardware acceleration capabilities for user-defined code, such as PAI support for direct GPU access. Currently, MaxCompute passes the GPU card through the VM using PCIE passthrough, allowing the guest process to access the GPU through the PCIE bus and GPU driver in the Guest kernel.
In this way, the VM accesses the GPU through the PCIE bus, and the performance is similar to that of the GPU on a physical machine. On the other hand, you do not need to install the GPU driver on a physical machine, avoiding the impact of the GPU driver on platform stability and reliability.
Network isolation
On some products, MaxComputer provides network isolation for user code logic. A layer of virtual network is implemented between VMS pulled up by Fu Xi. These VMS can communicate directly over a virtual network, which also provides good compatibility for running some open source code inside the VM. At the same time, as can be seen from the figure above, the user-defined code logic does not directly access the physical network; The Tursted code pulled by Fuxi, including the code on the MaxCompute framework, communicates through the physical network, which ensures the low delay in the communication of the MaxCompute framework.