The Linux Foundation has announced the formation of the Confidential Computing Consortium, whose members include some of the world’s largest companies. As concerns about data security and privacy grow, confidential computing has emerged as a viable solution and has become a focus of attention for Internet giants. Ant Financial has long followed this technology and, on top of confidential computing, has built SOFAEnclave, its new generation of trusted programming middleware, to safeguard its financial business. Confidential computing is part of Ant Financial’s secure computing portfolio and an important piece of financial-grade cloud native technology. Ant Financial believes that confidential computing will, like HTTPS, become a standard part of cloud computing in the future.
Authors | Yan Shoumeng (闫守孟), Xiao Junxian, Tian Hongliang
Introduction
At its core, Internet finance is the processing of large amounts of sensitive data and the accumulation of the critical business intelligence derived from it. In recent years, new forms of business have expanded the scope of data processing from a single party’s data to multi-party data involving partners.
At the same time, from GDPR to HIPAA, data privacy regulation keeps broadening in scope and tightening in strength. Protecting financial data and key business intelligence is therefore not only the foundation of Internet financial business and of its innovation and growth, but also a key factor in compliance.
Confidential computing is an innovative data isolation and encrypted processing technology that has developed rapidly in recent years. Its Trusted Computing Base (TCB) contains only the application itself and the underlying hardware, so sensitive data and code remain secure even if privileged software such as the OS kernel, the hypervisor, or even the BIOS has been compromised or is outright malicious.
Drawing on its own practice, Ant Financial has built financial-grade confidential computing middleware on top of the underlying confidential computing technology. The middleware ensures the confidentiality and integrity of the data and code of financial applications and provides key businesses with an easy-to-use, secure, and clustered computing environment.
This article introduces the technical background of confidential computing, the key problems it faces, Ant Financial’s technical breakthroughs, and typical application scenarios.
The technical background of confidential computing
With the rapid development of cloud computing, more and more critical services and high-value data are migrated to the cloud. Cloud security has therefore become a focus of academic and industrial attention.
One of the most important advances in cloud security in recent years is called Confidential Computing. Confidential computing fills a gap in current cloud security — the encryption of data-in-use. It used to be common practice to encrypt data in storage (such as hard disk) and in transit (such as the network) and to decrypt it in use (such as memory) for processing. Confidential computing protects the confidentiality and integrity of data in use.
A number of cloud computing giants are pushing the technology in tandem: Microsoft announced in July 2017 that it was accepting early trial applications for Azure Confidential Computing; IBM announced a preview of IBM Cloud Data Guard in December 2017; Google also launched a confidential computing framework called Asylo in May 2018.
So how exactly does confidential computing work?
In fact, all of these cloud giants rely on a technology called the Trusted Execution Environment (TEE) to implement confidential computing.
As the name suggests, TEE provides a secure computing environment isolated from untrusted environments, and it is this isolation and trust verification mechanism that makes confidential computing possible.
A TEE is generally implemented directly in hardware, as with Intel SGX, AMD SEV, ARM TrustZone, and RISC-V Keystone. A TEE can also be built on virtualization technology, as with Microsoft’s VSM and Intel’s Trusty for iKGT & ACRN, but such TEEs cannot match the security of a hardware TEE.
Among them, Intel Software Guard Extensions (SGX) is the most advanced TEE implementation in commercial CPUs. It provides a new set of instructions that let users define a secure memory region called an Enclave. The CPU keeps the Enclave isolated from the outside world, protecting the confidentiality, integrity, and verifiability of the code and data inside. Unlike earlier TEE implementations such as ARM TrustZone, where the whole system shares a single TEE, each SGX application can have its own TEE or even create several TEEs. SGX also removes the need to apply to the device manufacturer to have a trusted application (TA) installed in the TEE. Because SGX is so advanced, the cloud computing community has even come to use the word Enclave to refer to TEEs in general.
The security goals of a typical Enclave can be summed up as CIA: Confidentiality, Integrity, and Authenticity. Achieving them imposes the following basic requirements:
- Enclave memory protection
Enclave memory is accessible only to the Enclave’s own code. Through memory isolation and encryption, the CPU protects this secure memory against software attacks and hardware snooping. SGX additionally prevents physical tampering with Enclave memory by means of an integrity tree in the memory controller.
- Enclave trust verification
The CPU supports measurement of data and code in the Enclave, as well as local or remote validation of the Enclave’s legitimacy. With measurement and verification, identities can be authenticated between local enclaves and between clients and remote enclaves to establish secure communication channels.
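The measure-then-verify idea can be illustrated with a small sketch. This is a toy model under stated assumptions, not the SGX measurement or attestation format: the function names (`measure_enclave`, `make_report`, `verify_report`), the page layout, and the use of an HMAC keyed by a shared `platform_key` are all hypothetical stand-ins for what the hardware and its attestation infrastructure do.

```python
import hashlib
import hmac


def measure_enclave(pages):
    """Toy measurement: hash the initial code/data pages in load order."""
    digest = hashlib.sha256()
    for index, page in enumerate(pages):
        # Bind each page to its position and length so reordering changes the result.
        digest.update(index.to_bytes(8, "little"))
        digest.update(len(page).to_bytes(8, "little"))
        digest.update(page)
    return digest.hexdigest()


def make_report(measurement, user_data, platform_key):
    """Hypothetical report: a MAC over the measurement and caller-supplied data."""
    message = measurement.encode("ascii") + b"|" + user_data
    mac = hmac.new(platform_key, message, hashlib.sha256).hexdigest()
    return {"measurement": measurement, "user_data": user_data, "mac": mac}


def verify_report(report, platform_key, expected_measurement):
    """Accept the report only if the MAC is valid and the measurement is the expected one."""
    message = report["measurement"].encode("ascii") + b"|" + report["user_data"]
    expected_mac = hmac.new(platform_key, message, hashlib.sha256).hexdigest()
    mac_ok = hmac.compare_digest(report["mac"], expected_mac)
    identity_ok = hmac.compare_digest(report["measurement"], expected_measurement)
    return mac_ok and identity_ok
```

A verifier that knows the expected measurement of a trusted build would call `verify_report` before sending any secret to the enclave. In this sketch, changing a single byte of any page yields a different measurement, so the check fails.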
How do I develop applications protected by Enclave?
In the case of SGX, one way is to leverage the Intel SGX SDK. As shown in the figure below, the SGX SDK-based application has two parts: the untrusted component outside the Enclave (yellow on the left) and the trusted component inside the Enclave (green on the right). The two sides can communicate through cross-enclave function calls: untrusted components can call functions defined in trusted components through ECall. Conversely, trusted components can also call functions defined in untrusted components through OCall.
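The split into trusted and untrusted components can be pictured with a toy model. The sketch below is plain Python and enforces nothing in hardware. The class `ToyEnclave`, its `deposit` and `is_above` ECalls, and the `log` OCall are hypothetical names invented for illustration; they are not part of the Intel SGX SDK, where the interface is declared in its own definition files and the bridging code is generated.

```python
class ToyEnclave:
    """Toy model of an enclave boundary with explicit ECall and OCall tables."""

    def __init__(self, ocalls):
        self._balance = 0            # state that conceptually lives inside the enclave
        self._ocalls = dict(ocalls)  # untrusted functions the trusted side may call
        self._ecalls = {"deposit": self._deposit, "is_above": self._is_above}

    def ecall(self, name, *args):
        """Entry point used by the untrusted side; only exported functions are reachable."""
        handler = self._ecalls.get(name)
        if handler is None:
            raise PermissionError(f"{name!r} is not an exported ECall")
        return handler(*args)

    def _ocall(self, name, *args):
        """Exit point used by the trusted side, for example to perform I/O."""
        handler = self._ocalls.get(name)
        if handler is None:
            raise PermissionError(f"{name!r} is not a registered OCall")
        return handler(*args)

    def _deposit(self, amount):
        # Trusted code must validate everything that arrives from outside.
        if not isinstance(amount, int) or amount <= 0:
            raise ValueError("amount must be a positive integer")
        self._balance += amount
        # Leave the enclave to log, without passing the secret balance out.
        self._ocall("log", "deposit accepted")

    def _is_above(self, threshold):
        # Only a yes/no answer crosses the boundary, never the balance itself.
        return self._balance > threshold
```

For example, the untrusted side could create `ToyEnclave({"log": print})`, call `ecall("deposit", 60)`, and then ask `ecall("is_above", 100)`. The sketch also hints at why the interface needs careful design: every ECall argument is untrusted input, and every OCall is a chance to leak data.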
Key issues facing confidential computing
The Enclave gives us the CIA security properties described above, but it currently faces major usability problems, which show up in several ways.
First, the original application needs to be divided into two parts: the untrusted part outside the enclave and the trusted part inside the enclave.
Second, the interface between the two parts needs to be carefully designed, deciding when to enter the Enclave and when to exit it. This raises the technical barrier and can be tedious and error-prone.
Third, even with a perfect partition, the environment inside the Enclave is very limited compared with the ordinary Linux runtime environment we are familiar with. For example, code inside an enclave cannot make system calls, libc and pthread support is incomplete, OpenMP is not available, multi-process support is lacking, and so on.
As you can see, porting an application to an Enclave can be challenging and in some cases impossible. Moreover, because developers must deal with many complicated and tedious details that have nothing to do with the business, even a successful port comes with low development efficiency and high development cost, which is unacceptable for fast-paced Internet businesses.
Another major obstacle to putting confidential computing into production is scaling it from a single node to a cluster. With no standard or best practice to refer to, each business often has to build its own Enclave cluster infrastructure from scratch, tightly coupled to its business logic. The result is low development efficiency and duplicated investment.
Meanwhile, Internet services increasingly adopt the cloud native technology stack of containers -> K8S -> Serverless. How to integrate confidential computing clusters with this stack remains a difficult problem.
SOFAEnclave: Ant Financial’s confidential computing innovation
As a leading Internet finance company in China, Ant Financial has a strong need for data protection and has therefore carried out extensive business innovation and technical exploration around confidential computing. This section introduces the result of that work, the SOFAEnclave confidential computing middleware, which targets the key problems described above.
SOFAEnclave is a part of SOFAStack, Ant Financial’s financial-grade distributed architecture. SOFAStack originated in 2007 from Ant Financial’s internal needs and was originally designed to solve the business problems that came with rapid growth. By 2019 it had been honed by 12 years of business practice and had become a mature set of financial-grade best practices. In 2018, Ant Financial announced that it would contribute SOFAStack to the open source community. So far it has open-sourced more than 10 core projects, which have attracted wide attention from the community.
SOFAEnclave focuses on securing the underlying infrastructure, building a layer of trusted middleware for data and code. Our overall goal is to shield the business from the challenges of Enclave development and the complexity of confidential computing clusters, improving ease of use while keeping existing development and deployment habits intact. In short, let the business focus on the business.
The core of SOFAEnclave consists of three parts: the Enclave kernel Occlum, the cloud native confidential computing cluster KubeTEE, and a security testing and analysis framework. Here we focus on Occlum and KubeTEE.
Occlum LibOS: A secure and efficient confidential computing kernel
To address the usability of the Enclave, we designed an Enclave kernel named Occlum and are developing it in the open as a community project. Much like an operating system kernel, Occlum LibOS provides complete system services to the trusted applications inside the Enclave, so applications can be protected by the Enclave without being partitioned or modified.
Occlum is compatible with POSIX programming interfaces and supports multithreading, OpenMP, and multiple processes. In addition, Occlum implements inter-process isolation so that multiple trusted applications are isolated from one another. Occlum makes it easy for developers to use the Enclave’s CIA capabilities, so that data can be neither seen nor attacked and data protection can be put into practice.
Occlum project address: github.com/occlum/occl…
Currently, Occlum can readily support large AI frameworks such as XGBoost and TensorFlow, as well as large applications such as shells, GCC, and web servers. Occlum has the following technical features:
- Memory safety
Memory safety problems are the most common security risk in system software. Race conditions, buffer overflows, null pointer dereferences, stack overflows, heap exhaustion, use-after-free, and double free are all terms used to describe memory safety vulnerabilities. Microsoft said in February 2019 that approximately 70 percent of all Microsoft patches over the past 12 years were for memory safety vulnerabilities. Preventing memory safety problems is therefore very important to the security and robustness of system software.
Occlum is the industry’s first memory-safe SGX LibOS. Occlum LibOS is written in Rust, a memory-safe language, and contains only a small amount of unsafe Rust, C, and assembly code (fewer than 1000 lines). This makes low-level memory safety bugs and vulnerabilities unlikely in Occlum. As a result, compared with traditional SGX LibOSes written in C/C++ (such as Graphene-SGX and SCONE), Occlum is a more reliable foundation for developing high-security applications.
- Simple and easy to use
Occlum LibOS allows Linux applications to run in the Enclave security environment with little or no code changes. The user simply compiles the application with the occlum-clang toolchain and runs it in an SGX enclave using a command-line tool named occlum. The tool provides a number of subcommands, three of which are the most important; among them, occlum run <program_name> <program_args> runs a program from the Occlum trusted image inside the enclave.
Occlum significantly reduces the cost of developing applications for an Enclave. Take the simplest example, Hello World. An SGX Hello World project built with the Intel SGX SDK contains about 10 files and about 300 lines of code; Baidu’s Rust SGX SDK requires about 200 lines of code; Google’s Asylo requires around 100 lines. In contrast, Occlum does not require the user to add any code to the Linux version of Hello World (5 lines of code), and only three lines of commands are needed to run that Linux version in an SGX enclave.
- Efficient multiprocess
Any application on a LibOS runs as a process, and applications are often made up of multiple processes. Efficient multi-process support is therefore critical for a LibOS. However, multi-process support in existing SGX LibOSes is unsatisfactory. The closed-source SCONE supports only multiple threads, not yet multiple processes. The open-source Graphene-SGX is currently the most mature SGX LibOS and does support multiple processes, but each of its LibOS processes must run in a separate SGX enclave, and each enclave must run a separate instance of the LibOS. This N-process, N-enclave architecture guarantees strong isolation between LibOS processes, but it also causes performance and functional problems:
① Slow process startup: Graphene-SGX creates a separate enclave for each LibOS process, and creating an enclave is very expensive, so Graphene-SGX LibOS processes start extremely slowly (nearly 10,000 times slower than process startup on Linux).
② High IPC overhead: Each Graphene-SGX LibOS process is completely isolated from the outside world by its enclave, so communication between LibOS processes must pass encrypted data through an untrusted buffer outside the enclaves. Encryption and decryption greatly increase the overhead of interprocess communication.
③ Consistency is hard to guarantee: Graphene-SGX has N processes and N LibOS instances, and in principle these N instances should present a consistent OS state, such as an encrypted file system, to the applications above them. But it is clearly difficult to synchronize file system state (such as the key for each file block) across multiple LibOS instances. This is why Graphene-SGX does not yet provide an encrypted file system.
Unlike Graphene-SGX, Occlum is a single-address-space LibOS: multiple LibOS processes run in the same enclave. This architecture is especially suitable for multi-process collaboration scenarios, in which multiple mutually trusting processes make up a single application or service. This “multi-process shared enclave” architecture gives Occlum’s multi-process support three advantages:
① Fast process startup: process startup in Occlum is 13-6600 times faster than in Graphene-SGX (Figure 4);
② Low IPC overhead: the interprocess communication bandwidth of Occlum is 3 times that of Graphene-SGX (Figure 5);
③ Encrypted file system: Occlum provides applications with a transparent, writable encrypted file system that protects the confidentiality and integrity of both metadata and data in the file system.
KubeTEE: Financial-grade cloud-native confidential computing cluster
To address the problem of Enclave clustering, we asked how TEE resources could be used to deliver confidential computing services more efficiently and more simply. Our answer is KubeTEE, which builds on cloud native technology to provide confidential computing cluster services.
On the one hand, KubeTEE saves service users from building the same infrastructure over and over. On the other hand, users only need to register an account to use the confidential computing cluster service, which greatly lowers the barrier to confidential computing and improves both ease of use and utilization. To use physical resources more efficiently, KubeTEE deploys and manages confidential computing images and EPC resources on top of K8S and containers. Relying on the container scheduling capability of K8S, KubeTEE can quickly scale confidential computing service resources out and in. In general, we want Enclave and confidential computing cluster resources to be used in a more cloud-native way:
① Provide Enclave Container based service deployment, together with infrastructure operations and maintenance and upgrades that are transparent to business services;
② Provide Serverless confidential computing services, supporting business services from a general-purpose confidential computing resource pool;
③ Provide platform-level business development capabilities by combining common confidential computing components, middleware services, and R&D processes.
The figure above describes how the Serverless confidential computing cluster is realized. We provide the final confidential computing service and, at the same time, abstract the components accumulated along the way into reusable modules, to meet the customization needs of different businesses and improve the efficiency of developing Enclave-based confidential computing services.
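To make the idea of managing EPC as a schedulable cluster resource more concrete, here is a minimal sketch of EPC-aware placement and scale-out. It is an illustration under stated assumptions and does not describe how KubeTEE or K8S is implemented: the `Node` type, the `schedule` and `scale_out` functions, the "most free EPC first" policy, and all sizes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A worker node with a fixed amount of EPC (hypothetical sizes, in MiB)."""
    name: str
    epc_total_mib: int
    epc_used_mib: int = 0
    pods: list = field(default_factory=list)

    @property
    def epc_free_mib(self):
        return self.epc_total_mib - self.epc_used_mib


def schedule(pod_name, epc_request_mib, nodes):
    """Place one enclave container on the fitting node with the most free EPC.

    Returns the chosen node name, or None if no node has enough free EPC.
    """
    candidates = [n for n in nodes if n.epc_free_mib >= epc_request_mib]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: n.epc_free_mib)
    best.epc_used_mib += epc_request_mib
    best.pods.append(pod_name)
    return best.name


def scale_out(service, replicas, epc_request_mib, nodes):
    """Try to start `replicas` copies of a service; stop when EPC runs out."""
    placed = []
    for i in range(replicas):
        pod_name = f"{service}-{i}"
        node_name = schedule(pod_name, epc_request_mib, nodes)
        if node_name is None:
            break
        placed.append((pod_name, node_name))
    return placed
```

Treating EPC like CPU or memory in the placement decision is what lets a shared resource pool serve several businesses. Scaling in would simply release the EPC recorded for the removed containers.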
Typical Application Scenarios
Confidential computing has a wide range of applications, including enclave-based copyright protection, biometric protection, genetic data processing, key protection, key management systems, privacy-preserving machine learning, encrypted data analysis, and confidential databases. Other technologies, such as blockchain privacy computing, blockchain + AI, and privacy-preserving edge computing, can also be built on confidential computing to better serve application scenarios. This section discusses two somewhat more complex application scenarios drawn from Internet services.
Enclave-based multi-party collaborative learning
As we all know, two factors have driven the development of artificial intelligence: the increase in computing power and the growth in data scale. However, any single organization has a limited business domain and audience, so the data it accumulates is neither comprehensive nor large in scale.
To maximize the value of data, a natural idea is to pool data from multiple parties and mine it centrally. However, because of concerns about business confidentiality and industry competition, organizations cannot freely share their data with others.
This leads to a seemingly paradoxical situation in which multiple institutions both compete and cooperate, and both share data and keep it secret (Figure 6). We call this collaborative learning.
How can this contradiction be resolved? One solution is for each party to import its encrypted data into the Enclave, where the data is decrypted, aggregated, and mined. For implementation details, please refer to the article by Ant Financial’s Shared Intelligence team.
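The import, decrypt, aggregate, and mine flow can be sketched as follows. This is a toy illustration under stated assumptions, not the Shared Intelligence team’s implementation: `ToyLearningEnclave` and its methods are hypothetical, the "model" is just a mean, and `seal`/`unseal` are a deliberately simple stand-in built from the standard library in place of a real authenticated cipher. They must not be used for actual protection.

```python
import hashlib
import hmac
import json
import os


def _keystream(key, nonce, length):
    """Toy keystream: SHA-256 over key, nonce, and a counter (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def seal(key, plaintext):
    """Encrypt then MAC; layout is nonce (16 bytes), ciphertext, tag (32 bytes)."""
    enc_key = hashlib.sha256(key + b"enc").digest()
    mac_key = hashlib.sha256(key + b"mac").digest()
    nonce = os.urandom(16)
    stream = _keystream(enc_key, nonce, len(plaintext))
    cipher = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(mac_key, nonce + cipher, hashlib.sha256).digest()
    return nonce + cipher + tag


def unseal(key, blob):
    """Verify the tag first, then decrypt; raise ValueError on tampering."""
    enc_key = hashlib.sha256(key + b"enc").digest()
    mac_key = hashlib.sha256(key + b"mac").digest()
    nonce, cipher, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(mac_key, nonce + cipher, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("ciphertext failed integrity check")
    stream = _keystream(enc_key, nonce, len(cipher))
    return bytes(a ^ b for a, b in zip(cipher, stream))


class ToyLearningEnclave:
    """Stands in for code running inside an Enclave; plaintext never leaves it."""

    def __init__(self):
        self._party_keys = {}  # in practice, provisioned only after attestation
        self._records = []

    def provision_key(self, party, key):
        self._party_keys[party] = key

    def import_data(self, party, blob):
        # Decrypt inside the enclave and pool the records from all parties.
        self._records.extend(json.loads(unseal(self._party_keys[party], blob)))

    def train_mean_model(self):
        # Stand-in for "mining": only an aggregate result is released.
        return sum(self._records) / len(self._records)
```

Each institution would call `seal` on its own data with its own key and hand only the ciphertext to the operator. The other institutions and the host see ciphertext and the final aggregate, never one another’s raw records.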
AI model security protection
An AI model deployed externally embodies a large amount of intellectual property. If it is reverse engineered or leaked, the damage goes beyond the loss of a technical moat: it also makes adversarial example attacks easier, leading to security problems.
One solution to this threat is for the user to encrypt the AI model and the training or prediction data, import them into the Enclave only when they are used, decrypt them inside the Enclave, and have an AI framework running inside the Enclave process them. The result is returned in plaintext or in encrypted form; an encrypted result is decrypted locally on the user’s side. This requires the Enclave to support common AI frameworks, which can be extremely challenging: these frameworks typically depend on complex runtime features for performance, such as multithreading and OpenMP, and a bare Enclave simply does not provide them. This is why many Enclave support systems on the market struggle to support AI frameworks at all, or to support them efficiently.
As mentioned earlier, Occlum LibOS has made some progress in this area, making it easier to run common AI frameworks efficiently.
Summary and Outlook
Confidential computing is on the rise: academic research is in full swing, and industrial applications are becoming richer and more practical. Ant Financial is a technology explorer and business pioneer in the field of confidential computing, yet many open problems remain that require the cooperation of the whole ecosystem.
We are gradually contributing the modules of SOFAEnclave to the open source community and welcome colleagues from industry and academia to get in touch and collaborate. For Occlum LibOS, we will support more real-world applications and build Enclave Container, a secure container protected by an Enclave. For KubeTEE, we hope to build an ecosystem with partners, maintain trusted applications and image repositories, and promote the standardization of confidential computing cluster solutions. Project address: github.com/occlum/occl…
Financial-Grade Distributed Architecture (Antfin_SOFA)