Before the main text

This post is a bit rough… Without further ado, on to the paper!

Reference: Amit N., Wei M. The Design and Implementation of Hyperupcalls. In: 2018 USENIX Annual Technical Conference (USENIX ATC 18). 2018: 97-112.

What follows is the usual long translation. Most of the heavy lifting was done by Google Translate, with a round of manual review by me. I also have the slides; if you need them there is no need to bother the author (he probably will not reply unless someone reminds him; I had to get them from one of his students, though I believe he posts all his slides on a website somewhere, I just forget which).

Main text

Abstract

The virtual machine abstraction offers a variety of benefits and undeniably underpins cloud computing. However, virtual machines are a double-edged sword, because the hypervisors they run on must treat them as black boxes, limiting the information they can exchange. In this paper, we describe the design and implementation of a new mechanism, hyperupcalls, which enables a hypervisor to safely execute verified code provided by a guest virtual machine in order to transfer information. Hyperupcalls are written in C and have full access to guest data structures, such as page tables. We provide a complete framework that gives hyperupcalls easy access to familiar kernel functions. Hyperupcalls are more flexible and less intrusive than state-of-the-art paravirtualization techniques and virtual machine introspection. We demonstrate that hyperupcalls can not only improve guest performance for certain operations by up to 2×, but can also serve as powerful debugging and security tools.

1 Introduction

Hardware virtualization introduces the abstraction of virtual machines (VMs), enabling a host, called the hypervisor, to run multiple operating systems (OSes), called guests, at the same time, each assuming it is running on its own physical machine. This is done by exposing a hardware interface that mimics real physical hardware. The introduction of this simple abstraction led to the rise of the modern data center and the cloud as we know them today. Unfortunately, virtualization is not without drawbacks. Although the goal of virtualization is to keep the VM and the hypervisor separate from each other, this separation prevents both sides from understanding decisions made on the other side, a problem known as the semantic gap.

【The semantic gap characterizes the difference between two descriptions of an object expressed in different representations, for instance different languages or symbol systems. According to Hein, the semantic gap can be defined as “the difference in meaning between constructs formed within different representation systems”.】

Addressing the semantic gap is critical to performance. Without information about guest decisions, the hypervisor may allocate resources suboptimally. For example, the hypervisor cannot know which memory is free in a guest virtual machine without knowing the state of its internal operating system, which would break the VM abstraction. Today, state-of-the-art hypervisors typically bridge the semantic gap through paravirtualization [11,58], which makes the guest aware of the hypervisor. Paravirtualization frees the guest from the physical hardware interface and allows direct information exchange with the hypervisor, improving overall performance by enabling the hypervisor to make better resource allocation decisions.

However, paravirtualization involves executing code in both the hypervisor's and the guest's context. Hypercalls let the guest issue requests to be executed in the hypervisor, much like system calls, and upcalls let the hypervisor issue requests to be executed in the guest. This design has several disadvantages. First, paravirtual mechanisms introduce context switches between the hypervisor and the guest, whose cost can be substantial if the guest and the hypervisor need to interact frequently [7]. Second, the requester of a paravirtual mechanism must wait to be served in another, possibly busy, context, or wake the guest if it is idle. Finally, paravirtual mechanisms couple the designs of the hypervisor and the guest: each mechanism must be implemented for every guest and hypervisor pair, increasing complexity [46] and impeding maintainability [77]. Adding paravirtual features requires updating guest VMs and hypervisors with new interfaces [69], and may introduce bugs and attack surfaces [47,75].

【Every nontrivial piece of a program depends on external variables; only a trivial function such as add has none. Once a function depends on external variables it is not self-contained and cannot run on its own. To make it runnable, you must supply values for all of those external variables; the collection of those values is called the context. For example, in a C++ lambda expression, what goes inside the brackets of [what is written here is the context](int a, int b){…}.】

A different set of techniques, VM introspection (VMI) [25] and its reverse, hypervisor introspection (HVI) [72], aims to address some of the shortcomings of paravirtualization by introspecting the other context, enabling information transfer without context switches or prior coordination. However, these techniques are fragile: small changes in data structures, behavior, or even security hardening [31] can break the introspection mechanism or, worse, introduce security vulnerabilities. As a result, introspection is often relegated to the realm of intrusion detection systems (IDS), which can detect malware or misbehaving applications.

VMI tools may be located inside or outside the virtual machine and operate by tracking events (interrupts, memory writes, and so on) or by sending requests to the virtual machine. The virtual machine monitor usually provides only low-level information, such as the raw bytes of memory; converting this low-level view into something meaningful for the user is known as the semantic gap problem, and solving it requires analyzing and understanding the system being monitored. Virtual machine introspection is thus a technique for monitoring the state of a system-level virtual machine from the outside. The monitor can be placed in another virtual machine, inside the VMM, or in any other part of the virtualization architecture. In VMI, the state of a VM is broadly defined to include processor registers, memory, disks, network, and any hardware-level events.

In this paper, we describe the design and implementation of hyperupcalls, a technique that enables the hypervisor to communicate with a guest, like upcalls, but without a context switch, like VMI. This is achieved through verified code, which lets the guest communicate with the hypervisor in a flexible manner while guaranteeing that the guest cannot provide misbehaving or malicious code. Once a guest registers a hyperupcall, the hypervisor can execute it to perform actions such as locating free guest pages or running guest interrupt handlers without switching into the guest.

Hyperupcalls are easy to build: they are written in a high-level language such as C, and we provide a framework that allows hyperupcalls to share the same code base and build system as the Linux kernel (the approach can be generalized to other operating systems). When the kernel is compiled, the toolchain converts the hyperupcalls into verifiable bytecode. This makes hyperupcalls easy to maintain. At boot time, the guest virtual machine registers its hyperupcalls with the hypervisor, which validates the bytecode and compiles it back to native code for performance. Once recompiled, the hypervisor can invoke a hyperupcall at any time.

We show that hyperupcalls can significantly improve performance by allowing the hypervisor to proactively allocate resources instead of waiting for the guest virtual machine to react through existing mechanisms. We built hyperupcalls for memory reclamation and for handling inter-processor interrupts (IPIs), and show performance improvements of up to 2×. Beyond performance, hyperupcalls also enhance the security and debuggability of virtualized systems. We developed a hyperupcall that enables a guest to write-protect its memory pages without dedicated hardware, and another that enables ftrace [57] to capture guest and hypervisor events in a unified trace, giving us new performance insights into virtualized environments.

Write protection is the ability of a hardware device or software program to prevent new information from being written or existing information from being changed; typically this means data can be read but not written. A physical example is the write-protect switch on an SD card, which turns write protection on and off for the card. Ftrace is a tracing utility built directly into the Linux kernel; many distributions enable various ftrace configurations in their recent releases. One of the benefits of ftrace on Linux is the ability to see what is happening inside the kernel.

This paper makes the following contributions:

  • We establish a taxonomy of mechanisms for bridging the semantic gap between hypervisors and guests, and place hyperupcalls within this taxonomy.

  • We describe and implement hyperupcalls (§3), including:

    • An environment for writing hyperupcalls and a framework for reusing guest code (§3.1).
    • A compiler (§3.2) and verifier (§3.4) for hyperupcalls, which address the complexity and limitations of verified code.
    • Registration (§3.3) and execution (§3.5) mechanisms for hyperupcalls.
  • We prototype and evaluate hyperupcalls and show that they can improve performance (§4.2, §4.3), security (§4.5), and debuggability (§4.4).

2 Communication Mechanisms

It is now widely accepted that, to extract maximum performance and utility from virtualization, the hypervisor and its guests need to know about each other. To this end, many mechanisms exist to facilitate communication between the hypervisor and the guest. Table 1 summarizes these mechanisms, which can be broadly characterized by who makes the request, who executes it, and whether the mechanism requires the hypervisor and guest to coordinate in advance.

In the next sections, we discuss these mechanisms and describe how hyperupcalls fill the need for a communication mechanism in which the hypervisor makes and executes its own requests without a context switch. We begin with the state-of-the-art paravirtual mechanisms in use today.

2.1 Paravirtualization

Hypercalls and upcalls. Today, most hypervisors leverage paravirtualization to communicate across the semantic gap. The two most widely used mechanisms are hypercalls, which allow a guest to invoke services provided by the hypervisor, and upcalls, which allow the hypervisor to make requests to the guest. Paravirtualization means that the interfaces of these mechanisms are coordinated in advance between the hypervisor and the guest [11].

A major disadvantage of upcalls and hypercalls is that they require a context switch, as both mechanisms execute on the opposite side of the request. Therefore, these mechanisms must be invoked with care: calling hypercalls or upcalls too frequently leads to high latency and wasted computing resources [3].

Another disadvantage of upcalls is that the request is handled by a guest that may be busy with other tasks. If the guest is busy or idle, there is an additional penalty for waiting for the guest to become free or to wake up. This can take an unbounded amount of time, and the hypervisor may have to rely on penalty mechanisms to ensure that the guest responds within a reasonable time.

Finally, by increasing the coupling between the hypervisor and its guests, paravirtual mechanisms can be hard to maintain. Each hypervisor has its own paravirtual interface, and each guest OS must implement every hypervisor's interface. Paravirtual interfaces are not thin: Microsoft's paravirtual interface specification is 300 pages long [46]. Linux provides a variety of paravirtual hooks that hypervisors can use to communicate with the VM [78]. Despite efforts to standardize paravirtual interfaces, they remain incompatible with each other and evolve over time, adding and even removing functionality (for example, Microsoft's hypervisor event tracing). As a result, most hypervisors do not fully support the standardization efforts, and specialized operating systems look for alternative solutions [45,54].

Pre-virtualization. Pre-virtualization [42] is another mechanism by which a guest requests services from the hypervisor, but the request is served within the guest's own context. This is done through code injection: the guest leaves stubs that the hypervisor fills in with hypervisor code. Pre-virtualization improves on hypercalls because it provides a more flexible interface between the guest and the hypervisor. Arguably, pre-virtualization has a fundamental limitation: the code running in the guest is deprivileged and cannot perform sensitive operations, such as accessing shared I/O devices. Therefore, with pre-virtualization, the hypervisor code running on the guest side still needs to communicate with privileged hypervisor code using hypercalls.

2.2 Introspection

Introspection occurs when the hypervisor or the guest tries to infer information about the other context rather than communicating with it directly. With introspection, no interface or coordination is required. For example, the hypervisor might try to infer the state of a completely unknown guest simply by observing its memory access patterns. Another difference between introspection and paravirtualization is that no context switch occurs: all introspection code executes in the requester's context.

Virtual machine introspection (VMI). When the hypervisor introspects a guest, this is called VMI [25]. VMI was first introduced to enhance VM security by providing intrusion detection (IDS) and kernel integrity checks from a privileged host [10,24,25]. VMI is also used for checkpointing and deduplicating VM state [1], and for monitoring and enforcing hypervisor policies [55]. These mechanisms range from simply observing a VM's memory and I/O access patterns [36] to accessing VM OS data structures [16]; at the extreme, they can modify VM state or even inject processes directly into it [26,19]. The main benefit of VMI is that the hypervisor can invoke it directly without a context switch, and the guest does not need to be “aware” of VMI for it to work properly. However, VMI is fragile: a harmless change in the VM's OS, such as a patch that adds an extra field to a data structure, can break VMI [8]. Thus, VMI tends to be a “best-effort” mechanism.

HVI. To a lesser extent, a guest may introspect the hypervisor it runs on, which is called hypervisor introspection (HVI) [72,61]. HVI is commonly used to protect VMs from untrusted hypervisors [62] or by malware to circumvent hypervisor security [59,48].

2.3 Extensible operating systems

While hypervisors provide fixed interfaces, OS research over the years has shown that flexible operating system interfaces can improve performance without sacrificing security. Exokernel provides low-level primitives and lets applications implement high-level abstractions, such as memory management [22]. SPIN allows kernel functionality to be extended to provide application-specific services, such as specialized interprocess communication [13]. The key feature that makes these extensions perform well without compromising security is the use of simple bytecode to express application needs and running that code in the same protection ring as the kernel. Our work is inspired by these studies; our goal is to design a flexible interface between the hypervisor and the guest to bridge the semantic gap.

2.4 Hyperupcalls

This paper introduces hyperupcalls, which fill the need for a mechanism by which the hypervisor communicates with guest virtual machines that is coordinated (unlike VMI), executed by the hypervisor itself (unlike upcalls), and requires no context switch (unlike hypercalls). With hyperupcalls, the VM coordinates with the hypervisor by registering verifiable code. The hypervisor then executes this code in response to events such as memory pressure or VM entry/exit. In a sense, hyperupcalls can be thought of as upcalls executed by the hypervisor.

In contrast to VMI, the code that accesses VM state is provided by the guest, so hyperupcalls have complete knowledge of the guest's internal data structures; in fact, hyperupcalls are built using the guest OS code base and share the same code, simplifying maintenance. Hyperupcalls also give the operating system a mechanism to describe its state to the underlying hypervisor.

In contrast to upcalls, in which the hypervisor makes asynchronous requests to guest VMs, hyperupcalls can be executed by the hypervisor at any time, even when the guest VM is not running. With upcalls, the hypervisor is at the mercy of the guest, which may delay the upcall [6]. In addition, because upcalls operate like remote requests, they may force OS functionality to be implemented differently. For example, in ballooning [71], the canonical technique used to identify free guest memory, the guest uses a dummy process to generate memory pressure and thereby release free pages. With hyperupcalls, the hypervisor can scan the guest's free pages directly, just like a guest kernel thread would.

Ballooning: the VM tools installed in the virtual machine include a balloon driver, which tells the hypervisor which inactive memory pages can be reclaimed. This has no impact on the performance of the applications in the virtual machine.

Hyperupcalls are similar to pre-virtualization in that code is transferred across the semantic gap. Transferring code not only enables more expressive communication, but also improves performance and functionality by moving the execution of the request to the other side of the gap. Unlike pre-virtualization, however, the hypervisor cannot trust code provided by the virtual machine, and the hypervisor must ensure that the hyperupcall execution environment is consistent across invocations.

3 Architecture

Hyperupcalls are short, verifiable programs provided by guest VMs to the hypervisor to improve performance or provide additional functionality. A guest VM registers its hyperupcalls with the hypervisor at boot, giving the hypervisor access to the guest OS state; after verification, the hypervisor executes them to provide services. The hypervisor runs hyperupcalls in response to events or when it needs to query guest state. The architecture of hyperupcalls and the system we built to leverage them are shown in Figure 1.

Our goal is to make hyperupcalls as easy to build as possible. To do this, we provide a complete framework that lets programmers write hyperupcalls using the guest OS code base. This greatly simplifies hyperupcall development and maintenance. The framework compiles this code into verifiable code, which the guest registers with the hypervisor. In the next sections, we describe how OS developers write hyperupcalls using our framework.

3.1 Building hyperupcalls

Guest OS developers write a hyperupcall for each hypervisor event they wish to handle. The hypervisor and the guest agree on these events in advance, such as VM entry/exit, page mapping, or virtual CPU (VCPU) preemption. Each hyperupcall is identified by a predefined identifier, much like the UNIX system call interface [56]. Table 2 shows examples of events that hyperupcalls can handle.
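
To make the event/identifier pairing concrete, here is a minimal sketch of what such a table of identifiers might look like; the names and values are assumptions for illustration, not the actual ABI agreed between the hypervisor and the guest:

```c
/* Illustrative hyperupcall event identifiers (not the real ABI). */
enum hyperupcall_event {
    HUC_EVENT_VM_EXIT      = 0,  /* hypervisor is about to exit the VM        */
    HUC_EVENT_VM_ENTRY     = 1,  /* hypervisor is about to re-enter the VM    */
    HUC_EVENT_PAGE_MAP     = 2,  /* a guest page is being mapped              */
    HUC_EVENT_PAGE_UNMAP   = 3,  /* a guest page is being unmapped            */
    HUC_EVENT_VCPU_PREEMPT = 4,  /* a VCPU is being preempted                 */
    HUC_EVENT_INTR_DELIVER = 5,  /* an interrupt is delivered to a VCPU       */
    HUC_EVENT_MAX
};
```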

3.1.1 Providing safe code

A key property of hyperupcalls is that their code must not be able to compromise the hypervisor. For a hyperupcall to be safe, it must only access restricted memory regions designated by the hypervisor, run for a bounded period of time without blocking, sleeping, or taking locks, and use only hypervisor services that are explicitly permitted.

Because the guest is not trusted, the hypervisor must establish a security mechanism that guarantees these safety properties. There are several solutions to choose from: software fault isolation (SFI) [70], proof-carrying code [51], or a safe language such as Rust. For hyperupcalls, we chose the extended Berkeley Packet Filter (eBPF) VM.

We chose eBPF for several reasons. First, eBPF is relatively mature: BPF was introduced more than 20 years ago and is used throughout the Linux kernel, initially for packet filtering but since extended to support other use cases such as sandboxing system calls (seccomp) and kernel event tracing [34]. eBPF is widely adopted and supported by a variety of runtimes [14,19]. Second, eBPF can be verified to have the required safety properties, and Linux ships with a verifier and a JIT that validate and efficiently execute eBPF code [74]. Finally, eBPF has an LLVM compiler back end, which together with a compiler front end (Clang) generates eBPF bytecode from a high-level language. Since operating systems are typically written in C, the eBPF LLVM back end gives us a simple mechanism for converting unsafe guest OS source code into verifiably safe eBPF bytecode.

3.1.2 Framework from C to eBPF

Unfortunately, writing a hyperupcall is not as simple as recompiling OS code into eBPF bytecode. Our framework, however, is designed to make writing hyperupcalls as simple and maintainable as possible. It provides three key features. First, the framework handles guest address translation, so guest OS symbols are available to hyperupcalls. Second, the framework works around the restrictions that eBPF imposes on C code. Third, the framework defines a simple interface that provides data to hyperupcalls so they can execute efficiently and safely.

Guest OS symbols and memory. Even though hyperupcalls can access the entire physical memory of the guest VM, accessing the guest OS's data structures requires knowing where they reside. Operating systems typically use kernel address space layout randomization (KASLR) to randomize the virtual offsets of OS symbols, so they are unknown at compile time. Our framework resolves OS symbol offsets at run time by associating pointers with address-space attributes and injecting code that adjusts the pointers. When registering a hyperupcall, the guest provides the actual symbol offsets, which lets hyperupcall developers reference OS symbols (variables and data structures) in C code as if the code were running as a guest kernel thread.
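
As an illustration of the pointer annotation described above, here is a minimal sketch in the style of Linux's __user/__iomem markers; the macro name __guest and the address-space number are assumptions, not the framework's actual definitions:

```c
/* Illustrative address-space annotation for guest pointers. */
#define __guest __attribute__((address_space(1)))

struct page;                                   /* guest kernel's struct page  */

/* A guest OS symbol referenced from hyperupcall code; its real offset is
 * supplied by the guest at registration time (because of KASLR), and the
 * compiler injects GVA->HVA rebasing code for every dereference. */
extern struct page __guest *guest_vmemmap;
```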

Global vs. local hyperupcalls. Not all hyperupcalls need to execute in a timely manner. For example, a notification of a hypervisor event (such as a VM entry/exit or an interrupt injection) affects only the guest, not the hypervisor. We refer to hyperupcalls that affect only the registering guest VM as local hyperupcalls, and hyperupcalls that affect the hypervisor as a whole as global hyperupcalls. If a hyperupcall is registered as local, we relax the timing requirements and allow it to block and sleep. Local hyperupcalls are charged to the guest's VCPU time, so a misbehaving hyperupcall only punishes itself.

Global hyperupcalls, however, must complete execution in a timely manner. We ensure that the guest OS pages requested by a global hyperupcall are pinned for the duration of the hyperupcall, and the accessible memory is limited to a configurable 2% of the guest's total physical memory. Since local hyperupcalls may block, the memory they use does not need to be pinned, allowing local hyperupcalls to address all guest memory.

Addressing eBPF limitations. Although eBPF is expressive, the safety guarantees of eBPF bytecode mean that it is not Turing-complete and is restricted, so only a subset of C code can be compiled into eBPF. The main limitations of eBPF are that it does not support loops, the ISA has no atomic operations, it cannot use self-modifying code, function pointers, static variables, or native assembly code, and programs cannot be so long or complex that they fail verification.

One consequence of these restrictions is that hyperupcall developers must be aware of their code's complexity, because overly complex code will fail the verifier. While this may seem an unintuitive limitation, other Linux developers who use eBPF face the same constraint, and we provide helper functions in the framework to reduce complexity, such as memset and memcpy, as well as functions that perform native atomic operations, such as cmpxchg. A selection of these helper functions is shown in Table 3. In addition, our framework masks memory accesses (§3.4), which greatly reduces verification complexity. In practice, as long as we carefully unrolled loops, we did not encounter verifier problems in the use cases we developed (§4), using a limit of 4096 instructions and a stack depth of 1024.

Hyperupcall interface. When the hypervisor invokes a hyperupcall, it populates a context data structure, shown in Table 4. The hyperupcall receives an event data structure indicating the reason the callback was invoked, as well as a pointer to the guest VM's memory (in the address space of the hypervisor, where the hyperupcall executes). When the hyperupcall completes, it may return a value that the hypervisor can use.

Writing a hyperupcall. With our framework, OS developers write C code that has access to OS variables and data structures, supplemented by the framework's helper functions. A typical hyperupcall reads the event fields, reads or updates OS data structures, and possibly returns data to the hypervisor. Because hyperupcalls are part of the operating system, developers can reference the same data structures used by the OS itself, for example through header files. This greatly improves the maintainability of hyperupcalls, as data layout changes stay synchronized between the OS source and the hyperupcall source.
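
As a concrete illustration, the following is a minimal sketch of what a hyperupcall handler might look like, assuming a context layout in the spirit of Table 4 and a hypothetical guest symbol guest_nr_running; none of these names are taken from the actual implementation:

```c
/* Assumed context layout; the real one is defined by the framework. */
struct huc_ctx {
    int            event;        /* reason the hyperupcall was invoked        */
    unsigned long  event_data;   /* event-specific payload                    */
    void          *guest_mem;    /* base of the registered guest memory       */
};

#define HUC_EVENT_VCPU_PREEMPT 4 /* illustrative event id                     */

extern unsigned long guest_nr_running;  /* hypothetical guest OS symbol, offset
                                           fixed up at registration time      */

unsigned long huc_on_vcpu_preempt(struct huc_ctx *ctx)
{
    if (ctx->event != HUC_EVENT_VCPU_PREEMPT)
        return 0;

    /* Read guest OS state as if running inside the guest kernel; the
     * framework rewrites this access into a bounded host-side load. */
    return guest_nr_running;     /* the return value is visible to the hypervisor */
}
```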

It is important to note that a hyperupcall cannot directly call guest OS functions, because that code has not been secured by the framework. However, OS functions can be compiled into the hyperupcall and become part of the verified code.

3.2 Compilation

Once a hyperupcall is written, it needs to be compiled into eBPF bytecode before the guest can register it with the hypervisor. Our framework runs the hyperupcall C code through Clang and the eBPF LLVM back end to generate this bytecode as part of the guest OS build process, with some modifications to assist address translation and verification:

Guest memory access. To access guest memory, we use eBPF's direct packet access (DPA) feature, which was designed to allow programs to access network packets safely and efficiently without helper functions. Instead of passing a network packet, we treat the guest's memory as the “packet.” Using DPA in this way required a bug fix to the eBPF LLVM back end [2], because it was written with the assumption that packet sizes are ≤64KB.

Address translation. Hyperupcalls let the hypervisor use guest virtual addresses (GVAs) seamlessly, making it appear as if the hyperupcall runs inside the guest. However, the code actually executes in the hypervisor, which uses host virtual addresses (HVAs), making guest pointers invalid. To allow guest pointers to be used transparently in the host context, they need to be translated from GVAs to HVAs. We use the compiler to perform these translations.

To simplify this translation, the hypervisor maps each GVA range contiguously into the HVA space, so address translation can be done simply by adjusting the base address. Because a guest may need a hyperupcall to access multiple contiguous GVA ranges, for example one for the guest's 1:1 direct mapping and one for the OS text section [37], the framework annotates each pointer with its respective “address space” attribute. We extended the LLVM compiler to use this information to inject eBPF code that translates each pointer from GVA to HVA with a simple subtraction. Note that the hypervisor does not assume the generated code is safe; it is verified when the hyperupcall is registered.
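
The injected translation amounts to a base adjustment; a rough sketch (with illustrative names, and emitted in reality as eBPF by the LLVM pass rather than written by hand) might look like this:

```c
/* Sketch of the GVA->HVA rebasing the compiler injects for annotated guest
 * pointers; a single add/subtract suffices because the hypervisor maps each
 * registered GVA range contiguously into HVA space. */
static inline void *gva_to_hva(unsigned long gva,
                               unsigned long range_gva_base,
                               unsigned long range_hva_base)
{
    return (void *)(gva - range_gva_base + range_hva_base);
}
```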

Bounds checking. The verifier rejects code with direct memory accesses unless it can prove that every access falls within the “packet” (in our case, guest memory) bounds. Hyperupcall programmers cannot be expected to add the required checks themselves, as doing so would be too burdensome. We therefore enhanced the compiler to automatically emit code that performs bounds checking before each memory access, allowing verification to pass. As we explain in §3.4, bounds checking is done using masking rather than branching to simplify verification.

Context caching. Our compiler extension introduces intrinsics to obtain a pointer to the context or to read its data. The context is frequently needed in callbacks to invoke helper functions and to translate GVAs. Passing the context as a function parameter would require intrusive changes and prevent code sharing between the guest and its hyperupcalls. Instead, we use the compiler to cache the context pointer in one of the registers and retrieve it when needed.

3.3 Registration

After a hyperupcall has been compiled into eBPF bytecode, it can be registered. Guests may register hyperupcalls at any time, but most are registered while the guest boots. The guest provides the hyperupcall event ID, the hyperupcall bytecode, and the virtual memory that the hyperupcall will use. Each parameter is described below; a sketch of how they might be packaged follows the list:

  • Hyperupcall event ID. The ID of the event to handle.
  • Memory registration. A virtually contiguous guest memory region that the hyperupcall will use. For global hyperupcalls, this memory is limited to 2% of the guest VM's total physical memory (configurable and enforced by the hypervisor).
  • Hyperupcall bytecode. A pointer to the hyperupcall bytecode and its size.
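
The three registration parameters above might be packaged into a single argument block passed to the hypervisor, roughly as in the following sketch; the structure and field names are assumptions for illustration, not KVM's actual ABI:

```c
/* Illustrative registration argument block (not the real interface). */
struct huc_register_args {
    unsigned int   event_id;      /* which hypervisor event this handles       */
    const void    *bytecode;      /* eBPF bytecode produced at kernel build    */
    unsigned long  bytecode_len;
    void          *mem_base;      /* virtually contiguous guest memory region  */
    unsigned long  mem_len;       /* <= 2% of guest RAM for global hyperupcalls */
};
```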

3.4 Verification

The hypervisor verifies that each hyperupcall is safe at registration time. Our verifier is based on the Linux eBPF verifier and checks three properties of the hyperupcall: memory accesses, the number of runtime instructions, and the helper functions used.

Ideally, verification would be both sound and complete: only safe code would pass verification, and every safe program would verify successfully. While soundness cannot be sacrificed without compromising system security, many verification systems (including eBPF) sacrifice completeness to keep the verifier simple. In practice, verifiers require programs to be written in a certain way in order to pass verification [66], and even then verification may fail due to path explosion. These limitations conflict with our goal of making hyperupcalls easy to build.

Below we discuss the properties the verifier checks and how we simplify these checks to make verification as straightforward as possible.

Bounded runtime instructions. For global hyperupcalls, the eBPF verifier ensures that every possible execution of the hyperupcall contains a bounded number of instructions, set by the hypervisor (4096 by default). This ensures that the hypervisor can execute the hyperupcall in a timely manner and that no infinite loop can prevent the hyperupcall from exiting.

Memory access verification. The verifier ensures that memory accesses occur only within the region defined by the “packet,” which for hyperupcalls is the virtual memory region provided during registration. As mentioned earlier, we enhanced the compiler to automatically add code that proves to the verifier that every memory access is safe.

However, adding such code naively leads to frequent verification failures. The current Linux eBPF verifier has a very limited ability to prove memory accesses safe, because it requires each access to be preceded by compare-and-branch instructions that prevent out-of-bounds accesses. The verifier explores the possible execution paths to ensure they are safe. Although the verifier employs various optimizations to prune branches and avoid walking every possible path, verification often exhausts the available resources and fails, as we and others have experienced [65].

Therefore, instead of using compare-and-branch to make memory accesses safe, our enhanced compiler adds code that masks memory access offsets to stay within each region, preventing out-of-bounds accesses. We enhanced the verifier to recognize this masking as safe. After applying this enhancement, all of the programs we wrote passed verification.
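
For illustration, the masking transformation might look roughly like the following sketch, assuming the registered region length is a power of two; the real transformation is emitted as eBPF instructions by the compiler, not written by the developer:

```c
/* Masking-based bounds check: instead of `if (off < len) ...` the offset is
 * masked so it can never reach past the registered region. Out-of-range
 * offsets wrap into the region rather than faulting or branching. */
static inline unsigned long bounded_offset(unsigned long off,
                                           unsigned long region_mask)
{
    return off & region_mask;   /* region_mask = region_len - 1 (power of two) */
}
```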

Helper function safety. Hyperupcalls can call helper functions to improve performance and to help limit the number of runtime instructions. Helper functions are a standard eBPF feature, and the verifier enforces which helper functions may be called; the allowed set may vary from event to event, depending on hypervisor policy. For example, the hypervisor might not allow flush_tlb_vcpu during memory reclamation because it might block the system for an extended period of time.

The verifier checks that the inputs of helper functions are safe and that helper functions access only memory they are allowed to access. While these checks could be done inside the helper functions, a new eBPF extension allows the verifier to statically validate helper function inputs. In addition, the hypervisor can set input policies on a per-event basis (for example, the memory size accessible to a global hyperupcall).

The number and complexity of helper functions should also be limited, as they become part of the trusted computing base. Therefore, we only include simple helper functions that rely on code the guest VM could already trigger directly or indirectly, such as interrupt injection.

eBPF security. Two proof-of-concept exploits of the recently disclosed Spectre hardware vulnerabilities [38,30] target eBPF, which may raise concerns about the security of eBPF and hyperupcalls. These vulnerabilities are easier to exploit when an attacker can run unprivileged code in a privileged context, as with hyperupcalls, but measures to prevent the disclosed attacks exist [63]. In fact, these security vulnerabilities may make hyperupcalls even more attractive, because their mitigations (e.g., return stack buffer filling [33]) add extra overhead to the context switches performed by traditional paravirtual mechanisms such as upcalls and hypercalls.

3.5 Execution

Verified hyperupcalls are installed in a per-guest hyperupcall table. Once a hyperupcall has been registered and verified, the hypervisor executes it in response to events.

Hyperupcall patching. To avoid the overhead of testing whether a hyperupcall is registered, the hypervisor uses a code-patching technique known in Linux as “static keys” [12]: a no-op instruction at each hyperupcall call site in the hypervisor is patched into a call only when a hyperupcall is actually registered.
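
For readers unfamiliar with static keys, a rough sketch of the pattern follows; the key name and hook function are illustrative assumptions, and the real call sites live inside KVM:

```c
#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(huc_vmexit_registered);

extern void run_vmexit_hyperupcall(void);      /* hypothetical slow path */

void handle_exit_hook(void)
{
    /* Compiles to a NOP until static_branch_enable() is called at
     * registration time, when the NOP is patched into a jump. */
    if (static_branch_unlikely(&huc_vmexit_registered))
        run_vmexit_hyperupcall();
}
```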

Accessing remote VCPU state. Some hyperupcalls read or modify the state of remote VCPUs. These VCPUs may not be running, or their state may be accessed concurrently by other hypervisor threads. Even if a remote VCPU is preempted, the hypervisor may already have read some of its registers and not expect them to change before the VCPU resumes execution. If a hyperupcall writes to remote VCPU registers, it may break hypervisor invariants and even introduce security issues.

VCPU stands for virtual central processing unit. One or more VCPUs are assigned to every virtual machine (VM) in the cloud. The VM's operating system sees each VCPU as a single physical CPU core. If the host has multiple CPU cores, the VCPUs are actually scheduled across time slots on all available cores, allowing multiple VMs to be hosted on a smaller number of physical cores.

In addition, reading remote VCPU registers incurs high overhead, because part of the VCPU state may be cached in another CPU and must be written back to memory before it can be read. More importantly, on Intel CPUs the VCPU state cannot be accessed with ordinary instructions: the VMCS must first be “loaded” before its state can be accessed using special instructions (VMREAD and VMWRITE). Switching the loaded VCPU is expensive, taking roughly 1800 cycles on our system.

To improve performance, we define synchronization points at which the hypervisor typically preempts VCPUs and access to VCPU state is known to be safe. At these points, we “decache” the VCPU registers from the VMCS and write them to memory so that hyperupcalls can read them. A hyperupcall that writes to remote VCPU registers updates a dirty flag, marking that the hypervisor must reload the register values into the VMCS before resuming the VCPU. Hyperupcalls on remote VCPUs are executed on a best-effort basis and run only when the VCPU is at a synchronization point; the remote VCPU is prevented from resuming execution while the hyperupcall runs.

Using guest OS locks. Some OS data structures are protected by locks. Hyperupcalls that require a consistent view of guest OS data structures should follow the synchronization scheme dictated by the guest OS. However, a hyperupcall can only acquire locks opportunistically, because a VCPU may have been preempted while holding the lock. The locking implementation may need to be adapted to support locking by an external entity that is not a VCPU. Releasing a lock may also require a relatively large amount of code to handle slow paths, which could prevent a hyperupcall from passing timely verification.

While various interim solutions could be proposed, a complete solution appears to require the guest OS locks to be made hyperupcall-aware. It would also require support for calling eBPF functions from eBPF code, to avoid the code-size bloat that can cause verification failures. Because this support was only recently added, our implementation does not include lock support.

4 Use cases and evaluation

Our evaluation was guided by the following questions:

  • What is the overhead of using verified code (eBPF) versus native code? (§4.1)
  • How do hyperupcalls compare to other paravirtualization mechanisms? (§4.2, §4.3, §4.5)
  • How can hyperupcalls improve not only performance (§4.2, §4.3), but also the security (§4.5) and debuggability (§4.4) of virtualized environments?

Test platform. Our test platform was a 48-core, dual-socket Dell PowerEdge R630 server with Intel E5-2670 CPUs and a Seagate ST1200 disk, running Ubuntu 17.04 with Linux kernel v4.8. Benchmarks were run in guest virtual machines with 16 VCPUs and 8GB of RAM. Each measurement was performed 5 times and the average result is reported.

Hyperupcall prototype. We implemented prototype hyperupcall support on Linux v4.8 and KVM, the hypervisor integrated into Linux. Hyperupcalls are compiled with a patched LLVM 4 and verified with the Linux kernel eBPF verifier, patched as described in Section 3. We enabled the Linux eBPF “JIT” engine, which compiles eBPF code to native machine code after verification. The correctness of the BPF JIT engine has been studied and can be verified [74].

Use cases. We evaluated the four hyperupcall use cases listed in Table 5. Each use case demonstrates the use of hyperupcalls for a different hypervisor event and uses hyperupcalls of different complexity.

4.1 Hyperupcall overhead

We evaluated the overhead of using verified code to handle hypervisor requests by comparing the running time of each hyperupcall against native code with the same functionality (Table 5). Overall, we found that the absolute overhead of verified code relative to native code is small (<250 cycles). For the TLB use case, which handles TLB shootdowns to inactive cores, our hyperupcall runs faster than native code because the TLB flush is deferred. The overhead of verifying a hyperupcall is also small: for the longest hyperupcall, verification took 67ms.

4.2 TLB Shootdown

While interrupt delivery to a running VCPU is usually efficient, there is a significant penalty if the target VCPU is not running. This happens when the CPU is oversubscribed and scheduling the target VCPU requires preempting another VCPU. For synchronous inter-processor interrupts (IPIs), the sender resumes execution only after the receiver indicates that the IPI has been delivered and handled, compounding the overhead.

The overhead of IPI delivery is most pronounced for translation lookaside buffer (TLB) shootdowns, a software protocol that the OS uses to keep the TLBs, the caches of virtual-to-physical address mappings, coherent. Because common CPU architectures (for example, x86) do not keep TLBs coherent in hardware, an OS thread that modifies a mapping sends IPIs to the other CPUs that might cache the mapping, and those CPUs then flush their TLBs. In other words, when a processor changes the virtual-to-physical mapping of an address, it must tell the other processors to invalidate the mapping in their caches; this process is called a “TLB shootdown.”

We use hyperupcalls to handle this situation by registering a hyperupcall that handles TLB shootdowns when an interrupt is delivered to a preempted VCPU. The hypervisor invokes the hyperupcall with the interrupt vector and the target VCPU after ensuring the VCPU is quiescent. Our hyperupcall checks whether the vector is the “remote function call” vector and whether the function pointer equals the OS's TLB flush function. If so, it runs this function with very few modifications: (1) instead of flushing the TLB with a native instruction, it uses a helper function to perform the TLB flush, deferring it until the next VCPU re-entry; and (2) the TLB flush is performed even when the VCPU has interrupts disabled, since experimentally this improves performance.
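
A rough sketch of this check follows; the context fields and helper names are assumptions, only CALL_FUNCTION_VECTOR and flush_tlb_func echo real Linux names, and the vector value shown is purely illustrative:

```c
#define CALL_FUNCTION_VECTOR 0xfc   /* value illustrative only                */
#define HUC_NOT_HANDLED      0
#define HUC_HANDLED          1

struct huc_intr_ctx {
    int   vector;             /* interrupt vector being delivered             */
    void *ipi_func;           /* function the sender asked remote CPUs to run */
    int   target_vcpu;
};

extern void flush_tlb_func(void *info);            /* guest OS symbol        */
extern void huc_helper_flush_tlb_vcpu(int vcpu);   /* hypothetical helper    */

int huc_handle_interrupt(struct huc_intr_ctx *ctx)
{
    if (ctx->vector != CALL_FUNCTION_VECTOR ||
        ctx->ipi_func != (void *)flush_tlb_func)
        return HUC_NOT_HANDLED;        /* deliver the IPI the normal way      */

    /* Defer the flush via a helper; it is applied on the next VCPU re-entry. */
    huc_helper_flush_tlb_vcpu(ctx->target_vcpu);
    return HUC_HANDLED;
}
```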

Admittedly, there is an alternative solution: introducing a hypercall that delegates TLB flushes to the hypervisor [52]. While this solution avoids unnecessary TLB flushes, it requires a separate code path, which can introduce hidden bugs [43], complicate integration with OS code, or add extra overhead [44]. That solution is also limited to TLB flushes and cannot handle other interrupts, such as rescheduling IPIs.

Evaluation. We ran the Apache web server in the guest VM using the default mpm_event module [23], which runs multi-threaded workers to handle incoming requests. To measure performance, we used ApacheBench, the Apache HTTP server benchmarking tool, to generate 10K requests over 16 connections and measured request latency. The results, shown in Figure 2, show that hyperupcalls reduce latency by 1.3×. It may seem surprising that performance improves even when the physical CPUs are not oversubscribed; however, because VCPUs are frequently momentarily idle in this benchmark, they can still trigger exits to the hypervisor.

4.3 Discarding Free Memory

By definition, free memory does not contain any needed data and can be discarded. If the hypervisor knows which memory is free in the guest VM, it can discard that memory during memory reclamation, snapshotting, live migration, or lock-step execution [20], and avoid the I/O operations needed to save and restore its contents. However, the information about which memory pages are free is held by the guest and is not available to the hypervisor due to the semantic gap.

Over the years, several paravirtual mechanisms have been proposed to inform the hypervisor which memory pages are free. However, these solutions either couple the guest to the hypervisor [60], incur overhead due to frequent hypercalls [41], or are limited to live migration [73]. All of these mechanisms are inherently limited in that, without some coupling between the guest and the hypervisor, the guest must still explicitly communicate to the hypervisor which pages are free.

In contrast, a hypervisor that supports hyperupcalls does not need to be notified about free pages. Instead, the guest VM registers a hyperupcall that determines whether a page can be discarded based on its page metadata (Linux's struct page), modeled on Linux's is_free_buddy_page function. When the hypervisor performs an operation that could benefit from discarding free guest memory pages, such as reclaiming a page, it invokes this hyperupcall to check whether the page is discardable. The hyperupcall is also invoked after a page has been unmapped, to prevent a race in which a page is discarded after it is no longer free.
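
A rough sketch of such a discard check follows, borrowing the spirit of is_free_buddy_page; the struct layout and flag are simplified assumptions for illustration, not the actual kernel metadata:

```c
/* Illustrative stand-in for the guest's struct page bookkeeping. */
struct guest_page {
    unsigned long flags;
    unsigned int  page_type;     /* simplified buddy-allocator bookkeeping    */
};

#define GP_BUDDY 0x1             /* illustrative "page is on a buddy free list" */

extern struct guest_page *guest_vmemmap;   /* guest symbol, fixed up at registration */

int huc_is_discardable(unsigned long pfn)
{
    struct guest_page *p = &guest_vmemmap[pfn];

    /* A page sitting in the buddy allocator holds no data worth preserving. */
    return (p->page_type & GP_BUDDY) != 0;
}
```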

Checking whether a page can be discarded must be done with a global hyperupcall, because the answer must be provided within a bounded, short time. As a result, the guest can register only part of its memory for the hyperupcall's use, since this memory must never be paged out to ensure timely execution. Our Linux guest registered the page metadata, approximately 1.6% of guest physical memory.

Evaluation. To evaluate the performance of the “memory discard” hyperupcall, we measured its impact on a guest VM whose memory was being reclaimed due to memory pressure. When memory is scarce, the hypervisor can perform “uncooperative swapping”: reclaiming guest memory and swapping it out to disk. However, this approach often leads to suboptimal reclamation decisions. Alternatively, the hypervisor can use memory ballooning, a paravirtual mechanism in which a guest module is informed of host memory pressure and causes the guest to reclaim memory directly [71]. The guest can then make knowledgeable reclamation decisions and discard free pages. Although ballooning generally performs well, performance suffers when memory must be reclaimed suddenly [4,6] or when the guest's disk is on network-attached storage [68], and it is therefore not used under high memory pressure [21].

To evaluate memory ballooning, uncooperative swapping, and swapping with hyperupcalls, we ran a scenario in which memory and physical CPUs must suddenly be repurposed to accommodate a new guest VM. In the guest, we started and exited “memhog,” leaving 4GB reclaimable in the guest. Next, we kept the guest busy running the sysbench CPU benchmark, a CPU-intensive task with a small memory footprint that counts prime numbers using all of the guest's virtual processors [39].

Then, while the system was busy, we simulated the need to reclaim resources to start a new guest VM by increasing memory and CPU overcommitment. We reduced the number of physical CPUs available to the guest and limited it to just 1GB of memory. We measured the time required to reclaim memory as a function of the number of physical CPUs allocated to the guest (Figure 3a); this simulates starting a new guest. We then removed the memory pressure and measured the time to run a guest application with a large memory footprint, using the 4GB sysbench file-read benchmark (Figure 3b); this simulates the guest reusing the pages reclaimed by the hypervisor.

When the physical CPUs are overcommitted, ballooning reclaims memory slowly (up to 110 seconds), because the memory reclamation work competes for CPU time with the CPU-intensive task. Uncooperative swapping (the swap baseline) reclaims faster (32 seconds), but because it does not know which memory pages are free, it incurs higher overhead when the guest later reuses those free pages. In contrast, with hyperupcalls the hypervisor can prioritize reclaiming free pages and discard them, reclaiming memory up to eight times faster than ballooning while slowing memory reuse by only 10%.

Of course, CPU overcommitment is not the only case in which ballooning is unresponsive or unusable. When memory pressure is very high, the hypervisor avoids ballooning and uses host-level swapping instead [67]. Hyperupcalls can operate alongside ballooning: the hypervisor can use ballooning under normal conditions and hyperupcalls when resource pressure is high or ballooning is unresponsive.

4.4 Tracing

Event tracing is an important tool for debugging correctness and performance problems. However, collecting traces for virtualized workloads has limitations. Traces collected inside a guest VM do not show hypervisor events, such as forced VM exits, which can have a significant impact on performance. Traces collected in the hypervisor require knowledge of guest OS symbols [15]; such traces cannot be collected in cloud environments. In addition, each trace only captures part of the events and does not show how guest and hypervisor events interleave.

To solve this problem, we run the Linux kernel tracing facility ftrace [57] inside a hyperupcall. Ftrace is well suited to running in a hyperupcall: it is simple, lock-free, and allows concurrent tracing in multiple contexts: non-maskable interrupts (NMIs), hardware and software interrupt handlers, and user processes. It is therefore easy to trace hypervisor events alongside guest events. With the ftrace hyperupcall, the guest VM can trace both hypervisor and guest events in a unified log, simplifying debugging. Because all events are traced using guest logic, a new OS version can change its tracing logic without changing the hypervisor.

Evaluation. Despite the complexity of this hyperupcall (3308 eBPF instructions), tracing is efficient, because most of the code handles uncommon events such as the trace page filling up. Tracing with the hyperupcall is 232 cycles slower than with native code, which is still much shorter than the time of a context switch between the hypervisor and the guest.

Tracing is a useful tool for performance debugging and can expose various sources of overhead [79]. For example, by registering ftrace on the VM-exit event, we saw that many processes (including short-lived ones) trigger multiple VM exits by executing the CPUID instruction, which enumerates CPU features and must be emulated by the hypervisor. We found that the GNU C library, used by most Linux applications, uses CPUID to determine the supported CPU features. This overhead could be avoided by extending the Linux virtual dynamic shared object (vDSO) so that applications can query the supported CPU features without triggering an exit.

4.5 Kernel Self-Protection

A common security hardening mechanism used by operating systems is self-protection: write-protecting OS code and immutable data. However, this protection is implemented using the page tables, allowing malware to circumvent it by modifying page table entries. To protect against such attacks, the use of nested page tables has been proposed, since those tables cannot be accessed from within the guest [50].

However, nesting can only support a limited number of policies, such as whitelisting guest code that is allowed to access protected memory. Hyperupcalls are more expressive and allow the guest to specify memory protections in a flexible way.

We use hyperupcalls to provide hypervisor-level guest kernel self-protection that can easily be modified to accommodate complex policies. In our implementation, the guest sets up a bitmap that marks protected pages and registers a hyperupcall on the exit event, which checks the exit reason, whether a memory access occurred, and, based on the bitmap, whether the guest attempted to write to protected memory. If an attempt to write protected memory is detected, a VM shutdown is triggered. The guest sets up another hyperupcall on the “page map” event, which returns the protection required for a guest page frame, so that the protection is applied even to pages the hypervisor maps proactively.
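
A rough sketch of the exit-event check follows, assuming the registered memory holds a one-bit-per-page-frame bitmap and that the context exposes the faulting guest-physical address; all names are illustrative, not the actual implementation:

```c
#define HUC_ALLOW       0
#define HUC_SHUTDOWN_VM 1

struct huc_exit_ctx {
    int           is_memory_write;   /* was the exit caused by a write access? */
    unsigned long gpa;               /* guest-physical address of the access   */
};

extern unsigned long *protected_bitmap;   /* 1 bit per guest page frame,
                                             maintained by the guest kernel    */

int huc_check_write(struct huc_exit_ctx *ctx)
{
    unsigned long pfn = ctx->gpa >> 12;   /* 4KB page frame number */

    if (!ctx->is_memory_write)
        return HUC_ALLOW;

    if ((protected_bitmap[pfn / 64] >> (pfn % 64)) & 1)
        return HUC_SHUTDOWN_VM;      /* write to write-protected kernel memory */

    return HUC_ALLOW;
}
```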

Evaluation. The hyperupcall code is simple, but it adds 43 cycles of overhead to every exit. Arguably, only workloads that already suffer from many context switches are affected by this extra overhead, and modern CPUs increasingly avoid such frequent exits.

5 Conclusion

Bridging the semantic gap is key to performance and to the hypervisor's ability to provide advanced services to guests. Hypercalls and upcalls are used to bridge the gap today, but they have several disadvantages: hypercalls cannot be initiated by the hypervisor, upcalls do not have a bounded runtime, and both incur context-switch penalties. Introspection, an alternative that avoids context switches, can be unreliable because it relies on observation rather than an explicit interface. Hyperupcalls overcome these limitations by letting guest VMs expose their logic to the hypervisor, which avoids context switches by executing guest VM logic directly and safely.

After the main text

When the teacher commented in class, she agreed with my take on the idea of a third-party, notarizing monitor, which is great! The semantic gap is a classic virtual machine problem: it necessarily exists, yet it has a big impact on performance. So why not just put a monitor inside the virtual machine, as introspection does? Because that would break the VM's encapsulation. The virtual machine has to keep believing that it is a complete machine rather than a time slice of some other physical machine, so how would it feel to find a monitor inside itself?! With a third party, the virtual machine can treat the whole thing as external communication.