The author
Jiang Biao is a senior engineer at Tencent Cloud who has focused on operating-system technology for more than 10 years and is a veteran Linux kernel enthusiast. He is currently responsible for the development of Tencent Cloud's cloud-native OS and for OS/virtualization performance optimization.
Introduction:
Mixed deployment (co-location) usually means deploying online services (typically latency-sensitive, high-priority tasks) and offline tasks (typically CPU-hungry, low-priority batch jobs) on the same node to improve node resource utilization. The key difficulty lies in the underlying resource isolation technology, which depends heavily on the OS kernel; the isolation capabilities offered by the stock Linux kernel are once again inadequate (or at least imperfect) in the face of co-location requirements, and deep hacking is still needed to meet production-grade requirements.
(Cloud-native) resource isolation mainly covers four areas: CPU, memory, IO, and network. This article focuses on CPU isolation and its background; follow-up articles will expand into the other areas step by step.
Background
In both IDC and cloud scenarios, low resource utilization is a common problem for most users and vendors. On the one hand, hardware is expensive (everyone has to buy it, and most of the core technology is in other people's hands, so there is no pricing power and usually little bargaining power) and its life cycle is short (it has to be replaced every few years). On the other hand, this expensive hardware is badly under-used: in most scenarios the average CPU utilization is very low. If I claim the average (over days or weeks) is typically no more than 20%, most readers probably won't object; it means that very expensive hardware actually delivers less than a fifth of its capacity, which should pain anyone trying to run things frugally.
Therefore, improving the resource utilization of hosts (nodes) is a goal well worth pursuing, and the benefits are obvious. The direction is also straightforward:
The conventional approach: run more workloads per node. Easier said than done; everyone has tried. The core difficulty is that normal workloads have distinct peaks and valleys.
You would like utilization to look something like this:
But the reality is more likely to look like this:
Capacity planning has to be done against the worst case (assuming all services have the same priority). Specifically, at the CPU level, capacity must be planned against the CPU peak (possibly the weekly, or even the monthly/yearly, peak), usually with some extra headroom for emergencies.
In reality, peaks are high but the mean is low, so in most scenarios the average CPU, and hence actual CPU utilization, ends up low.
The assumption above is that "all services have the same priority", so the worst case of every service determines the overall outcome (low resource utilization). If we change the idea and distinguish services by priority, there is much more room to maneuver: we can sacrifice the service quality of low-priority workloads (which is usually tolerable) to guarantee the quality of high-priority workloads. That way, alongside a moderate amount of high-priority business we can deploy much more (low-priority) business, improving overall resource utilization.
Mixed deployment (co-location) was born from this. The "mix" here essentially means "distinguishing priorities". In the narrow sense it can simply be understood as "online + offline" co-location; in the broad sense it extends further, to co-locating workloads of multiple priorities.
The core technologies involved include two aspects:
- Underlying resource isolation technology. Provided (usually) by the operating system (kernel), this is the core focus of this article.
- Upper-layer resource scheduling technology. It is (usually) provided by an upper-level resource orchestration/scheduling framework (typically K8s), which we plan to cover in a separate series of articles.
Co-location is also a very hot topic and technical direction in the industry. The mainstream leading vendors keep investing in it; the value is obvious and the technical threshold (barrier) is high. The famous K8s (and its predecessor Borg) actually originated from Google's co-location scenarios. Judging by its history and results, Google is regarded as the industry benchmark, claiming average CPU utilization of around 60%. For details, see its classic papers:
Dl.acm.org/doi/pdf/10….
Storage.googleapis.com/pub-tools-p…
Of course, Tencent (Cloud) also started exploring co-location very early and has gone through several major iterations of technology and architecture; it now has a considerable deployment scale and good results. The details deserve a separate article and are not covered here.
Technical challenges
As mentioned above, the underlying resource isolation technology is very important in the mixed scenario. The “resources” can be divided into four categories as a whole:
- CPU
- Memory
- IO
- Network
This paper focuses on CPU isolation technology and mainly analyzes the technical difficulties, current situation and solutions in CPU isolation.
CPU isolation
Of the four categories mentioned above, CPU resource isolation is arguably the most fundamental isolation technique. On the one hand, CPU is a compressible (reusable) resource that is comparatively easy to share, and the upstream solutions are relatively usable. On the other hand, CPU is closely tied to the other resources: their use (allocation/release) usually depends on process context and therefore, indirectly, on CPU. For example, when a task's CPU is isolated (suppressed), its IO and network requests are often (mostly) suppressed as well, simply because the task is not scheduled.
Therefore, the quality of CPU isolation indirectly affects the isolation of other resources, which makes CPU isolation the most central isolation technology.
Kernel scheduler
In the OS, CPU isolation is essentially implemented by the kernel scheduler, the basic kernel component that distributes CPU time among workloads. Concretely, this means the default scheduler of the Linux kernel that we deal with most: the CFS scheduler (technically a scheduling class, i.e. a set of scheduling policies).
The kernel scheduler decides when which task (process) is selected to run on the CPU; in a co-location scenario it therefore determines how much CPU time online and offline tasks each get, and hence the CPU isolation effect.
Upstream kernel isolation effect
The Linux kernel scheduler provides five scheduling classes by default, and there are basically only two that can be used by actual services:
- CFS
- Real-time scheduler (RT/Deadline)
The essence of CPU isolation in co-location scenarios requires:
- Try to suppress offline tasks when online tasks need to run
- When the online task is not running, the offline task runs on the idle CPU
There are several approaches to “suppression” based on the Upstream kernel:
Priority
You can lower the priority (nice value) of offline tasks, or raise the priority of online tasks. Without changing the scheduling class (i.e. staying with the default CFS), the dynamically adjustable nice range is [-20, 19].
The priority is reflected in the time slice allotted within a single scheduling period; specifically (a formula sketch follows the list):
- The weight ratio between the default priority 0 and the lowest priority 19 is 1024/15, roughly 68:1
- The weight ratio between the highest priority -20 and the default priority 0 is 88761/1024, roughly 87:1
- The weight ratio between the highest priority -20 and the lowest priority 19 is 88761/15, roughly 5917:1
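For reference, here is where these ratios come from (a sketch based on the kernel's documented weight model, not a precise derivation): each runnable CFS task receives a share of the scheduling period proportional to its weight, and the nice-to-weight table is built so that each nice step scales the weight by roughly 1.25x, with nice 0 mapping to weight 1024:

```latex
\mathrm{slice}_i \;\approx\; \frac{w_i}{\sum_j w_j}\cdot \mathrm{period},
\qquad
w(\mathrm{nice}) \;\approx\; \frac{1024}{1.25^{\,\mathrm{nice}}}
% e.g. w(19) ~ 15 and w(-20) ~ 88761 in the kernel's weight table,
% which yields the 68:1, 87:1 and 5917:1 ratios quoted above
```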
The suppression ratio looks fairly high. With offline tasks set to the lowest priority (nice 19) and online tasks left at the default nice 0 (common practice), the online:offline time-slice weight ratio is 68:1.
Assuming a scheduling period of 24 ms (the default on many systems), a rough estimate is that an offline task can get about 24 ms / 69 ≈ 348 us per period, i.e. about 1/69 ≈ 1.4% of the CPU.
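As a concrete illustration (a minimal sketch, not a recommended production setup), demoting an already-running offline process to the lowest nice value takes one setpriority(2) call; the PID below is a placeholder:

```c
/* Sketch: demote a hypothetical "offline" process to the lowest CFS priority.
 * nice 19 corresponds to weight 15, vs. weight 1024 for the default nice 0. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>

int main(void)
{
    pid_t offline_pid = 12345;   /* placeholder: PID of an offline task */

    if (setpriority(PRIO_PROCESS, offline_pid, 19) != 0) {
        fprintf(stderr, "setpriority: %s\n", strerror(errno));
        return 1;
    }
    return 0;
}
```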
The actual runtime behavior is slightly different. For throughput, CFS protects a minimum granularity per run (the minimum time a process runs once it gets on the CPU): sched_min_granularity_ns, typically set to 10 ms. This means that once an offline task preempts and starts running, it may run for up to 10 ms, so the scheduling delay (round-robin switch delay) of an online task can also reach 10 ms.
Wakeup preemption also has a minimum-granularity protection (it guarantees a minimum run time for the task that would otherwise be preempted): sched_wakeup_granularity_ns, typically 4 ms. This means that once an offline task is running, the wakeup latency (another typical scheduling delay) of online tasks can reach 4 ms.
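You can check what your system actually uses. On kernels of that era (e.g. 5.4) these knobs are exposed under /proc/sys/kernel/ (newer kernels move them to /sys/kernel/debug/sched/); a small sketch:

```c
/* Sketch: print the two CFS granularity knobs discussed above.
 * Paths assume a ~5.4-era kernel exposing them via procfs. */
#include <stdio.h>

static long read_ns(const char *path)
{
    long val = -1;
    FILE *f = fopen(path, "r");

    if (f) {
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
    }
    return val;
}

int main(void)
{
    printf("sched_min_granularity_ns    = %ld\n",
           read_ns("/proc/sys/kernel/sched_min_granularity_ns"));
    printf("sched_wakeup_granularity_ns = %ld\n",
           read_ns("/proc/sys/kernel/sched_wakeup_granularity_ns"));
    return 0;
}
```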
In addition, adjusting the priority does not change the preemption logic. In particular, the preemption paths (wakeup preemption and periodic/tick preemption) do not consult the priority: they do not apply a different preemption policy for different priorities (for example, they do not suppress preemption by an offline task just because its priority is low, nor shorten how long it keeps the CPU after preempting). Offline tasks can therefore still preempt online tasks unnecessarily, which causes interference.
Cgroup (CPU share)
The Linux kernel provides CPU cgroups (which map to containers/pods); priorities can be expressed by setting a cgroup's share value, and an offline cgroup can be suppressed by lowering its share. The default cpu.shares of a cgroup v1 group is 1024, and the default cpu.weight of a cgroup v2 group is 100. Lowering the offline group to the minimum therefore gives a time-slice weight ratio of roughly 1024:1 (v1) or 100:1 (v2) against a default online group, i.e. roughly 0.1% or 1% of the CPU for the offline group.
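As an illustration (a sketch only; the cgroup paths below are hypothetical and assume the usual mount points), suppressing an offline cgroup amounts to writing a small value into its cpu.shares (v1) or cpu.weight (v2) file:

```c
/* Sketch: lower the CPU share/weight of a hypothetical "offline" cgroup.
 * cgroup v1: cpu.shares defaults to 1024 (minimum 2).
 * cgroup v2: cpu.weight defaults to 100 (minimum 1). */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    int ret = (fputs(val, f) < 0) ? -1 : 0;
    fclose(f);
    return ret;
}

int main(void)
{
    /* cgroup v1 hierarchy (path is a placeholder) */
    write_str("/sys/fs/cgroup/cpu/offline/cpu.shares", "2");

    /* cgroup v2 unified hierarchy (path is a placeholder) */
    write_str("/sys/fs/cgroup/offline/cpu.weight", "1");
    return 0;
}
```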
The actual running logic is still limited by sched_min_granularity_ns and sched_wakeup_granularity_ns. The logic is similar to the priority scenario.
Similar to the priority scheme, preemption logic is not optimized for share values and there may be additional interference.
Special policy
CFS also provides a special scheduling policy, SCHED_IDLE, for running very low-priority tasks; it looks as if it were designed precisely for "offline tasks". A SCHED_IDLE task is essentially a CFS task with weight 3, so its time-slice weight ratio against a normal (nice 0) task is 1024:3, roughly 334:1, giving the offline task about 0.3% of the CPU.
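A minimal sketch of marking a task SCHED_IDLE (the PID is a placeholder; SCHED_IDLE carries no real-time priority, so sched_priority must be 0):

```c
/* Sketch: move a hypothetical "offline" task into the SCHED_IDLE policy. */
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

int main(void)
{
    pid_t offline_pid = 12345;                      /* placeholder offline task */
    struct sched_param sp = { .sched_priority = 0 };

    if (sched_setscheduler(offline_pid, SCHED_IDLE, &sp) != 0) {
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
        return 1;
    }
    return 0;
}
```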
The actual running logic is still limited by sched_min_granularity_ns and sched_wakeup_granularity_ns. The logic is similar to the priority scenario.
CFS does specifically tune the preemption logic for SCHED_IDLE tasks, suppressing how much they can disturb others (for example, a waking normal task may preempt a running SCHED_IDLE task without the usual wakeup-granularity protection). From that point of view, SCHED_IDLE is a small step toward "fitting" co-location scenarios (even though that was not Upstream's intent).
In addition, SCHED_IDLE is a per-task flag: there is no SCHED_IDLE capability at the cgroup level, while with group scheduling CFS first picks a (task) group and then picks a task within that group. So SCHED_IDLE alone is not practical for co-location in cloud-native (container) scenarios.
Overall, although CFS provides priorities (share and SCHED_IDLE are similar in principle, being essentially priorities too) and can suppress low-priority tasks to some extent, the core design of CFS is "fairness", which rules out absolute suppression of offline tasks. Even at the lowest "priority", an offline task still gets a fixed slice of time, and that slice is snatched from online tasks rather than taken from otherwise idle CPU time. In other words, CFS's "fair design" cannot completely shield online tasks from offline interference, and cannot deliver a perfect isolation effect.
In addition, all of the schemes above work by pushing offline tasks to the bottom of the priority range, which in essence also compresses the priority space available to offline tasks. In other words, if you want to further differentiate priorities among offline tasks (offline workloads may have their own QoS levels, and such a need does exist in practice), there is no room left to do so.
From the implementation point of view, since online and offline tasks both use the CFS scheduling class, at runtime they share the runqueue (rq), their loads are added together, and they share the load-balance mechanism. On the one hand, operations on the shared structures (such as the runqueue) must be synchronized with locks, and lock primitives have no notion of priority, so offline interference cannot be excluded there. On the other hand, load balancing cannot distinguish offline tasks and give them special handling (for example, more aggressive balancing to avoid starvation and improve CPU utilization), so the balancing behavior of offline tasks cannot be controlled.
Real-time priority
At this point you might be thinking: if absolute preemption (suppression of offline tasks) is what we need, why not use a real-time scheduling class (RT/Deadline)? Compared with CFS, the real-time classes provide exactly the "absolute suppression" effect.
That is true. With this approach, online services are set to a real-time class while offline tasks stay in CFS. Online tasks then absolutely preempt offline ones, and if you are worried about offline starvation, the RT throttling mechanism (rt_throttle) can keep offline tasks from being starved completely.
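For completeness, a sketch of this layout (placeholder PID; requires CAP_SYS_NICE): the online task is switched to SCHED_FIFO, while the default RT throttling settings (sched_rt_runtime_us = 950000 out of sched_rt_period_us = 1000000) cap real-time tasks at roughly 95% of each period, leaving the rest for CFS (i.e. offline) tasks:

```c
/* Sketch: promote a hypothetical "online" task to the SCHED_FIFO real-time
 * policy. With the default RT throttling sysctls
 * (/proc/sys/kernel/sched_rt_runtime_us = 950000,
 *  /proc/sys/kernel/sched_rt_period_us  = 1000000),
 * RT tasks are capped at ~95% CPU, so CFS (offline) tasks are not fully starved. */
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

int main(void)
{
    pid_t online_pid = 23456;                          /* placeholder online task */
    struct sched_param sp = { .sched_priority = 50 };  /* RT priority: 1..99 */

    if (sched_setscheduler(online_pid, SCHED_FIFO, &sp) != 0) {
        fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
        return 1;
    }
    return 0;
}
```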
It looks "perfect", but it is not. The essence of this approach is to compress the priority space and living space of online tasks (the mirror image of lowering offline priority as above): online services can only use the real-time scheduling classes, even though most online services do not actually have real-time characteristics, and they can no longer use CFS's native capabilities (fair scheduling, cgroup support, and so on, which are exactly what online tasks need).
Put simply, the problem is that the real-time classes cannot meet the actual needs of online tasks. Online workloads are not real-time tasks by nature, and forcing them into a real-time class has serious side effects; for example, system tasks (the OS's own work, such as various kernel threads and system services) may be starved.
To summarize, for real-time priority scenarios:
- The real-time classes do give "absolute suppression" over CFS tasks, which is exactly the effect we want.
- However, with the current Upstream kernel implementation, that means online tasks can only be set to a real-time class (higher priority than CFS), which is unacceptable in practical application scenarios.
Priority inversion
At this point there is probably a big question mark in your mind: won't "absolute suppression" cause priority inversion? And if so, what do we do about it?
The answer is: yes, there is a priority inversion problem.
Let me explain the logic of priority inversion in this scenario. Suppose online and offline tasks share some resource (for example certain kernel-global data, such as the /proc file system). If an offline task acquires the lock protecting that shared resource (abstractly speaking; it need not literally be a lock) and is then "absolutely suppressed" and never able to run, an online task that also needs the resource will block waiting for the lock. That is priority inversion, and it can lead to deadlock (or at least long blocking). Priority inversion is a classic problem that any scheduling model has to consider.
Roughly, the conditions for priority inversion are:
- A resource is shared between online and offline tasks.
- The shared resource is accessed concurrently and is protected by a sleeping lock.
- After taking the lock, the offline task is completely and absolutely suppressed and never gets a chance to run, i.e. all CPUs are 100% occupied by online tasks. (In theory, as long as any CPU is idle, the offline task can still get to run via the load-balance mechanism.)
In cloud-native co-location scenarios, how to deal with priority inversion depends on how you look at the problem. We can view it from the following perspectives:
- How likely is priority inversion? This depends on the actual application scenario. In theory, if online and offline services do not share resources, priority inversion will not occur. In the cloud native scenario, there are roughly two cases:
(1) Secure-container scenario. Here, services actually run inside virtual machines (abstractly speaking), and the virtual machine itself already isolates most resources, so priority inversion can largely be avoided (and where it does exist, it can be handled case by case).
(2) Ordinary-container scenario. Here, services run in containers that do share resources, such as kernel resources and shared file systems. As analyzed above, even when a shared resource exists, priority inversion still requires fairly stringent conditions; the most critical one is that all CPUs are 100% occupied by online tasks. That is very rare in real scenarios and quite extreme, and such extreme cases can be handled individually.
Therefore, in (most) real cloud-native scenarios, we can assume that priority inversion can be avoided, provided the scheduler optimization/hack is good enough.
- How should priority inversion be handled? Although it only occurs in extreme scenarios, how should it be handled if it must be (and Upstream has to consider it)?
(1) Upstream's idea. The CFS implementation in the native Linux kernel reserves a weight even for the lowest priority (think of SCHED_IDLE), which means that even the lowest-priority tasks still get a time slice, so priority inversion is avoided. This has always been the community's attitude: stay general, and cover even the extreme cases. That design is precisely why absolute suppression cannot be achieved. From a design point of view it is fine, but for cloud-native co-location it is not perfect: it does not sense how starved offline tasks actually are, i.e. offline may preempt online even when offline is not starving at all, causing unnecessary interference.
(2) Another idea: a design optimized for cloud-native scenarios. Detect offline starvation and the possibility of priority inversion, and let offline tasks preempt only when starvation could actually lead to inversion (i.e. as a last resort). This avoids unnecessary preemption (interference) on the one hand, and avoids priority inversion on the other, achieving a (relatively) perfect result. Admittedly, such a design is not generic or graceful enough to be acceptable to Upstream.
Hyperthreading interference
At this point, one key issue is still missing: hyper-threading interference. This is a chronic pain point of co-location scenarios, for which the industry has had no targeted solution.
The problem: hyperthreads on the same physical core share core hardware resources such as caches and execution units. When an online task and an offline task run on a pair of sibling hyperthreads at the same time, they interfere with each other by competing for these hardware resources, and CFS was not designed with this in mind.
As a result, online service performance degrades under co-location. In actual tests with a CPU-intensive benchmark, hyper-threading caused 40%+ performance interference.
Note: according to Intel's official data, a physical core with both hyperthreads busy delivers only about 1.2 times the performance of a single hyperthread.
Hyper-threading interference is a key problem in co-location scenarios, yet the original CFS design (almost) completely ignores it. That is not a flaw of CFS as such: it was designed not for co-location but for more general, broader scenarios.
Core scheduling
If you follow kernel scheduler development, you might ask: haven't you heard of Core Scheduling? Can't it solve the hyper-threading problem?
Core Scheduling is a new feature submitted in 2019 by the core scheduler maintainer Peter Zijlstra (building on the coscheduling concept proposed earlier in the community). Its main goal is to address the L1TF vulnerability (data leakage through the caches shared between hyperthreads): in cloud-host scenarios, it prevents processes of different VMs running on the same hyperthread pair from leaking data to each other.
The core idea is to prevent tasks with different tags (cookies) from running on the same pair of hyperthreads at the same time.
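For a feel of how the tagging works in practice, here is a minimal sketch using the prctl interface the patchset eventually exposes (PR_SCHED_CORE; it requires a kernel built with CONFIG_SCHED_CORE, the constants are spelled out in case the system headers predate it, and the PID is a placeholder):

```c
/* Sketch: give a process group its own core-scheduling cookie so that only
 * tasks sharing that cookie may run on sibling hyperthreads of one core. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/types.h>

#ifndef PR_SCHED_CORE
#define PR_SCHED_CORE                     62
#define PR_SCHED_CORE_CREATE              1
#define PR_SCHED_CORE_SCOPE_THREAD_GROUP  1
#endif

int main(void)
{
    pid_t online_pid = 23456;   /* placeholder: an online service process */

    /* Create a unique cookie for the whole thread group of online_pid. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, online_pid,
              PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) != 0) {
        fprintf(stderr, "prctl(PR_SCHED_CORE): %s\n", strerror(errno));
        return 1;
    }
    return 0;
}
```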
Current status: just recently (2021-04-22), after iterations up to v10 and nearly two years of discussion and rework, Peter has finally posted a version of the Core Scheduling patchset that looks like it might make it into mainline (exactly when is unclear):
Lkml.org/lkml/2021/4…
This topic deserves a separate in-depth article and is not covered here. Stay tuned.
Here are a few direct (personal) takes:
- Core Scheduling is a real solution to the hyper-threading problem.
- Core Scheduling was designed to address a security vulnerability (L1TF), not hyper-threading interference in co-location. Because it must guarantee security, it enforces absolute isolation, relies on complex (expensive) synchronization primitives (such as a core-wide rq lock), and implements heavyweight features such as core-scope task picking and excessive forced idling, plus the accompanying isolation of interrupt contexts.
- The design and implementation of Core Scheduling are too heavy and too expensive: enabling it causes serious performance regressions, and it cannot distinguish online from offline, so it is not a good fit for co-location scenarios.
Core Scheduling is also not designed for a cloud native hybrid scenario.
Summary
Based on the analysis above, the advantages and problems of the existing approaches can be summarized as follows.
CFS priority-based approaches (share/SCHED_IDLE are similar in principle). Advantages:
- Generic and capable; covers most application scenarios
- Can (mostly) avoid priority inversion problems
Problems:
- Isolation is not perfect (no absolute suppression)
- Various other minor faults (imperfections)
The scheme based on real-time task type has the following advantages:
- Absolute suppression, perfect isolation
- Has a mechanism (rt_throttle) to avoid starving offline tasks
Problems:
- Not applicable: online tasks (for the most part) cannot use the real-time scheduling classes.
- There is a mechanism (rt_throttle) to avoid priority inversion, but once it kicks in, isolation is no longer perfect.
Hyper-threading interference isolation based on Core Scheduling. Advantages:
- Perfect hyperthreading interference isolation effect
Problems:
- Overly heavy design, too much overhead
Conclusion
The Upstream Linux kernel is designed to be general and elegant, which makes it hard to meet the extreme needs of specific scenarios such as cloud-native co-location. To pursue the ultimate result you still need to hack deeply, and TencentOS Server is always on the way. (Sounds familiar? Yes, I have said that before!)
As for the concrete implementation and code analysis of the Linux kernel scheduler (based on the 5.4 kernel, tkernel4), we will publish a series of follow-up articles that combine the pain points of cloud-native scenarios with the corresponding code analysis, to demystify the Linux kernel and explore the broader hacking space. Stay tuned.
Thinking
- What is the ideal approach if you want online services to keep using CFS (taking advantage of CFS's capabilities) while still having the ability to "absolutely suppress" offline tasks? You can probably feel the answer coming!
- What if you do not insist on perfect isolation (absolute suppression), want to stay clear of the priority-inversion problem, still want "near-perfect" isolation, and want to reuse existing mechanisms as much as possible (no big scheduler hack, lower risk)? (Look back at the analysis of the existing approaches above, and you may feel the answer is getting close.)