This article originally appeared on my personal public account, TechFlow. Original writing is not easy; please consider following.


Today's article is the 12th in our distributed systems series, and we continue our look at cluster resource management systems.

In the last article, we took a brief look at what distributed cluster resource management is: the background of its birth, the problems it solves, and its advantages and disadvantages. That article stayed on the surface and didn't go deep into principles, so in this one we'll look at how cluster management systems actually work.

If you've forgotten that article, or you only started following recently, you can click the link below to review it.

Principles of distributed cluster resource management systems

The local-first principle

In big data applications, there is a basic design principle: we usually move the computation to the node where the data is stored, rather than pull the data off that node and compute elsewhere. The reason is easy to see: it minimizes network communication between nodes and reduces data transfer. Keep in mind that the data in big data scenarios is very large, often at the TB or PB level, and at least tens to hundreds of GB. Once data has to travel over the network, the overhead is very considerable.

To summarize this principle, we can call it the “local first” principle: tasks should execute as close to their data as possible. The ideal case is running on the very machine that holds the data, because that avoids data transfer entirely.

We can roughly measure the capability of a cluster scheduling system by this principle, and we can list three levels of locality. The first is node locality, where all computation happens on a single node and all data transfer is avoided; this is the best case. The second, somewhat weaker, is rack locality: the computation is not guaranteed to run on one node, but at least the machines involved are on the same rack. Machines within one rack can exchange data over the rack's internal network rather than the external network, which is still much faster. The worst case is when the nodes are scattered across different racks and no acceleration is possible; this is called global locality and incurs a lot of overhead.
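
To make the three locality levels concrete, here is a minimal Python sketch of how a scheduler might rank idle nodes by locality. The Node class, the node names, and the rack names are all invented for illustration; this is not any real system's implementation.

```python
# A minimal sketch (not from the article) of ranking candidate nodes by
# data locality. Node, rack, and task names are made up for illustration.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    rack: str

def locality_level(data_node: Node, candidate: Node) -> int:
    """Return 0 for node locality, 1 for rack locality, 2 for global (off-rack)."""
    if candidate.name == data_node.name:
        return 0          # node locality: no data transfer at all
    if candidate.rack == data_node.rack:
        return 1          # rack locality: traffic stays inside the rack
    return 2              # global: data must cross the core network

def pick_node(data_node: Node, idle_nodes: list[Node]) -> Node:
    # Prefer the most local idle node; ties are broken arbitrarily here.
    return min(idle_nodes, key=lambda n: locality_level(data_node, n))

if __name__ == "__main__":
    data = Node("node-3", "rack-A")
    idle = [Node("node-7", "rack-B"), Node("node-5", "rack-A")]
    print(pick_node(data, idle))   # node-5: same rack as the data
```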

We often say that a good cluster scheduling system “squeezes” every last bit of performance out of the machines in the cluster, but for a fixed workload the amount of computation consumed is basically stable. So besides machine utilization, locality is the other key lever, and sometimes the overhead of network IO hurts more than low machine utilization does.

Resource allocation details

Intuitively, resource allocation sounds simple: when a machine is idle we assign a task to it, and when the task finishes we reclaim the resources. In practice, however, there are many details that need to be designed and thought through carefully; overlooking even a small one can have serious consequences.

Let’s look at two questions. The first: what do we do when a new task arrives but there aren’t enough resources for it?

It’s easy to think of two strategies. The first is to do nothing and simply wait: wait until a running task finishes and its resources are released, then execute the current task. But what if the current task is very urgent? The tasks running right now may not be high priority, yet they take up a lot of resources. Do we just keep waiting?

So we think of a second strategy: preemption. Since the current task is high priority and the tasks already running are low priority, we can take some resources away from the low-priority tasks and devote them to the most important task of the moment, then resume the low-priority tasks once the high-priority one has finished. Unfortunately, this has problems too. Setting the technical issues aside for a moment: do you think anyone would willingly mark their own task as low priority? Everyone will make sure their tasks are the highest priority. Before long, the so-called priority setting becomes decoration, with everything running at the same priority, the highest one.
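
As a rough sketch of what the preemption strategy above might look like, here is a toy function that picks low-priority victims until enough resources are freed for the pending task. The Task fields and the reclaim policy are assumptions made for illustration, not a real scheduler's API.

```python
# A hedged sketch of priority-based preemption; fields and policy are illustrative.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    priority: int      # higher number = more important
    resources: int     # resource units currently held (running) or requested (pending)

def plan_preemption(pending: Task, running: list[Task], free: int) -> list[Task]:
    """Pick low-priority victims to reclaim just enough resources for `pending`."""
    victims = []
    # Consider the lowest-priority running tasks first.
    for t in sorted(running, key=lambda t: t.priority):
        if free >= pending.resources:
            break
        if t.priority < pending.priority:
            victims.append(t)
            free += t.resources
    # Empty list means we cannot free enough even by preempting everything eligible.
    return victims if free >= pending.resources else []
```

Note that this scheme only works if priorities are honest; once everyone marks their task as highest priority, the loop never finds an eligible victim, which is exactly the degeneration described above.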

There’s a follow-up to that question, but let’s first look at another one, which is a derivative of it. Different tasks need different amounts of resources, and need them in different ways. Some tasks, such as Spark or MapReduce jobs, can run with almost any amount of resources: with more resources they use more machines, with fewer they use fewer, and either way they run, just with different execution times. Other tasks, such as some machine learning jobs, may need a large amount of resources all at once and cannot run without the full amount. So the question is: when a new task arrives and the current resources cannot satisfy it, should we hold back and not allocate anything until we can satisfy the whole request in one go, or should we allocate some resources now and the rest later?
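
The two allocation styles can be contrasted with a tiny sketch: elastic jobs take whatever is free right now, while rigid jobs only start once the full request fits. The numbers and function names below are made up for illustration.

```python
# Illustrative contrast between incremental allocation and all-or-nothing
# allocation for the two kinds of tasks described above.

def allocate_incremental(requested: int, free: int) -> int:
    """Elastic jobs (Spark/MapReduce style) take whatever is available now."""
    return min(requested, free)

def allocate_all_or_nothing(requested: int, free: int) -> int:
    """Rigid jobs (e.g. some ML training) run only if the full request fits."""
    return requested if free >= requested else 0

free = 40
print(allocate_incremental(100, free))      # 40: starts small, grows later
print(allocate_all_or_nothing(100, free))   # 0: waits until 100 units are free
```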

You see, there are a lot of tricky details hiding in what looks like an unremarkable allocation strategy. This is exactly why I said before that current cluster resource management systems are far from mature and have only just gotten started: there is still no particularly good solution to the problems above.

Starvation and deadlock

Keep the two questions above in mind: if they are not handled well, starvation and deadlock will occur.

Starvation is when a task remains unscheduled indefinitely, for example because of how priorities are set. You are the only honest one who gives a normal priority to a task you consider unimportant, while the veterans all mark their tasks as highest priority. Because high-priority tasks keep being submitted, your task keeps getting pushed back; you think you’ll get results soon, but by the end of the day it may not even have started. At that point you either go with the flow and mark your own tasks as highest priority too, or you wait forever, or you face pressure from performance reviews and your boss for not getting enough done.

Thus a simple task-scheduling problem turns into a test of values, a classic case of bad money driving out good. It’s sad that those who follow the rules end up punished because a few people don’t.

The deadlock situation is a little easier to understand; if you’ve studied operating systems it should be familiar, since the principle is the same. For example, suppose we currently have two tasks, A and B, each of which needs 2/3 of the cluster’s resources to execute. Because the two tasks were submitted at roughly the same time, the system follows a first-come, first-allocated principle and gives each of them 1/2 of the resources. This produces a deadlock: neither task can run, and neither task will release its resources, so the stalemate continues until a human kills one of them.
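
A toy replay of this scenario in Python makes the stalemate obvious: both tasks hold half the cluster, neither holds enough to run, and nothing is free, so the state can never change on its own. The numbers come straight from the example above.

```python
# Toy replay of the deadlock: A and B each need 2/3 of the cluster, but a
# naive "split what's free" policy hands each of them 1/2.

cluster = 1.0
need = {"A": 2 / 3, "B": 2 / 3}
held = {"A": 0.5, "B": 0.5}          # the naive first-come allocation

free = cluster - sum(held.values())  # 0.0 left over
for task, h in held.items():
    runnable = h >= need[task]
    print(task, "can run" if runnable else f"stuck: holds {h:.2f}, needs {need[task]:.2f}")
# Neither task is runnable and free == 0, so nothing will ever change
# without outside intervention (killing one of them).
```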

As things stand, there doesn’t seem to be a perfect way to avoid both completely. The architect can only make trade-offs based on the actual situation of their own cluster, and there is a human element involved as well, such as setting rules within the team. In other words, to some extent this is no longer purely a systems problem, but a combined problem of systems and team coordination.

The scheduler

Let’s take a look at the architecture of common schedulers. There are three common types of schedulers. The first type is a centralized scheduler, the second type is a two-level scheduler, and the last type is a state-sharing scheduler.

Centralized scheduler

Let’s start with the centralized scheduler, which is the most intuitive and simplest:


Its design logic is fully centralized: there is only one global central scheduler in the whole system, and scheduling for every framework and computing task is handled by it. It’s a bit like a feudal clan system, where all of the family’s important affairs are managed by one person. Obviously the drawbacks of this approach are numerous, and both of the problems mentioned above require human intervention; otherwise they are hard to resolve.

Later, this design was improved by adding branching logic to the central scheduler. The result is called a multi-path scheduler:


So the overall structure doesn’t change much; it just adds a conditional branch. That is, different scheduling and allocation policies are implemented internally for different types of tasks. For example, small, fragmented tasks can be managed with priorities and a first-come, first-served policy, while a large machine learning task is only executed once its full resource request can be satisfied, to prevent deadlock, and so on.
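
Here is a minimal sketch of this multi-path idea, assuming just two task types and two invented policies; it is not any real scheduler's implementation.

```python
# A sketch of a "multi-path" centralized scheduler: one scheduler, but a
# branch per task type, each with its own policy. Task types and policies
# are illustrative assumptions.

from collections import deque

class MultiPathScheduler:
    def __init__(self, total_resources: int):
        self.free = total_resources
        self.small_queue = deque()   # small fragmented tasks: first come, first served
        self.big_queue = deque()     # large ML-style tasks: all-or-nothing

    def submit(self, task):
        # The single conditional branch that distinguishes multi-path from
        # single-path centralized scheduling.
        (self.small_queue if task["type"] == "small" else self.big_queue).append(task)

    def schedule_once(self):
        # Small tasks: run with whatever is available right now.
        if self.small_queue and self.free > 0:
            t = self.small_queue.popleft()
            granted = min(t["need"], self.free)
            self.free -= granted
            return t["name"], granted
        # Big tasks: only start when the full request fits, to avoid deadlock.
        if self.big_queue and self.free >= self.big_queue[0]["need"]:
            t = self.big_queue.popleft()
            self.free -= t["need"]
            return t["name"], t["need"]
        return None
```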

Compared with single-path centralized scheduling, multi-path centralized scheduling adds some flexibility, but overall scalability is still far from sufficient: concurrency is poor and resource utilization is not high enough. At large scale, scheduling performance easily becomes a bottleneck. Its advantage is a simple architecture that is easy to maintain.

Two-level scheduler

Since the centralized scheduler has many problems and isn’t flexible enough, a layer of structure was added on top of it to increase flexibility:


There is still a central scheduler in charge, but instead of scheduling tasks directly, it allocates the cluster’s resources to the framework schedulers using a coarse-grained policy. The logic for scheduling and executing tasks lives in the framework schedulers, which apply policies finer-grained than the central scheduler’s.

In addition, only the central scheduler can see all the resources in the entire cluster; each framework scheduler can only see the portion of resources allocated to it. YARN and Mesos, which we are familiar with, both adopt this architecture.

With the framework scheduler, we can execute different policies in different framework schedulers. This helps improve concurrency and resource utilization across the cluster. So overall, the two-level scheduler performs much better than the centralized scheduler.
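
To illustrate the division of labor, here is a rough sketch of the two-level flow: a central scheduler hands out coarse-grained slices of the cluster, and each framework scheduler places its own tasks within its slice. The class and method names are invented for the example and do not correspond to the real YARN or Mesos APIs.

```python
# A sketch of two-level scheduling: coarse-grained grants from the center,
# fine-grained placement inside each framework. Names are illustrative only.

class CentralScheduler:
    def __init__(self, total: int):
        self.free = total

    def offer(self, framework, amount: int) -> int:
        """Grant a coarse-grained slice of the cluster to one framework."""
        granted = min(amount, self.free)
        self.free -= granted
        framework.receive(granted)
        return granted

class FrameworkScheduler:
    def __init__(self, name: str):
        self.name = name
        self.share = 0          # the only resources this framework can see

    def receive(self, granted: int):
        self.share += granted

    def launch(self, tasks: list[int]):
        """Fine-grained placement: pack tasks into this framework's own share."""
        started = []
        for need in tasks:
            if need <= self.share:
                self.share -= need
                started.append(need)
        return started

central = CentralScheduler(total=100)
spark_like = FrameworkScheduler("spark-like")
central.offer(spark_like, 40)
print(spark_like.launch([10, 25, 20]))   # [10, 25]; the 20-unit task must wait
```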

But even this is not perfect, because the central scheduler uses a pessimistic concurrency strategy when scheduling: during allocation it strictly follows a predetermined order and locks resources to prevent conflicts between frameworks applying for resources. Using pessimistic locks obviously affects overall performance as well.

State-sharing scheduler

The architecture of the state-sharing scheduler is very similar to that of the two-level scheduler; it can roughly be understood as the two-level design with the central scheduler removed:


This architecture first appeared in Google’s Omega scheduling system, a predecessor of the now-popular Kubernetes. Its biggest difference from the two-level scheduler is that there is no central scheduler: every framework scheduler can directly see all the resources in the entire cluster, and when resources are needed the framework schedulers compete for them.

And unlike the central scheduler, the state-sharing strategy uses optimistic locking. Let me briefly explain the difference between optimistic and pessimistic locks. Pessimistic locking assumes the worst case: once a resource is acquired, it might be accessed or modified by another thread before we are done with it, so we add locks to prevent that.

Optimistic locking, on the other hand, is based on the optimistic assumption that the system can mostly run smoothly without resource contention. In other words, execute first, and if preemption or another conflict does occur afterwards, resolve it with a retry or some other mechanism.

Even in high-concurrency scenarios, resource conflicts are relatively low-probability events. Pessimistic locking clearly brings a lot of locking overhead, so a design based on optimistic locking gives the system stronger concurrency. But there is a price to pay: optimistic locking is not perfect. If a large number of conflicts do occur, the losing side often has to retry its task, which leads to unnecessary overhead and wasted resources.
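
Here is a sketch of how the optimistic, shared-state approach might look: each framework takes a snapshot of the whole cluster state, tries to commit a claim without holding a lock, and retries if someone else committed first. The version counter stands in for Omega-style transactional commits and is purely illustrative, not a real Omega API.

```python
# A sketch of optimistic concurrency over shared cluster state; all names
# and mechanisms are illustrative assumptions.

class SharedClusterState:
    def __init__(self, total: int):
        self.free = total
        self.version = 0     # bumped on every successful commit

    def snapshot(self):
        return self.free, self.version

    def try_commit(self, amount: int, seen_version: int) -> bool:
        # Optimistic check: commit only if nobody changed the state since
        # our snapshot and the resources are still available.
        if seen_version != self.version or amount > self.free:
            return False
        self.free -= amount
        self.version += 1
        return True

def acquire(state: SharedClusterState, amount: int, retries: int = 3) -> bool:
    for _ in range(retries):
        free, version = state.snapshot()
        if amount > free:
            return False                      # genuinely not enough resources
        if state.try_commit(amount, version):
            return True                       # committed without taking a lock
        # Conflict: someone else committed first; take a fresh snapshot and retry.
    return False
```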

In addition, since competition between frameworks is unconstrained, higher-priority frameworks may keep grabbing resources while lower-priority tasks starve. There is no way to guarantee fairness between tasks under this mechanism, which is an inevitable consequence of weakening the central scheduler.

Conclusion

If we review the three designs, we can see that they evolved in the order in which the central scheduler was progressively weakened. This is easy to understand: a strong central scheduler can maintain fairness across the whole cluster, but its low efficiency can make it the bottleneck of the cluster. The weaker the central scheduler, the more freedom the framework schedulers have and the more flexible the whole system’s scheduling becomes, which usually means better performance.

Someone once made a classic analogy: a centralized scheduler is a bit like a planned economy, where everything is planned by the state; the advantage is fairness, but there is little freedom or flexibility, and the whole country does not operate or develop efficiently. The two-level scheduler is a bit like a mix of big government and small market: the government still intervenes heavily, but there is a little more freedom. The state-sharing scheduler is a freely competitive model with small government and a big market, where intervention is almost eliminated and the government becomes the invisible hand, further improving flexibility and efficiency. But with less intervention, when risks arise they can turn into big problems.

Just as no social system is perfect, no cluster scheduling strategy is perfect either; each has its own strengths. We need to weigh them against our actual situation and find the approach best suited to our problem scenario. That is also why we study these underlying principles instead of stopping at how to use the tools.

That’s all for today’s article. If you feel you’ve gained something, please feel free to follow or share; your support means a lot to me.