Google is definitely ahead of the curve when it comes to managing data centers, because that’s where the scale is. Today we’ll take a look at how Borg, Google’s internal large-scale cluster management system, works. The original paper is “Large-scale cluster management at Google with Borg” (Verma et al., EuroSys 2015), available from the ACM Digital Library at https://dl.acm.org/doi/10.1145/2741948.2741964. Borg is the predecessor of Kubernetes; if you are interested in the specifics of Kubernetes, see here (Kubernetes).
Before we get to large cluster management systems, let’s talk about why we need them.
First, the most important job of systems like Borg and Kubernetes is to save the company as much money as possible without hurting the software that runs on its machines. If a company provisions servers per team or per component (for example, as far as I know, at Microsoft each group buys its own machines), then because different workloads need different kinds of resources, a certain amount of capacity is always wasted. For example, the resource usage of each machine in a company that doesn’t stack software at all might look something like this:
But if you allow people to stack workloads, it looks like this:
We saved the company half the money by using two fewer machines. For the average small Internet company this is not a big deal, but for a company like Google or AWS that spends billions of dollars a year on servers, this level of savings can be worth hundreds of millions, so big companies work on this kind of thing.
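As a toy illustration of the stacking idea (the workload names and numbers below are mine, not from the paper), two workloads with complementary resource profiles can share one machine instead of needing one each:

```go
package main

import "fmt"

// demand is a workload's resource need as a fraction of one machine.
// The numbers below are made up for illustration; they are not from the paper.
type demand struct {
	name string
	cpu  float64
	mem  float64
}

// fitsTogether reports whether two workloads can share one machine
// without exceeding its CPU or memory capacity.
func fitsTogether(a, b demand) bool {
	return a.cpu+b.cpu <= 1.0 && a.mem+b.mem <= 1.0
}

func main() {
	web := demand{name: "web-frontend", cpu: 0.7, mem: 0.2} // CPU-heavy
	cache := demand{name: "cache", cpu: 0.2, mem: 0.7}      // memory-heavy

	// Unstacked: each workload gets its own machine, i.e. 2 machines.
	// Stacked: the profiles are complementary, so one machine is enough.
	if fitsTogether(web, cache) {
		fmt.Printf("%s and %s fit on a single machine\n", web.name, cache.name)
	}
}
```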
Second, Google needs a lot of MapReduce jobs to optimize its search engine and to prepare data for machine learning training, so it needs a lot of machines that can run MapReduce. Running MapReduce on its own separate cluster would be a huge waste. Can we overlay MapReduce on all the machines and run it whenever a machine has spare resources? The paper shows that Google does exactly this, so the graph looks like this:
This maximizes the amount of CPU that MapReduce can squeeze out of the cluster. For a company like Google that essentially turns CPUs into money, this makes it more profitable. Borg also handles the scheduling that MapReduce needs, so systems like Borg include a scheduler.
Here is an overview of Borg’s overall architecture:
- Each cluster is called a cell and has a corresponding BorgMaster.
- Several layers of BorgMasters are drawn on purpose: each cell runs 5 BorgMaster replicas for high availability, but only one of the 5 is the actual leader. The five replicas share a Paxos-based store.
- BorgMaster deliberately has very few external dependencies. According to the paper, once a BorgMaster leader is elected it announces itself in Chubby so everyone knows who the leader is, but Borg itself does not otherwise depend on Chubby or on Google’s other internal services. This matters because availability is one of Borg’s most important properties.
- Each machine runs a Borglet, and the BorgMaster regularly tells the Borglet what should be running on that machine, which new tasks to start, and so on.
- One deliberate design choice is that the Borglet does not push to the BorgMaster; the BorgMaster polls the Borglets. Otherwise, after something like a sudden power outage, a huge number of Borglets would all try to contact the BorgMaster at the same time and knock it over with the flood of traffic.
- A link shard sits between the BorgMaster and the Borglets and forwards only the changes in each Borglet’s state to the BorgMaster, which cuts down both the communication volume and the BorgMaster’s processing cost. A minimal sketch of this diff-reporting idea follows.
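Here is a minimal sketch of the diff-reporting idea, assuming a hypothetical Borglet state of task-name → task-state; the types and function names are mine, not Borg’s actual API:

```go
package main

import "fmt"

// BorgletState is a hypothetical snapshot of what one Borglet reports:
// task name -> task state ("running", "failed", ...).
type BorgletState map[string]string

// LinkShard caches the last state it forwarded for each Borglet and
// forwards only the differences to the master, so the master never
// re-processes unchanged state.
type LinkShard struct {
	lastReported map[string]BorgletState // keyed by Borglet (machine) name
}

func NewLinkShard() *LinkShard {
	return &LinkShard{lastReported: make(map[string]BorgletState)}
}

// Diff returns only the entries that changed since the last report.
func (s *LinkShard) Diff(borglet string, current BorgletState) BorgletState {
	prev := s.lastReported[borglet]
	changed := BorgletState{}
	for task, state := range current {
		if prev[task] != state {
			changed[task] = state
		}
	}
	s.lastReported[borglet] = current
	return changed
}

func main() {
	shard := NewLinkShard()
	shard.Diff("machine-1", BorgletState{"web": "running", "batch": "running"})
	// Only the task whose state changed is forwarded to the BorgMaster.
	fmt.Println(shard.Diff("machine-1", BorgletState{"web": "running", "batch": "failed"}))
	// Output: map[batch:failed]
}
```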
The figure above shows the life cycle of each task. The main things users can do are:
- Submit a job
- Destroy a job (kill)
- Update a job
More state changes are driven internally by Borg. Updates to a job are generally rolled out incrementally within a cell. To destroy a job, Borg first sends SIGTERM and then, if the task doesn’t exit, SIGKILL.
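Here is a minimal sketch of that two-step kill; the helper name and the 10-second grace period below are my own placeholders, not Borg’s:

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// killGracefully mirrors the SIGTERM-then-SIGKILL policy described above:
// ask the task to exit, give it a grace period, then force-kill it.
func killGracefully(cmd *exec.Cmd, grace time.Duration) error {
	if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		return err // exited on its own within the grace period
	case <-time.After(grace):
		fmt.Println("grace period expired, sending SIGKILL")
		return cmd.Process.Kill()
	}
}

func main() {
	// A stand-in "task" that just sleeps for a minute.
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	if err := killGracefully(cmd, 10*time.Second); err != nil {
		fmt.Println("task ended:", err)
	}
}
```

With the job lifecycle covered, here are some other Borg concepts: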
- Alloc: a job can tell the system how many resources it needs reserved for it, so the system knows not to over-commit the machine too aggressively
- Priority: you can tell Borg the priority of your job, and higher-priority jobs can preempt (kick out) lower-priority ones
- Overall quota: the quota is used to decide whether to accept a job at all when it is submitted to Borg. Quota is set during capacity planning; it is a question of splitting resources between projects, not a software problem.
- How work is sent to Borg: each application discovers Borg through a file in Chubby that identifies each cell’s BorgMaster and its network information. After that it is ordinary RPC. A rough sketch of what a submission might carry is shown below.
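The sketch below shows a hypothetical job submission plus the quota check done at admission time; the field names, numbers, and the `admit` helper are my own illustration, not Borg’s actual job configuration language:

```go
package main

import "fmt"

// JobSpec is a rough, hypothetical shape of what a submission carries.
type JobSpec struct {
	Name     string
	User     string
	Priority int     // higher value can preempt lower
	CPU      float64 // requested CPU per task (fractions allowed)
	MemGiB   float64 // requested memory per task
	Replicas int     // number of tasks in the job
}

// quota maps a user to the total CPU they may request in this cell.
// It is decided during capacity planning, not by the scheduler.
var quota = map[string]float64{"search-team": 1000}

// used tracks CPU already admitted per user.
var used = map[string]float64{"search-team": 900}

// admit performs the quota check at submission time: a job that would
// exceed its user's quota is rejected before scheduling even starts.
func admit(j JobSpec) error {
	need := j.CPU * float64(j.Replicas)
	if used[j.User]+need > quota[j.User] {
		return fmt.Errorf("job %q rejected: quota exceeded", j.Name)
	}
	used[j.User] += need
	return nil
}

func main() {
	j := JobSpec{Name: "indexer", User: "search-team", Priority: 100,
		CPU: 0.5, MemGiB: 2, Replicas: 50}
	if err := admit(j); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("job admitted, handed to the scheduler")
}
```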
Now let’s talk about how Borg does scheduling. The scheduling algorithm has two parts: feasibility checking and scoring.
- The feasibility check identifies all the machines the task could run on, and scoring then picks the most suitable one
- Scoring boils down to finding the machine on which placing the work is least “expensive”, for example:
- Minimizing the number of other tasks that would need to be evicted
- Preferring machines that already have the packages the job needs
- Spreading tasks across failure domains (a concept that is also common in storage systems)
- To keep the system’s response time low, scores are cached and are not recalculated until the machine changes
- To further reduce computation, scores for identical requirements on the same machine are also served from the cache
- Feasibility is not checked against every machine; it is enough to randomly sample machines until enough feasible ones have been found
So in simple terms, each scheduling decision follows roughly this process:
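Here is a minimal sketch of that two-phase loop, assuming made-up machine attributes, scoring weights, and a tiny random sample; the real scorer balances far more signals than these:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Machine is a simplified view of a cell machine; the fields are illustrative.
type Machine struct {
	Name          string
	FreeCPU       float64
	FreeMemGiB    float64
	HasPackages   bool // job's packages already installed locally
	PreemptNeeded int  // lower-priority tasks that would have to be evicted
}

// Task is what we want to place.
type Task struct {
	CPU    float64
	MemGiB float64
}

// feasible is the first phase: can this machine hold the task at all?
func feasible(m Machine, t Task) bool {
	return m.FreeCPU >= t.CPU && m.FreeMemGiB >= t.MemGiB
}

// score is the second phase: smaller is better. The weights are made up.
func score(m Machine, t Task) float64 {
	s := float64(m.PreemptNeeded) * 10 // avoid evicting other tasks
	if !m.HasPackages {
		s += 5 // prefer machines that already have the job's packages
	}
	return s
}

// schedule samples machines randomly instead of scanning the whole cell,
// stops once it has enough feasible candidates, and picks the best score.
func schedule(machines []Machine, t Task, sampleTarget int) (Machine, bool) {
	rand.Shuffle(len(machines), func(i, j int) { machines[i], machines[j] = machines[j], machines[i] })
	best, found := Machine{}, false
	feasibleSeen := 0
	for _, m := range machines {
		if !feasible(m, t) {
			continue
		}
		if !found || score(m, t) < score(best, t) {
			best, found = m, true
		}
		feasibleSeen++
		if feasibleSeen >= sampleTarget {
			break
		}
	}
	return best, found
}

func main() {
	machines := []Machine{
		{Name: "m1", FreeCPU: 2, FreeMemGiB: 4, HasPackages: false, PreemptNeeded: 1},
		{Name: "m2", FreeCPU: 1, FreeMemGiB: 1, HasPackages: true, PreemptNeeded: 0},
		{Name: "m3", FreeCPU: 4, FreeMemGiB: 8, HasPackages: true, PreemptNeeded: 0},
	}
	if m, ok := schedule(machines, Task{CPU: 1.5, MemGiB: 2}, 2); ok {
		fmt.Println("placed on", m.Name)
	}
}
```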
In the beginning, Borg spread work evenly across all machines. The advantage of doing this is that load spikes are absorbed well, but the work ends up very fragmented, so the final algorithm makes a trade-off between fragmentation and leaving headroom for spikes.
Having gone through Borg’s system concepts, let’s look at how well Borg actually works. The paper mentions that other companies keep their batch jobs separate from prod, and it estimates how many more machines (and therefore how much more money) Google would need if batch jobs were segregated:
You can see that there is a huge increase in utilization. The paper also examined the effect of stacking on cycles per instruction (CPI) and concluded that the effect is not significant. It also studied whether it is better to have one large cell or several small cells, and concluded that the larger the cell, the better; the trade-off is between Borg’s availability and the resources a large cell saves. Borg has made the following optimizations to save Google even more resources:
- Fine-grained resource control: applications can request fractions of a CPU, because sometimes a whole core is simply not needed
- Resource reclamation: although a program asks for a certain amount of resources, Borg lowers that reservation based on its own estimate of actual usage; it accepts a somewhat higher risk of OOMs in exchange for more aggressive stacking. The effect is shown below.
You can see that Borg’s estimate is acceptable most of the time, and that the reclaimed share can be pushed further with more aggressive stacking (the less yellow, the better).
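A minimal sketch of the reclamation idea, assuming the reservation is estimated as recent peak usage plus a safety margin; the 20% margin and the numbers are my own placeholders, not values from the paper:

```go
package main

import "fmt"

// reclaimedReservation estimates how much of a task's requested resources
// actually needs to be set aside: recent peak usage plus a safety margin,
// capped at the original request. Everything above that can be reclaimed
// for batch work.
func reclaimedReservation(requested float64, recentUsage []float64) float64 {
	peak := 0.0
	for _, u := range recentUsage {
		if u > peak {
			peak = u
		}
	}
	reservation := peak * 1.2 // placeholder safety margin
	if reservation > requested {
		reservation = requested
	}
	return reservation
}

func main() {
	requested := 4.0                              // CPUs the job asked for
	usage := []float64{1.1, 1.4, 1.2, 1.5, 1.3}   // observed CPU usage samples
	res := reclaimedReservation(requested, usage) // 1.5 * 1.2 = 1.8
	fmt.Printf("requested %.1f CPUs, reserving %.1f, reclaimed %.1f for batch work\n",
		requested, res, requested-res)
}
```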
- For isolation, Google uses chroot and cgroups; VMs are only used in externally facing products such as GCE
- For per-machine usage control, Borg divides programs and resources into two categories:
- Each application is flagged as latency-sensitive or not.
- Each resource is classified as compressible or not: memory and disk are non-compressible resources, while CPU and network are compressible
- Latency-sensitive applications get first claim on compressible resources, while applications that do not need to respond quickly are throttled when they ask for compressible resources
- If a program that doesn’t need a fast response is starved for too long, it is killed
- If low-priority programs occupy too much of a non-compressible resource, they are also killed (a small sketch of this policy follows)
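The sketch below restates that policy as code; it is a simplified illustration of the rules above, not Borg’s actual enforcement logic:

```go
package main

import "fmt"

type resourceKind int

const (
	compressible    resourceKind = iota // CPU, network: can be throttled
	nonCompressible                     // memory, disk: only freed by killing
)

type task struct {
	name             string
	latencySensitive bool
}

// action is what the per-machine controller does when a task asks for more
// of a resource than is currently free.
func action(t task, kind resourceKind) string {
	switch {
	case kind == compressible && t.latencySensitive:
		return "grant: latency-sensitive tasks get compressible resources first"
	case kind == compressible:
		return "throttle: batch tasks wait for spare CPU/network"
	case t.latencySensitive:
		return "grant if possible; otherwise kill lower-priority tasks to free memory/disk"
	default:
		return "kill: low-priority tasks over-using non-compressible resources are destroyed"
	}
}

func main() {
	fmt.Println(action(task{name: "web", latencySensitive: true}, compressible))
	fmt.Println(action(task{name: "mapreduce", latencySensitive: false}, nonCompressible))
}
```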
To make this work, Borg’s engineers also tuned the Linux scheduler so that the system responds quickly enough to these scheduling decisions.
So after so many years of building and running Borg, what lessons can the Borg team share with us?
- Borg’s most basic unit is the job. There is no way to express anything more structured than a job, so many internal users had to hack up strange naming conventions to express topology. Kubernetes solves this with labels (see the sketch after this list).
- Borg assumes that a host has only one IP, which forces it to schedule ports as a resource. Kubernetes machines can now have multiple IP addresses (each pod gets its own).
- Borg made many optimizations for its biggest users, so its APIs are very complex, yet the average user barely touches most of them. Kubernetes does a better job here.
- The unit of resource allocation is very important; in Kubernetes it is called the Pod
- Cluster control is not just task management: Kubernetes also supports services, which cover load balancing and naming
- Good debugging tools are very important, and Kubernetes basically inherits all of Borg’s debugging capabilities
- The BorgMaster is effectively the kernel of the whole cluster, and it will need to take on many of the things an OS kernel does that Borg cannot do yet
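As a rough sketch of the label idea (a Go-flavoured illustration of selection by key/value labels, not the Kubernetes API), arbitrary labels are far more flexible than topology encoded in job names:

```go
package main

import "fmt"

// labels are arbitrary key/value pairs attached to a task, as in Kubernetes.
type labels map[string]string

// matches reports whether a task's labels satisfy a selector
// (every key in the selector must be present with the same value).
func matches(taskLabels, selector labels) bool {
	for k, v := range selector {
		if taskLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	tasks := map[string]labels{
		"frontend-1": {"app": "shop", "tier": "frontend", "env": "prod"},
		"frontend-2": {"app": "shop", "tier": "frontend", "env": "staging"},
		"db-1":       {"app": "shop", "tier": "db", "env": "prod"},
	}
	// Select all prod frontends, regardless of how the tasks are named.
	selector := labels{"tier": "frontend", "env": "prod"}
	for name, l := range tasks {
		if matches(l, selector) {
			fmt.Println("selected:", name)
		}
	}
}
```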
Google’s experience in managing large clusters is well worth learning from, and I hope Kubernetes keeps going further and further. Once again, the original paper is linked at the top of this post.