OpenAI has scaled its Kubernetes clusters to 7,500 nodes, providing a scalable infrastructure for large neural network models such as GPT-3, CLIP, and DALL·E, as well as for small, fast-moving experimental research. It is rare to scale a single Kubernetes cluster to this size, and doing so required some careful adjustments, but the benefit is a simple infrastructure that lets our machine learning research teams iterate faster and scale up without changing their code.

Since our last article on scaling to 2,500 nodes, we have continued to grow the infrastructure to meet researchers' needs and have learned many more lessons in the process. This article summarizes those lessons so that others in the Kubernetes community can benefit from them, and ends with problems we still need to address.

1. Workload

Before going further: the applications and hardware we run on our Kubernetes clusters are quite different from what you might find at a typical company. The problems we face, and the solutions we chose, may or may not match your own situation.

A large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node; this lets GPUs communicate with each other over NVLink and talk to the NIC via GPUDirect. So for many of our workloads a single pod occupies an entire node, and scheduling does not involve any contention over NUMA, CPU, or PCIe resources. Our current clusters have full bisection bandwidth, so we also do not need to account for network topology. As a result, the pressure on the scheduler is relatively low.

That said, a new job may require hundreds of pods to be scheduled at once, which produces spiky load on kube-scheduler.

Our largest jobs run MPI (for parallel computing), and all pods within a job participate in a single MPI communicator. If any pod dies, the entire job halts and must be restarted. Jobs checkpoint their state periodically, and on restart they resume from the most recent checkpoint.

We do not rely much on Kubernetes for load balancing. Our layer 7 traffic is minimal, since there is no A/B testing, blue-green deployment, canary release, and so on. Pods communicate directly with the MPI processes in other pods over SSH, rather than through service endpoints. Service discovery is limited: we only perform a lookup once, at job startup, when a pod first joins MPI.

Most jobs interact with some form of blob storage, typically streaming shards of a dataset directly from blob storage or caching them to fast local disk. We also use some PersistentVolumes, but blob storage scales much better and does not require slow mount and unmount operations.

While the supercomputing team works hard to provide production-grade compute infrastructure, the applications running on the clusters are short-lived and their developers iterate quickly. New use cases can emerge at any time, which forces us to anticipate where things are heading and make appropriate trade-offs.


2. Network

As the number of nodes and pods in the cluster grew, Flannel had trouble keeping up with the required throughput. We switched to host-level pod networking, using the native IP configuration of our Azure VMSSes and the associated CNI plugins. This gives us host-level network throughput for our pods.

Another reason we switched to alias-based IP addressing is that on our largest clusters we may have roughly 200,000 IP addresses in use at any one time. When we tested route-based pod networking, we ran into significant limits on the number of routes we could use.

Avoiding encapsulation places greater demands on the underlying SDN or routing engine, but it keeps our network setup simple. VPNs or tunnels can be added without any extra adapters. We do not need to worry about packet fragmentation caused by some part of the network having a lower MTU. Network policy and traffic monitoring are straightforward; there is no ambiguity about the source and destination of a packet.

We use iptables tagging on the host to track network resource usage per namespace and per pod, which lets researchers visualize their network usage. Because many of our experiments have distinctive internet and intra-pod communication patterns, this is very useful for investigating where bottlenecks might occur.

An iptables mangle rule can be used to mark any packet that matches certain criteria. These are the rules we use to detect whether traffic is internal or internet-bound. The FORWARD rules cover traffic from pods, while the INPUT and OUTPUT rules cover traffic from the host:

iptables -t mangle -A INPUT ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A FORWARD ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A OUTPUT ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"
iptables -t mangle -A FORWARD ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"

Once marked, iptables maintains counters that track the bytes and packets matching each rule, and we can inspect the counters with iptables itself:

% iptables -t mangle -L -v
Chain FORWARD (policy ACCEPT 50M packets, 334G bytes)
 pkts bytes target     prot opt in     out     source               destination
....
1253K  555M            all  --  any    any     anywhere            !10.0.0.0/8           /* iptables-exporter openai traffic=internet-out */
1161K 7937M            all  --  any    any    !10.0.0.0/8           anywhere             /* iptables-exporter openai traffic=internet-in */

We use a Prometheus exporter, iptables-exporter, to feed these counters into our monitoring system.

One special aspect of our network model is that we fully expose the node, pod, and service network CIDR ranges to researchers. We have a hub-and-spoke network model and use the native node and pod CIDR ranges to route that traffic. Researchers connect to a central hub, from which they can reach any individual cluster. The clusters themselves, however, cannot talk to one another. This keeps clusters isolated from each other, with no cross-cluster dependencies that could break that isolation.

We use NAT on the hosts to translate the service network CIDR for traffic coming from outside the cluster. This setup gives our researchers a great deal of flexibility in how they conduct experiments and which network configurations they choose.
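As a rough illustration of the idea (a minimal sketch, not our actual rules; the 10.32.0.0/16 service CIDR is a placeholder, while 10.0.0.0/8 is the internal range used in the iptables rules above):

# Masquerade traffic arriving from outside the internal range (10.0.0.0/8) that is
# destined for the (hypothetical) service CIDR, so replies return through this host.
iptables -t nat -A POSTROUTING ! -s 10.0.0.0/8 -d 10.32.0.0/16 -j MASQUERADE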


3. API Server

The Kubernetes API servers and etcd clusters are critical components of a healthy cluster, so we pay special attention to the stress on these systems. We use the Grafana dashboards provided by the kube-prometheus project, along with additional in-house dashboards. We have found that alerting on the rate of HTTP errors from the API servers (status 429, 5xx, and so on) works well as a high-level signal of problems.
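As a sketch of that kind of alert (not our actual rule; the metric name apiserver_request_total, the 5% threshold, and the file name are assumptions), a Prometheus alerting rule can be written and validated like this:

# Hypothetical Prometheus alerting rule for API server 429/5xx rates,
# validated with promtool (shipped with Prometheus).
cat > apiserver-alerts.yaml <<'EOF'
groups:
  - name: apiserver
    rules:
      - alert: KubeAPIServerErrors
        # Fraction of API server requests answered with 429 or 5xx over 10 minutes.
        expr: |
          sum(rate(apiserver_request_total{code=~"429|5.."}[10m]))
            /
          sum(rate(apiserver_request_total[10m])) > 0.05
        for: 10m
        labels:
          severity: warning
EOF
promtool check rules apiserver-alerts.yaml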

Although most people run the API servers inside the Kubernetes cluster, we chose to run them outside the cluster. Both the etcd and API server services run on their own dedicated nodes. Our largest clusters run 5 API servers and 5 etcd nodes to spread the load and minimize the impact if one of them fails. We have had no notable trouble with etcd since splitting Kubernetes Events out into a separate etcd cluster, as described in our last post. The API servers are stateless and generally easy to run in a self-healing instance group or scale set. We have not yet attempted to build any similar self-healing automation for the etcd clusters.

The API servers consume a considerable amount of memory, and usage grows roughly linearly with the number of nodes in the cluster. For our 7,500-node cluster, we observe peak usage of up to 70GB per API server.

Another large source of strain on the API servers is the WATCHes opened against the API by components such as kubelet and node-exporter. These WATCHes fire whenever a node is added to or removed from the cluster. And because each node typically watched the kubelet service via kube-proxy, the bandwidth required for these responses grew quadratically with the number of nodes, sometimes reaching 1GB/s or more. EndpointSlices, launched in Kubernetes 1.17, brought a huge optimization and reduced this load by a factor of 1,000.

In general, we pay close attention to any API server request whose volume scales with the size of the cluster. We try to avoid having DaemonSets talk to the API server directly. Where every node really does need to watch for changes, introducing an intermediate caching service, such as the Datadog Cluster Agent, appears to be a good pattern for avoiding cluster-wide bottlenecks.

As our clusters have grown, we do less actual autoscaling of them. But we have occasionally run into trouble when autoscaling too much at once: many requests are generated when a new node joins the cluster, and adding hundreds of nodes at a time can overload the API servers.


4. Monitoring

We use Prometheus to collect metrics and Grafana for graphs, dashboards, and alerts. We started by deploying the kube-prometheus project, which collects a wide variety of metrics and provides good dashboards for visualization. Over time, we have added many dashboards, metrics, and alerts of our own.

As the number of nodes increased, we struggled with the sheer quantity of metrics Prometheus was collecting. Although kube-prometheus exposes a lot of useful data, some of it we never actually looked at. We use Prometheus's own facilities to drop some of these metrics from being collected.
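As one hedged example of how such pruning can be done (not necessarily how we do it), Prometheus supports metric_relabel_configs rules with action: drop in the scrape configuration; the job name, target, and metric regex below are made up for illustration:

# Hypothetical prometheus.yml fragment: drop a metric family we never query.
cat > prometheus-example.yml <<'EOF'
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: node-exporter-example      # illustrative job name
    static_configs:
      - targets: ['localhost:9100']
    metric_relabel_configs:
      # Drop a (made-up) high-cardinality metric family at ingestion time.
      - source_labels: [__name__]
        regex: 'node_example_unused_metric_.*'
        action: drop
EOF
promtool check config prometheus-example.yml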

For quite a while we wrestled with Prometheus consuming more and more memory until it finally crashed with an out-of-memory error (OOM). This seemed to happen even after we gave it a very large amount of memory. Even worse, when it crashed, it took a long time on startup before it was usable again.

We eventually traced the source of these OOMs to an interaction between Grafana and Prometheus, in which Grafana queried Prometheus's /api/v1/series endpoint. That endpoint can return an enormous number of series, which caused memory to grow without bound. We patched Prometheus to run this handler inside a Context with a timeout, which fixed the problem.
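/api/v1/series is Prometheus's standard endpoint for listing the series that match a selector. For reference, a bounded query of that endpoint from the command line can look like the following sketch (the address, selector, and time range are placeholders):

# Query the series endpoint with an explicit matcher and time window,
# and give up client-side after 30 seconds. Host and selector are examples.
curl -G --max-time 30 'http://localhost:9090/api/v1/series' \
  --data-urlencode 'match[]=up{job="node-exporter-example"}' \
  --data-urlencode 'start=2021-01-01T00:00:00Z' \
  --data-urlencode 'end=2021-01-01T01:00:00Z'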

Although Prometheus now crashes far less often, WAL replay remains an issue on the occasions when we do need to restart it: it usually takes a long time to replay all of the write-ahead log before Prometheus starts collecting new metrics and serving queries. With help from Robust Perception, we found that setting GOMAXPROCS=24 was a significant improvement. Prometheus tries to use every core during WAL replay, and on servers with a very large number of cores the resulting contention hurts performance.
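Since GOMAXPROCS is simply an environment variable read by the Go runtime, applying the workaround is a one-line change; a sketch, with illustrative paths and flags:

# Limit the Go runtime to 24 OS threads during WAL replay; paths and flags are examples.
GOMAXPROCS=24 /usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus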


5. Health checks

With a cluster this large, automation is of course required to detect and remove nodes that are misbehaving. Over time, we have built up a comprehensive set of health check systems.

A. Passive health checks

Some health checks are passive and always running on every node. They monitor basic system resources such as network reachability, bad or full disks, and GPU errors. GPUs exhibit problems in many different ways, but a common one is an "uncorrectable ECC error." NVIDIA's Data Center GPU Manager (DCGM) tools make it much easier to query for this and a number of other "Xid" errors. One way we track these errors is via dcgm-exporter, which feeds the metrics into Prometheus, our monitoring system, as the DCGM_FI_DEV_XID_ERRORS metric. In addition, the NVML Device Query API exposes more detailed information about the health and operation of a GPU.
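As a quick illustration of checking for these conditions by hand (a sketch; the Prometheus address is a placeholder and the query is only an example):

# Inspect ECC status directly on a node with the standard NVIDIA tool.
nvidia-smi -q -d ECC

# Or ask Prometheus whether dcgm-exporter has reported any Xid errors
# (the Prometheus address is a placeholder).
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=DCGM_FI_DEV_XID_ERRORS > 0'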

Once we detect errors, they can usually be fixed by resetting the GPU or system.

Another form of health check tracks maintenance events from the upstream cloud provider. Most major cloud providers expose a way to know whether the current VM is due for an upcoming maintenance event that will cause a disruption, for example to apply a patch or to replace hardware.

These passive monitors run on all nodes. If a health check starts failing, the node is automatically cordoned, and for more serious health check failures we also attempt to evict its pods. Whether eviction proceeds is up to the pods themselves, which can be configured with a Pod Disruption Budget to decide whether to allow it.
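For completeness, a Pod Disruption Budget is a plain Kubernetes object; a minimal sketch follows, in which the name, label selector, and minAvailable value are made up, and apiVersion policy/v1 assumes Kubernetes 1.21 or newer:

# Hypothetical PDB refusing voluntary evictions for pods of a (made-up) training job.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-training-job-pdb
spec:
  minAvailable: "100%"          # disallow voluntary evictions for this selector
  selector:
    matchLabels:
      app: example-training-job
EOF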

B. Active GPU tests

Unfortunately, not all GPU problems manifest as error codes visible through DCGM. We have built up our own library of tests that exercise the GPU to catch additional problems and make sure the hardware and drivers behave as expected. These tests cannot run in the background; they need exclusive use of the GPU for several seconds or minutes.

All nodes join the cluster with a "preflight" taint and label. The taint prevents regular pods from being scheduled on the node, and a DaemonSet is configured to run preflight test pods on all nodes carrying the label. When the test completes successfully, it removes the taint and label itself, and the node becomes available for general use.
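The mechanics here are ordinary Kubernetes taints and labels; the commands below sketch the idea, with illustrative key names rather than our real ones:

# Mark a freshly joined node as needing preflight checks (key names are examples).
kubectl taint nodes "$NODE" example.com/preflight=pending:NoSchedule
kubectl label nodes "$NODE" example.com/preflight=pending

# The equivalent of what the preflight test does on success to release the node.
kubectl taint nodes "$NODE" example.com/preflight:NoSchedule-
kubectl label nodes "$NODE" example.com/preflight-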

We also run these tests periodically over the lifetime of a node, as a CronJob that can land on any available node in the cluster.


6. Resource quotas and usage

As our clusters grew, researchers began to find it difficult to get the full capacity allocated to them. Traditional job scheduling systems have many features for running work fairly across teams, which Kubernetes does not. We took inspiration from those scheduling systems and built several of their capabilities in a Kubernetes-native way.

Taints and tolerations

We run one service in each cluster, team-resource-manager, which has several functions. Its data source is a ConfigMap that specifies (node selector, team label to apply, allocation amount) tuples for every research team that has capacity in a given cluster. It reconciles the appropriate number of nodes with a taint of the form openai.com/team=teamname:NoSchedule.
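Applied by hand, a taint of that form looks like the following sketch; the node-selector label is an assumption:

# Taint every node currently matching a (made-up) pool label so that only pods
# tolerating the team taint can be scheduled there.
kubectl taint nodes -l example.com/pool=research-a \
  openai.com/team=teamname:NoSchedule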

team-resource-manager also runs an admission webhook that applies the corresponding toleration to each submitted job, based on the submitter's team membership. Using taints lets us constrain the Kubernetes pod scheduler flexibly, for example by allowing lower-priority pods to tolerate any team's taint, which lets teams share capacity without heavyweight coordination.

CPU & GPU balloons

In addition to using cluster-autoscaler to dynamically scale our VM-backed clusters, we also use it to remove and re-add unhealthy nodes within the cluster. To do this, we set the cluster's minimum size to zero and its maximum size to the capacity available. However, if cluster-autoscaler sees idle nodes, it will try to scale down to only the capacity that is needed. For several reasons (VM spin-up latency, pre-allocated costs, the API server impact mentioned above), this idle scale-down is not ideal.

So we introduced a balloon Deployment for both our CPU and GPU hosts. This Deployment contains a low-priority pod spec with the replica count set to the maximum size. These pods occupy resources within the nodes, so the autoscaler does not consider the nodes idle. But because they are low priority, the scheduler can evict them immediately to make room for real work. (We chose a Deployment rather than a DaemonSet to avoid the DaemonSet being considered idle workload on a node.)

One thing to note is that we use pod anti-affinity to ensure the balloon pods are evenly distributed across nodes. Performance problems in the scheduler's anti-affinity handling have been corrected since Kubernetes 1.18.
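A minimal sketch of what a balloon like this can look like follows; the priority class name, replica count, image, and GPU request are illustrative assumptions rather than our actual configuration:

# Illustrative low-priority "balloon" that keeps the autoscaler from scaling down.
kubectl apply -f - <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon-priority          # made-up name
value: -1000                      # far below the default priority of 0
globalDefault: false
description: "Placeholder pods evicted whenever real workloads need the space."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-balloon               # made-up name
spec:
  replicas: 4                     # in practice, sized to the node pool's maximum
  selector:
    matchLabels:
      app: gpu-balloon
  template:
    metadata:
      labels:
        app: gpu-balloon
    spec:
      priorityClassName: balloon-priority
      terminationGracePeriodSeconds: 0
      # Spread balloons across nodes so each node is "occupied".
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: gpu-balloon
                topologyKey: kubernetes.io/hostname
      containers:
        - name: balloon
          image: registry.k8s.io/pause:3.9
          resources:
            limits:
              nvidia.com/gpu: 8   # illustrative: claim the node's GPUs
EOF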


7. Gang scheduling

Our experiments typically involve one or more StatefulSets, each operating a different portion of the training job. For the optimizers, researchers need all pods of the StatefulSet to be scheduled before any training can happen (since we often use MPI to coordinate between optimizer members, and MPI is sensitive to changes in group membership).

By default, however, Kubernetes will not necessarily prioritize fulfilling all the pods of one StatefulSet over another. For example, if two experiments each request 100% of the cluster's capacity, Kubernetes might schedule only half of each experiment's pods, producing a scheduling deadlock in which neither experiment can make progress.

We tried implementing a custom scheduler, but ran into edge cases that conflicted with how regular pods were scheduled. Kubernetes 1.18 introduced a plugin framework for the core Kubernetes scheduler, which makes it much easier to add this kind of functionality natively. We recently adopted the Coscheduling plugin to solve this problem.
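For reference, the Coscheduling plugin from the Kubernetes scheduler-plugins project expresses the "all or nothing" requirement through a PodGroup object with a minimum member count; the sketch below assumes the scheduling.x-k8s.io/v1alpha1 API version, and the group name and minMember value are made up:

# Hypothetical PodGroup requiring all 128 optimizer pods to be schedulable together.
kubectl apply -f - <<'EOF'
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: example-optimizer
spec:
  minMember: 128
EOF
# Pods opt in by referencing the group via a label defined by the plugin
# (the exact label key depends on the plugin release in use).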


8. Conclusion

There are still a number of problems to address as we continue to scale our Kubernetes clusters. Some of them include:

A. Monitoring metrics

At our scale, Prometheus's built-in TSDB storage engine is slow to compact and takes a long time to replay the WAL (write-ahead log) on every restart, which has caused us a lot of trouble. We are migrating to other storage and query engines that are compatible with Prometheus. Look out for a future blog post on how that goes!

B. Pod network traffic shaping

As we scale up the clusters, each pod is calculated to have a certain amount of internet bandwidth available, but the aggregate traffic of all pods has become substantial. We need to introduce traffic shaping to prevent network storms, traffic flooding, and similar problems.
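One common building block for this is the CNI bandwidth plugin, which rate-limits a pod based on annotations; the sketch below assumes that plugin is enabled in the CNI configuration, and the pod name, image, and limits are examples only:

# Illustrative pod with per-pod ingress/egress rate limits via the CNI bandwidth plugin.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: shaped-example
  annotations:
    kubernetes.io/ingress-bandwidth: "1G"   # example limits, not our real values
    kubernetes.io/egress-bandwidth: "1G"
spec:
  containers:
    - name: main
      image: registry.k8s.io/pause:3.9
EOF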

We have found Kubernetes to be an exceptionally flexible platform for our research needs. It has scaled to meet the most demanding workloads we have put on it. Although there are still many areas for improvement, OpenAI's supercomputing team will continue to explore how far Kubernetes can scale.


By Benjamin Chess and Eric Sigler


References

  • Scaling Kubernetes to 7,500 nodes (https://openai.com/blog/scaling-kubernetes-to-7500-nodes/)