GitHub, the world’s largest code hosting platform and social network for programmers, has recently moved its services to Kubernetes through the joint work of its development and SRE teams. With tens of millions of users and hundreds of millions of code repositories, this was no small project. This article walks through the process of migrating GitHub to Kubernetes.


Over the past year, GitHub has gradually built out the infrastructure needed to run the Ruby on Rails application behind github.com and api.github.com. Today, all web and API requests are served by Kubernetes clusters running on GitHub’s metal cloud.

Migrating critical applications to Kubernetes is a very interesting challenge, which I will share with you today:

Why change? Freeing up SRE engineers

Until recently, the main Ruby on Rails application (github/github) ran much as it had eight years earlier: unicorn processes were managed by a Ruby process manager called God, running on Puppet-managed hosts. Likewise, ChatOps deployment worked as it did when it was first introduced: Capistrano established SSH connections to each front-end server, updated the code in place, and restarted the application; when peak request load exceeded the available front-end CPU capacity, GitHub’s SREs provisioned additional capacity and added it to the pool of active front-end servers.


Because of their complexity and the SRE team’s scheduling, new services could take days, weeks, or months to deploy, and over time some issues became apparent: this approach did not give engineers the flexibility they needed to keep building a world-class service.

Engineers needed a self-service platform on which to experiment, deploy, and scale new services, and the same platform had to serve the Ruby on Rails application, so that engineers or robots could allocate additional compute resources in seconds rather than hours or days to respond to changing requirements.

To meet these needs, the SRE, Platform, and Developer Experience teams began a joint project that now deploys the code behind github.com and api.github.com to Kubernetes clusters dozens of times a day.

Why Kubernetes?

To evaluate platform-as-a-service tools, GitHub took a close look at Kubernetes, a Google-originated open source project for automating the deployment, scaling, and management of containerized applications, and judged it on several fronts: the backing of a vibrant open source community, the first-run experience (which allowed a small cluster and an application to be deployed within the first few hours), and the wealth of design experience behind it.

The experiment quickly scaled up: a small project was set up to build a Kubernetes cluster and deployment tooling in support of the upcoming Hack Week, to gain some hands-on experience, and the internal reaction at GitHub to the project was positive.

Why start with github/github?

During the initial phase of the project, GitHub made a deliberate decision to target the migration of a critical workload: github/github. Many factors contributed to this decision, the most important being:

  • Deep knowledge of this application across GitHub would be useful during the migration process
  • Self-service capacity scaling tools were needed to handle continued growth
  • Development habits and patterns should apply equally to large applications and smaller services
  • Better isolation of the application across development, staging, production, Enterprise, and other environments was wanted
  • Migrating a critical, high-visibility workload would inspire confidence and encourage further adoption of Kubernetes at GitHub

Given that the workload chosen for migration was critical, a high level of operational confidence had to be established before it could serve actual production traffic.

A review lab for rapid iteration and confidence building

As part of the migration, some design and prototyping work was done to validate running the front-end servers on basic Kubernetes primitives such as Pods, Deployments, and Services. Some of this new design could be validated by running github/github’s existing test suite inside a container, but it remained to be seen how that container would behave as part of a larger set of Kubernetes resources, and it quickly became clear that an exploratory environment for testing Kubernetes and the services intended to run on it would be essential during the validation phase.
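To make those basic primitives concrete, here is a minimal sketch of what defining and creating a Deployment for unicorn pods could look like with client-go; the namespace, image name, labels, and replica count are placeholders for illustration, not values from the article.

```go
// sketch: define and create a Deployment for unicorn pods with client-go.
// All names and values here are illustrative placeholders.
package main

import (
	"context"
	"log"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// Load kubeconfig from the default location (an assumption for this sketch).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	labels := map[string]string{"app": "unicorn"}
	deploy := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "unicorn"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(3),
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "unicorn",
						Image: "registry.example.com/github/unicorn:latest", // placeholder image
						Ports: []corev1.ContainerPort{{ContainerPort: 8080}},
					}},
				},
			},
		},
	}

	// Create the Deployment in a hypothetical "review-lab" namespace.
	_, err = client.AppsV1().Deployments("review-lab").Create(
		context.Background(), deploy, metav1.CreateOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("deployment created")
}
```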

At the same time, project members observed that the existing github/github review pattern was beginning to show signs of strain: the rate of deploys grew in proportion to the number of engineers, and several additional deployment environments were being used as part of the process for validating pull requests to github/github. During peak working hours, the fixed number of fully featured deployment environments often constrained the pull-request deployment process. Engineers frequently asked to test more subsystems in “branch lab”, which allowed multiple engineers to deploy at the same time but launched only a single unicorn process per engineer, so it was useful only for testing API and UI changes. Because these needs overlapped substantially, the projects were combined, and work began on a new Kubernetes-backed deployment environment for github/github called review lab.

Several sub-projects were released in the course of building review lab:

  • Kubernetes cluster management with Terraform and kops, for clusters running in an AWS VPC
  • A set of Bash integration tests exercising short-lived Kubernetes clusters, used heavily at the beginning of the project to build confidence in Kubernetes
  • A Dockerfile for github/github
  • Enhancements to the internal CI platform to support building and publishing containers to the container registry
  • 50+ Kubernetes resources, written as YAML and checked into github/github
  • Enhancements to the internal deployment application to support deploying Kubernetes resources from a repository into a Kubernetes namespace, and creating Kubernetes secrets from an internal store
  • A service combining HAProxy and consul-template to route traffic from unicorn pods to an existing service that publishes service information
  • A service that reads Kubernetes events and sends abnormal ones to the internal error-tracking system (a sketch follows this list)
  • An RPC-compatible service called kube-me that exposes a limited set of kubectl commands to users via chat
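As an illustration of the event-forwarding service above, the sketch below watches cluster events with client-go and forwards anything that is not “Normal” to a reporting hook. The reportException function and the in-cluster configuration are assumptions for the example; the article does not describe the real service’s internals.

```go
// sketch: watch Kubernetes events and forward abnormal ones to an
// internal tracking system (reportException is a stand-in).
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func reportException(e *corev1.Event) {
	// Placeholder: the real service would send this to an internal
	// error-tracking endpoint.
	log.Printf("abnormal event: %s/%s %s: %s",
		e.InvolvedObject.Namespace, e.InvolvedObject.Name, e.Reason, e.Message)
}

func main() {
	cfg, err := rest.InClusterConfig() // assumes the service runs inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Watch events in all namespaces and forward anything that is not "Normal".
	w, err := client.CoreV1().Events(metav1.NamespaceAll).Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for ev := range w.ResultChan() {
		if e, ok := ev.Object.(*corev1.Event); ok && e.Type != corev1.EventTypeNormal {
			reportException(e)
		}
	}
}
```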

The end result was a chat-based interface for creating an isolated deployment of GitHub for any pull request; once a pull request passed all required CI jobs, its author could deploy it to review lab.


Review lab was a successful project that produced a great deal of experience and a number of deliverables. Beyond providing such an environment for engineers, it served as the proving ground and prototyping environment for the Kubernetes cluster design and for the design and configuration of the Kubernetes resources that now describe the github/github workload. After its release, it helped build confidence that engineers would be happy with an environment that let them experiment and solve problems in a self-service way.

Kubernetes on Metal Cloud

With the release of review lab, attention shifted to github.com itself. To meet the performance and reliability requirements of the flagship service, which depends on low-latency access to other data services, the Kubernetes infrastructure had to support the metal cloud running in GitHub’s physical data centers and POPs. Similarly, there were more than a dozen sub-projects:

  • For the container network, GitHub chose Calico, thanks in part to a timely and detailed post; it provided what was needed to ship a cluster quickly in IPIP mode, while leaving the flexibility to explore peering with the network infrastructure later
  • After reading Kubernetes The Hard Way by Kelsey Hightower many times, GitHub assembled a handful of manually provisioned servers into a temporary Kubernetes cluster that passed the integration tests
  • Small tools were also built to generate the required CA and configuration for each cluster, in a format consumable by the internal Puppet and secret systems
  • The configuration of two instance roles, Kubernetes nodes and Kubernetes apiservers, was handled in a way that lets a user provide the name of an already-configured cluster to join at provisioning time
  • A small Go service was built to consume container logs, append metadata in key/value format to each line, and send them to the host’s local syslog endpoint (a minimal sketch follows this list)
  • The internal load-balancing service was enhanced to support Kubernetes NodePort services
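The following is a minimal sketch of what such a log forwarder could look like: it reads container log lines from standard input, appends hypothetical key/value metadata (the field names and environment variables are invented for illustration), and writes each line to the local syslog daemon using the Go standard library. The real service’s interfaces are not described in the article.

```go
// sketch: consume container log lines, append key/value metadata,
// and forward them to the host's local syslog endpoint.
package main

import (
	"bufio"
	"fmt"
	"log"
	"log/syslog"
	"os"
)

func main() {
	// Hypothetical metadata; a real forwarder would derive these from the
	// container runtime or the kubelet.
	meta := map[string]string{
		"namespace": os.Getenv("POD_NAMESPACE"),
		"pod":       os.Getenv("POD_NAME"),
	}

	// Connect to the local syslog daemon.
	w, err := syslog.New(syslog.LOG_INFO|syslog.LOG_DAEMON, "container-logs")
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		// Append key/value metadata to each line before forwarding.
		for k, v := range meta {
			line = fmt.Sprintf("%s %s=%q", line, k, v)
		}
		if err := w.Info(line); err != nil {
			log.Printf("syslog write failed: %v", err)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```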

GitHub gained confidence that the same set of inputs used for the review lab clusters, run through the same tools, produced similar results. In less than a week, most of which was spent on internal communication and sequencing, this had a very significant impact on the migration: the entire workload could be moved from a Kubernetes cluster running on AWS to a cluster running inside a GitHub data center.

Raising the confidence bar

With a successful, repeatable pattern for Kubernetes clusters on GitHub’s metal cloud, it was time to build confidence that the “unicorn” deployment could replace the current pool of front-end servers. At GitHub, it is common practice for engineers and their teams to validate new functionality by creating a Flipper feature and opting into it as soon as it is viable. The deployment system was then enhanced to deploy a new set of Kubernetes resources to a github-production namespace in parallel with the existing production servers, and GLB was improved to route employee requests to a different backend based on a Flipper-influenced cookie, so that employees could opt into the experimental Kubernetes backend with a button in the mission control bar.
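GLB’s internals are not described in the article, but the idea of routing a request to an experimental backend based on an opt-in cookie can be sketched with a small reverse proxy. The cookie name and backend addresses below are invented for illustration.

```go
// sketch: route requests carrying an opt-in cookie to an experimental
// Kubernetes backend, and everyone else to the existing front-end pool.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func mustProxy(raw string) *httputil.ReverseProxy {
	u, err := url.Parse(raw)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	// Hypothetical backends; real GLB routing is far more involved.
	legacy := mustProxy("http://legacy-frontends.internal:8080")
	kube := mustProxy("http://kubernetes-frontends.internal:8080")

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// The opt-in cookie name is an assumption for this example.
		if c, err := r.Cookie("backend"); err == nil && c.Value == "kubernetes" {
			kube.ServeHTTP(w, r)
			return
		}
		legacy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```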





Cluster Groups

Some failure tests produced unexpected results; in particular, a test that simulated the failure of a single apiserver node disrupted the cluster in a way that negatively affected the availability of running workloads. Investigation of these results was inconclusive, but it helped identify that the disruption was likely an interaction between the various clients that connect to the Kubernetes apiserver (such as calico-agent, kubelet, kube-proxy, and kube-controller-manager) and the internal load balancer’s behavior during an apiserver node failure. Because a Kubernetes cluster had been observed degrading in a way that could disrupt service, the team began looking at running the critical application on multiple clusters in each site and automatically diverting requests from an unhealthy cluster to healthy ones.
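The article does not detail how that diversion works, but the core idea, periodically health-checking each cluster and keeping only healthy ones in the serving set, might look roughly like this; the health endpoints and check interval are assumptions.

```go
// sketch: keep a serving set of clusters by polling a health endpoint
// on each one and dropping clusters that fail the check.
package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

type clusterGroup struct {
	mu      sync.RWMutex
	healthy map[string]bool
}

func (g *clusterGroup) setHealthy(name string, ok bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.healthy[name] = ok
}

// Healthy returns the names of clusters currently eligible to serve traffic.
func (g *clusterGroup) Healthy() []string {
	g.mu.RLock()
	defer g.mu.RUnlock()
	var out []string
	for name, ok := range g.healthy {
		if ok {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	// Hypothetical per-cluster health endpoints.
	clusters := map[string]string{
		"cluster-a": "http://cluster-a.internal/healthz",
		"cluster-b": "http://cluster-b.internal/healthz",
	}
	group := &clusterGroup{healthy: map[string]bool{}}
	client := &http.Client{Timeout: 2 * time.Second}

	for range time.Tick(10 * time.Second) {
		for name, url := range clusters {
			resp, err := client.Get(url)
			ok := err == nil && resp.StatusCode == http.StatusOK
			if resp != nil {
				resp.Body.Close()
			}
			group.setHealthy(name, ok)
		}
		// A real system would reconfigure the load balancer here.
		log.Printf("serving clusters: %v", group.Healthy())
	}
}
```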

Similar work was already on GitHub’s roadmap to allow the application to be deployed to multiple independently operated sites, and other positive trade-offs of the approach tipped the balance. The final design uses the deployment system’s support for multiple “partitions” and enhances it with cluster-specific configuration via a custom Kubernetes resource annotation, forgoing existing federation solutions in favor of an approach that reuses the business logic already present in GitHub’s deployment system.
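The annotation key below is invented; the article does not give the real one. The sketch simply shows how a deployment tool could read a cluster-partition hint from a Kubernetes resource’s metadata.

```go
// sketch: pick a target cluster partition from a (hypothetical)
// annotation on a Kubernetes resource's metadata.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// partitionAnnotation is an invented key used only for illustration.
const partitionAnnotation = "deploy.example.com/cluster-partition"

// targetPartition returns the partition named in the annotation, or a fallback.
func targetPartition(d *appsv1.Deployment, fallback string) string {
	if p, ok := d.Annotations[partitionAnnotation]; ok && p != "" {
		return p
	}
	return fallback
}

func main() {
	d := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name: "unicorn",
			Annotations: map[string]string{
				partitionAnnotation: "production-partition-1",
			},
		},
	}
	fmt.Println(targetPartition(d, "default-partition"))
}
```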

From 10% to 100%

With cluster groups in place, GitHub gradually converted front-end servers into Kubernetes nodes and increased the share of traffic routed to Kubernetes; working with several other engineering teams, the front-end conversion was completed in just over a month while maintaining expected performance and an acceptable error rate throughout.


GitHub also ran failure tests that simulated kernel errors of the kind triggered by echo c > /proc/sysrq-trigger, which proved a useful addition.

In the future

This article was inspired by the experience of migrating this application to Kubernetes, and more migrations are planned: although the scope of the first migration was intentionally limited to stateless workloads, there is great interest in trying patterns for running stateful services on Kubernetes.

In the final phase of the project, GitHub also released a workflow for deploying new applications and services to similar Kubernetes clusters. In the months since, engineers have deployed dozens of applications to these clusters, each of which would previously have required configuration management and provisioning support from SREs. With a self-service application provisioning workflow in place, SRE can spend more time delivering infrastructure products to the rest of the engineering organization in support of best practices, building a faster and more resilient GitHub experience for everyone.

Original link: Kubernetes at GitHub