About the author: Jesse Newland is GitHub’s lead site reliability engineer.

Over the past year, GitHub has gradually evolved the infrastructure that runs the Ruby on Rails application responsible for github.com and api.github.com. We recently reached a major milestone: all web and API requests are now served by containers running in Kubernetes clusters deployed on our bare-metal cloud.

Migrating a critical application to Kubernetes was an interesting challenge, and we are happy to share some of what we learned along the way.

Why change?

Prior to this migration, our main Ruby on Rails application (which we call github/github) was configured much as it was eight years ago: Unicorn processes were managed by a Ruby process manager called God, running on servers managed by Puppet. Similarly, our ChatOps deployment looked a lot like it did when it was first introduced: Capistrano established an SSH connection to each front-end server, updated the code in place, and restarted the application processes. When peak request load exceeded the available front-end CPU capacity, GitHub site reliability engineers (SREs) would provision additional capacity and add it to the pool of active front-end servers.

Previous Unicorn service design

While the underlying approach to our production environment hasn’t changed much over the years, GitHub itself has changed a lot: new features, a larger software community, more GitHub employees, and many more requests per second. As we scaled, this approach exposed new problems. Many teams wanted to extract the functionality they were responsible for from this large application into smaller services that could run and be deployed independently. As the number of services we ran grew, the SRE team found itself supporting similar configurations for dozens more applications, increasing the share of time we spent on server maintenance, provisioning, and other work not directly related to improving the overall GitHub experience. New services took days, weeks, or even months to deploy, depending on their complexity and the SRE team’s availability.

Over time, it became clear that this approach did not give our engineers the flexibility they needed to keep building world-class services. Our engineers needed a self-service platform on which to experiment with, deploy, and scale new services. We also needed that same platform to meet the requirements of the core Ruby on Rails application, so that engineers and/or robots could respond to changing demand by allocating additional compute resources in seconds rather than hours, days, or longer.

To meet these requirements, the SRE team, the Platform team, and the Developer Experience team began a joint project that took us from an initial evaluation of container orchestration platforms to where we are today: deploying the code that powers github.com and api.github.com to Kubernetes clusters dozens of times per day. This article provides an overview of the work involved in that journey.

Why Kubernetes?

We initially evaluated the existing landscape of platform-as-a-service tools and then took a closer look at Kubernetes, a Google-originated project that bills itself as an open source system for automating the deployment, scaling, and management of containerized applications. Several qualities of Kubernetes stood out from the other platforms we evaluated: the vibrant open source community supporting the project, a good first-run experience (which allowed us to deploy a small cluster and an application within the first few hours of our initial experiment), and the wealth of information available.

These experiments quickly expanded in scope: a small project was assembled to build a Kubernetes cluster and deployment tooling in support of an upcoming Hack Week, in order to get a hands-on feel for the platform. Not only did we feel good about the project, the engineers who used it gave positive feedback, so we expanded the scope of the experiment and started planning a larger rollout.

Why start with Github/Github?

Early in this phase, we made a deliberate decision to target a critical workload for migration: github/github. Many factors contributed to this decision, but a few stood out:

  • We knew that the deep knowledge of this application throughout GitHub would be useful during the migration.

  • We needed self-service capacity expansion tooling to handle continued growth.

  • We wanted to make sure the habits and patterns we developed were suitable for large applications as well as smaller services.

  • We wanted to better insulate the application from the differences between development, staging, production, Enterprise, and other environments.

  • We knew that migrating a critical, highly visible workload would encourage further adoption of Kubernetes at GitHub.

Given the importance of the workload we chose to migrate, we needed to build a high level of operational confidence before handling any production traffic.

Rapid iteration and building confidence with a review lab

As part of this migration, we designed, prototyped, and validated a replacement for the service currently provided by our front-end servers, using Kubernetes primitives such as Pods, Deployments, and Services. Some validation of this new design could be performed by running github/github’s existing test suites in a container rather than on a server configured like a front-end server, but we also needed to observe how the container behaved as part of a larger set of Kubernetes resources. It quickly became clear that an environment supporting exploratory testing of Kubernetes and the service we intended to run on it would be essential during the validation phase.
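
To make these building blocks concrete, here is a minimal sketch of the kind of Deployment and Service that could describe a Unicorn-style web workload. The names, image, port numbers, and replica count are hypothetical illustrations, not GitHub’s actual configuration.

```yaml
# Hypothetical sketch: a Deployment keeps N identical Unicorn pods running,
# and a Service gives them a single stable address inside the cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: unicorn                 # hypothetical name
  labels:
    app: unicorn
spec:
  replicas: 4                   # illustrative replica count
  selector:
    matchLabels:
      app: unicorn
  template:
    metadata:
      labels:
        app: unicorn
    spec:
      containers:
        - name: unicorn
          image: registry.example.com/github/unicorn:latest  # hypothetical image
          ports:
            - containerPort: 8080                            # illustrative port
---
apiVersion: v1
kind: Service
metadata:
  name: unicorn
spec:
  selector:
    app: unicorn
  ports:
    - port: 80
      targetPort: 8080
```

Applying both resources with kubectl apply -f yields a self-healing set of pods behind a stable virtual IP, which is roughly the role the existing front-end servers played.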

Around the same time, we observed that the existing patterns for exploratory testing of github/github pull requests were beginning to show signs of growing pains. As deployment speed increased and the number of engineers working on the project grew, so did utilization of the handful of additional deploy environments used to validate pull requests to github/github. The small number of fully featured deploy environments was usually fully booked during peak working hours, which slowed the process of deploying a pull request. Engineers frequently requested the ability to test more of the various production subsystems on “branch lab.” While branch lab allowed concurrent deployment by many engineers, it only spun up a single Unicorn process for each, which meant it was only useful for testing API and UI changes. These needs overlapped enough for us to combine the projects and start building a new Kubernetes-powered deploy environment for github/github called review lab.

In the process of building review lab, we shipped a handful of sub-projects, each of which could probably be covered in a blog post of its own. Along the way, we delivered:

  • A Kubernetes cluster running in an AWS VPC, managed using a combination of Terraform and kops.

  • A set of Bash integration tests that exercised ephemeral Kubernetes clusters, used heavily at the beginning of the project to build confidence in Kubernetes.

  • A Dockerfile for github/github.

  • Enhancements to our internal continuous integration (CI) platform to support building containers and publishing them to a container registry.

  • YAML representations of 50+ Kubernetes resources, checked into github/github.

  • Enhancements to our internal deployment application to support deploying Kubernetes resources from a repository into a Kubernetes namespace, as well as creating Kubernetes Secrets from our internal secret store (see the sketch after this list).

  • A service combining HAProxy and consul-template to route traffic from Unicorn pods to the existing services that publish their service information there.

  • A service that reads Kubernetes events and sends abnormal ones to our internal error tracking system.

  • A chatops-rpc-compatible service called kube-me that exposes a limited set of kubectl commands to users via chat.
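
To illustrate the secret-creation step mentioned above, here is a hedged sketch of the kind of Kubernetes Secret the deployment tooling could create in a lab’s namespace from values fetched out of an internal secret store. The name, namespace, and keys are hypothetical, not GitHub’s actual configuration.

```yaml
# Hypothetical sketch: a Secret materialized in the target namespace from an
# internal secret store; pods can consume it as environment variables or as
# a mounted volume.
apiVersion: v1
kind: Secret
metadata:
  name: unicorn-env             # hypothetical name
  namespace: review-lab-1234    # hypothetical per-lab namespace
type: Opaque
stringData:
  SECRET_KEY_BASE: "<value fetched from the internal secret store>"
  DATABASE_URL: "<value fetched from the internal secret store>"
```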

The end result is a chat-based interface for creating an isolated deployment of GitHub for any pull request. Once a pull request passed all required CI jobs, a user could deploy it to review lab with a single chat command.

Like branch lab before it, each lab is cleaned up one day after its last deployment. Since each lab is created in its own Kubernetes namespace, cleanup is as simple as deleting the namespace, which our deployment system performs automatically when needed.
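
Because each lab is just a namespace, the entire lab can be described, and later destroyed, as a single unit. The following is a hedged sketch of that idea; the name and label are hypothetical.

```yaml
# Hypothetical sketch: one namespace per review lab. Deleting the namespace
# (for example: kubectl delete namespace review-lab-1234) cascades to every
# Deployment, Service, and Secret created inside it.
apiVersion: v1
kind: Namespace
metadata:
  name: review-lab-1234         # hypothetical lab name
  labels:
    purpose: review-lab         # illustrative label the cleanup automation could key on
```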

Review lab was a successful project with a number of positive outcomes. Before this environment was made generally available to engineers, it served as an essential proving ground and prototyping environment for our Kubernetes cluster design, as well as for the design and configuration of the Kubernetes resources that now describe the github/github Unicorn workload. After release, it exposed a large number of engineers to a new style of deployment, helping us build confidence through feedback from interested engineers as well as continued use by engineers who didn’t notice any change. Just recently, we observed engineers on our High Availability team use review lab to experiment with the interaction between Unicorn and a new experimental subsystem by deploying it to a shared lab. We are extremely pleased with the way this environment empowers engineers to experiment and solve problems in a self-service manner.

Weekly deployments to branch lab and review lab

Kubernetes on bare metal

With review lab shipped, we turned our attention to github.com. To meet the performance and reliability requirements of our flagship service, which depends on low-latency access to other data services, we needed to build out Kubernetes infrastructure supporting the bare-metal cloud that runs in our physical data centers and points of presence (POPs). Again, nearly a dozen sub-projects were involved in this effort:

  • A timely and thorough post about container networking (https://jvns.ca/blog/2016/12/22/container-networking/) helped us choose the Calico network provider, which gave us the out-of-the-box functionality we needed to ship a cluster quickly in IPIP mode, while leaving us the flexibility to explore peering with our network infrastructure later.

  • After no fewer than a dozen reads of @kelseyhightower’s indispensable Kubernetes the hard way (https://github.com/kelseyhightower/kubernetes-the-hard-way), we assembled a handful of manually provisioned servers into a temporary Kubernetes cluster that passed the same set of integration tests we used against our AWS clusters.

  • We built a small tool to generate the CA and configuration needed for each cluster, in a format that could be consumed by our internal Puppet and secret systems.

  • We used Puppet to manage the configuration of two instance roles, Kubernetes nodes and Kubernetes API servers (apiservers), in a way that lets a user provide the name of an already-configured cluster to join at provision time.

  • We built a small Go service to consume container logs, attach metadata to each line in key/value format, and send them to the host’s local syslog endpoint.

  • We enhanced GLB, our internal load balancing service, to support Kubernetes NodePort Services (see the sketch after this list).
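
To illustrate the NodePort integration point, here is a hedged sketch of the kind of Service an external load balancer such as GLB could target. The names and port numbers are hypothetical, not GitHub’s actual configuration.

```yaml
# Hypothetical sketch: a NodePort Service exposes the Unicorn pods on a fixed
# port on every Kubernetes node, so an external load balancer only needs the
# node addresses and that port; kube-proxy forwards traffic on to a healthy pod.
apiVersion: v1
kind: Service
metadata:
  name: unicorn-nodeport        # hypothetical name
spec:
  type: NodePort
  selector:
    app: unicorn
  ports:
    - port: 80
      targetPort: 8080          # illustrative container port
      nodePort: 30080           # illustrative fixed port for the load balancer to target
```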

The combined result of all this hard work was a cluster that passed our internal acceptance tests. Given that, we were fairly confident that the same set of inputs (the Kubernetes resources used by review lab), the same set of data (the network services review lab connected to over a VPN), and the same tools would produce a similar result. In less than a week — much of it spent on internal communication and sequencing in case the migration had a major impact — we were able to migrate this entire workload from a Kubernetes cluster running on AWS to one running inside one of our data centers.

Raising the confidence bar

With a successful and repeatable pattern for assembling Kubernetes clusters on our bare-metal cloud, it was time to build confidence in the ability of our Unicorn deployment to replace the pool of existing front-end servers.

At GitHub, it is common practice for engineers and their teams to validate new functionality by creating a Flipper feature and then opting into it as soon as it is viable. After enhancing our deployment system to deploy a new set of Kubernetes resources to a github-production namespace in parallel with our existing production servers, and enhancing GLB to support routing staff requests to a different backend based on a Flipper-influenced cookie, we allowed staff to opt into the experimental Kubernetes backend with a button in our mission control bar:

The UI for staff to opt into the Kubernetes-powered infrastructure

The load from internal users helped us find problems, fix bugs, and start getting comfortable with Kubernetes in production. During this period, we worked to increase our confidence by simulating procedures we anticipated performing in the future, writing runbooks, and performing failure tests. We also routed a small amount of production traffic to the cluster to confirm our assumptions about performance and reliability under load, starting with 100 requests per second and expanding to 10% of all requests to github.com and api.github.com. With several of these simulations under our belt, we paused briefly to re-evaluate the risk of a full migration.

Kubernetes Unicorn service design

Cluster Groups

Several of our failure tests produced results we didn’t expect. In particular, a test that simulated the failure of a single apiserver node disrupted the cluster in a way that negatively affected the availability of running workloads. Investigations into the results of these tests were inconclusive, but they helped us identify that the disruption was likely related to an interaction between the various clients that connect to the Kubernetes apiserver (such as calico-agent, kubelet, kube-proxy, and kube-controller-manager) and the behavior of our internal load balancer during an apiserver node failure. Having observed a Kubernetes cluster degrade in a way that might disrupt service, we started looking at running our flagship application on multiple clusters per site and automating the process of diverting requests away from an unhealthy cluster to other healthy ones.

Similar work was already on our roadmap to support deploying this application into multiple independently operated sites, and other positive trade-offs of this approach, including presenting a viable story for low-disruption cluster upgrades and associating clusters with existing failure domains such as shared network and power equipment, ultimately convinced us to go down this route. We eventually settled on a design that uses our deployment system’s support for deploying to multiple “partitions,” enhancing it to support cluster-specific configuration via a custom Kubernetes resource annotation and forgoing the existing federation solutions in favor of an approach that let us reuse the business logic already present in our deployment system.
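
As a rough illustration of cluster-specific configuration carried on the resources themselves, here is a hedged sketch that extends the earlier Deployment example with a custom annotation read by the deployment system. The annotation key, cluster names, and values are hypothetical and not GitHub’s actual schema.

```yaml
# Hypothetical sketch: the deployment system treats each cluster as a
# "partition" and reads a custom annotation to apply per-cluster settings,
# for example a different replica count in each failure domain.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: unicorn
  annotations:
    # hypothetical annotation key and value format
    deploy.example.com/cluster-overrides: |
      {"cluster-a": {"replicas": 40}, "cluster-b": {"replicas": 40}}
spec:
  replicas: 40                  # default, overridden per cluster at deploy time
  selector:
    matchLabels:
      app: unicorn
  template:
    metadata:
      labels:
        app: unicorn
    spec:
      containers:
        - name: unicorn
          image: registry.example.com/github/unicorn:latest  # hypothetical image
```

Keeping the override logic in the deployment system, rather than in a separate federation layer, is what allows the business logic already present there to decide how each partition should differ.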

From 10% to 100%

With cluster groups in place, we gradually converted front-end servers into Kubernetes nodes and increased the percentage of traffic routed to Kubernetes. Alongside a number of other responsible engineering teams, we completed the front-end transition in just over a month while keeping performance and error rates within our targets.

Percentage of web traffic served by the clusters

During this migration, we encountered an issue that persists to this day: during periods of high load and/or high rates of container churn, some of our Kubernetes nodes will kernel panic and reboot. While we are not satisfied with this situation and are continuing to investigate it as a high priority, we are happy that Kubernetes is able to route around these failures automatically and continue serving traffic within our target error bounds. We have performed several failure tests that simulated kernel panics with echo c > /proc/sysrq-trigger, and have found this to be a useful addition to our failure testing patterns.

What’s next?

Inspired by our experience migrating this application to Kubernetes, we look forward to migrating more systems soon. While the scope of our first migration was intentionally limited to stateless workloads, we are excited about the results of experimenting with patterns for running stateful services on Kubernetes.

During the final phase of this project, we also shipped a workflow for deploying new applications and services into a similar group of Kubernetes clusters. Over the past few months, engineers have deployed numerous applications to these clusters, each of which would previously have required configuration management and provisioning support from SREs. With a self-service application provisioning workflow in place, SRE can devote more time to delivering infrastructure products to the rest of the engineering organization in support of our best practices, building toward a faster and more resilient GitHub experience for everyone.
