[Editor's note] This is Gene Kim's summary of a two-hour interview with Randy Shoup on how Google improves DevOps. The article was translated and compiled by OneAPM together with the "Efficient Ops" community.

Randy Shoup, who helped lead engineering teams at eBay and Google, is one of the few people I've met who can clearly describe the leadership qualities needed to build highly productive DevOps organizations and world-class reliability. Two of his talks, his 2013 FlowCon presentation and his account of transforming eBay's architecture in the early 2000s, are among my favorites.

Dr. Steven Spear's model describes four capabilities of high-velocity organizations:

  • Capability 1: See problems as they occur.
  • Capability 2: Swarm problems when they are found, and build new knowledge from them.
  • Capability 3: Spread new knowledge throughout the company.
  • Capability 4: Lead by developing others.

This model was the basis for my interview with Randy Shoup, which also surfaced some practices at Google and eBay that have not been widely discussed.

(I've learned more from Randy Shoup than I can easily summarize. If you want to learn more and put it into practice at your company, contact Randy through his LinkedIn page; he is currently doing consulting work.)

Capability 1: See problems as they occur

Dr. Spear writes:

High-velocity companies specify designs in detail to capture existing knowledge and build in tests to reveal problems. Whether people work individually or in teams, with or without equipment, these companies are unwilling to accept ambiguity. They specify in advance: (a) the expected output; (b) who is responsible for which work, and in what order; (c) how products, services, and information flow from one person to the next; and (d) the method for completing each piece of work.

GK (the author): Google is surely one of the exemplars of DevOps practice, especially when it comes to automated testing.

"What goes wrong when you have thousands of engineers sharing a single continuous build?" asked Eran Messeri of Google's SCM (source code management) team during a session at GOTO Aarhus in 2013. (The session notes can be viewed here.)

He listed some noteworthy statistics (as of 2013) and described how they created the fastest, earliest, and cheapest feedback loop they could for programmers:

  • 15,000 programmers (including both development and operations)
  • 4,000 concurrent projects
  • All source code (billions of files!) checked into a single repository
  • 5,500 code commits per day from those 15,000 programmers
  • Automated tests run 75 million times a day
  • 0.5% of all engineers dedicated to developer tooling

Here is Ashish Kumar's presentation from QCon SF 2010, where you can read more about the impressive numbers the Google development teams have achieved.

Q: Google is probably the poster child for automated testing, and everyone wants to know more about your experience there.

A: That's true. Google does a lot of automated testing, more than anywhere else I've worked. "Everything" needs to be tested: not just getter/setter functionality, but everything that could go wrong.

Designing those tests is often the challenging part. The point is not to spend time writing tests that confirm what you already believe works, but to test the difficult things that can actually go wrong.

In practice, this means the team needs to do resilience testing. You usually test one component in isolation, substituting mocks for the other components so your component runs in a semi-realistic environment; but, more importantly, you inject failures into those mocks.

This way, you can routinely exercise the cases where components fail, cases that in real operation might happen only one time in a million or one in ten million (for example, two replicas of a server going down at once, a failure between the prepare and commit phases, or an entire server going down in the middle of the night).

Building these recovery tests into your daily routine and running them all the time adds up to a huge amount of work.
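
To make this concrete, here is a minimal sketch of failure injection in a unit test: a component is exercised against a mocked dependency, and a failure is deliberately injected into the mock so that the recovery path gets tested. This is not Google's test code; the names (OrderService, StorageUnavailable, and so on) are hypothetical, and the example only uses Python's standard unittest and unittest.mock modules.

```python
# Hypothetical sketch: exercise a component's recovery path by injecting a
# failure into a mocked dependency. None of these names come from Google.
import unittest
from unittest import mock


class StorageUnavailable(Exception):
    """Raised when the (hypothetical) backing store cannot be reached."""


class OrderService:
    """Toy component under test: writes an order, retrying once on failure."""

    def __init__(self, store):
        self.store = store

    def place_order(self, order_id):
        try:
            self.store.write(order_id)
        except StorageUnavailable:
            # Recovery path that the test below is designed to exercise.
            self.store.write(order_id)  # single retry
        return "ok"


class OrderServiceFaultInjectionTest(unittest.TestCase):
    def test_recovers_when_first_write_fails(self):
        store = mock.Mock()
        # Inject a failure on the first call; let the second call succeed.
        store.write.side_effect = [StorageUnavailable("replica down"), None]

        result = OrderService(store).place_order("order-42")

        self.assertEqual(result, "ok")
        self.assertEqual(store.write.call_count, 2)


if __name__ == "__main__":
    unittest.main()
```

The same pattern scales to the rarer scenarios mentioned above: the injected side effect simply becomes whichever failure you want to rehearse.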

Q: Where did Google's automated-testing discipline come from?

A: I don't know exactly how that discipline evolved; it was already in place when I arrived. It is truly amazing that every component of this massively distributed system is constantly being tested in these sophisticated ways.

As a newcomer, I didn't want to be the one who wrote something shoddy that hadn't been tested enough. As a leader, I especially didn't want to set a bad example for the team.

Here is a concrete example that illustrates some of the benefits of this culture. As you have probably read in the well-known papers (Google File System, BigTable, Megastore, and so on), Google's common infrastructure services are each built and run by an independent team, often a surprisingly small one.

They don't just write the code, they run the service. As these components mature, the teams provide not only the service itself but also client libraries that make the service easier to use. Alongside the client library, they supply fakes of the backend service for client tests, with the ability to inject all kinds of failure scenarios. For example, you can use the production BigTable client library against an emulator that behaves just like the real platform. Want to inject a failure between the write and the ACK? Just do it!
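
Here is a minimal sketch of what such a fake client library might look like. It only illustrates the pattern described above and is not the real BigTable client or emulator API; FakeTableClient, put, get, and the fail_before_ack switch are all invented names.

```python
# Hypothetical fake ("emulator") client library with a fault-injection hook.
# The interface is invented for illustration; it is not a real Google API.
class WriteFailedBeforeAck(Exception):
    """The write may or may not have been applied; the ACK never arrived."""


class FakeTableClient:
    """In-memory stand-in for a production table client, used in tests."""

    def __init__(self, fail_before_ack=False):
        self._rows = {}
        self._fail_before_ack = fail_before_ack  # fault-injection switch

    def put(self, key, value):
        # Phase 1: apply the write.
        self._rows[key] = value
        # Injected fault: the write happened, but the caller never sees the ACK.
        if self._fail_before_ack:
            raise WriteFailedBeforeAck(f"no ACK for key {key!r}")
        # Phase 2: acknowledge.
        return "ACK"

    def get(self, key):
        return self._rows.get(key)


# Usage in a client test: rehearse the case where the ACK is lost.
client = FakeTableClient(fail_before_ack=True)
try:
    client.put("user:1", {"name": "Ada"})
except WriteFailedBeforeAck:
    # The code under test must decide how to recover here, for example by
    # reading the row back or retrying with an idempotent write.
    assert client.get("user:1") is not None
```

Shipping the fake together with the client library means every consuming team gets the same failure-injection hooks without having to build their own.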

I suspect these principles and practices were honed the hard way, in emergencies where you keep asking, "How do we avoid this kind of downtime?"

Over time, those lessons were refined into the solid framework that exists today.

Capability 2: Swarm problems when they are found, and build new knowledge from them

Dr. Spear writes:

High-velocity companies are not only good at detecting problems in their systems as they arise; they are also good at (1) containing those problems before they spread and (2) identifying and solving their root causes so that they do not recur. In doing so, they build ever-deeper knowledge of how to manage the systems they actually work with, converting inevitable early gaps in understanding into knowledge.

GK: Two of the most surprising examples of swarming behavior from my research are:

  • Toyota's Andon cord, which is pulled to stop work whenever it deviates from a known pattern. A typical Toyota plant reportedly has the Andon cord pulled 3,500 times a day.

  • Alcoa's CEO, the respected Paul O'Neill, who worked to reduce workplace accidents by establishing a policy that he be notified within 24 hours of any workplace accident. And who had to make that report? The general manager of the business unit.

Q: Does Google's culture have anything similar to these mechanisms that support swarming behavior, such as Toyota's Andon cord or the Alcoa CEO's requirement that he be notified of workplace accidents?

A: Definitely. I can relate to both. At both eBay and Google there is a culture of blame-free post-mortems. (GK: John Allspaw also calls these blameless post-mortems.)

Blameless post-mortems are a very important norm: we hold a post-mortem whenever an outage affects even a single customer. As John Allspaw and others have described at length, the goal of a post-mortem is not to assign blame but to create opportunities for learning and for broad communication across the company.

I've found that an amazing dynamic emerges in a culture where post-mortems carry no punishment: engineers practically compete to reveal how badly they screwed up. For example: "Hey, we discovered a backup-and-restore procedure that had never been tested," or "Then we realized we weren't actually replicating the data." For many engineers this scenario will be familiar: "We wish it hadn't happened, but now we finally have a chance to fix that broken system we've been complaining about for months!"

This leads to massive corporate learning and, as Dr. Steven Spear describes it, allows us to constantly find and solve problems before catastrophic consequences occur.

I think it works because, at heart, we are all engineers who like to build and improve systems, and an environment where problems are exposed makes the work exciting and satisfying.

Q: What were the results of the post-mortem? You can’t just document it and throw it in the trash, right?

A: You may find this hard to believe, but I think the most important output is holding the post-mortem itself. We know that the most important part of DevOps is culture, and simply holding the meeting improves the system even if it produces no other output.

It becomes a kind of kata, part of our daily routine that demonstrates what we value and how we prioritize work.

Of course, a post-mortem almost always produces a record of what worked and what didn't, and from that a set of items to add to the work queue (for example, backlog entries for needed features, enhancements, improved documentation, and so on).

When you identify an improvement that needs to be made, you eventually have to change something somewhere: sometimes documentation, sometimes process, code, environments, or something else.

But even without that, there is huge value in simply writing the post-mortem document, because, as you can imagine, everything at Google is searchable, and everyone at Google can read every post-mortem.

When a similar incident happens in the future, those post-mortems are the first thing people read.

Interestingly, post-mortem documents serve another purpose. Google has a long tradition of requiring developers to run any new service themselves for at least six months. When a service team asks to "graduate" (that is, to have a dedicated SRE team of operations engineers maintain it), they essentially make a pitch to the SREs, asking them to take over responsibility for running the application.

(Gene: See the video in which Tom Limoncelli describes this "hand-off readiness review" process, where SREs review the documentation, deployment mechanisms, monitoring profiles, and so on. It's a great video!) A big part of that process is the SREs examining the post-mortem documents first, to determine whether the application is ready to "graduate."

Q: Do you see anything at Google like what Paul O'Neill and his team did at Alcoa? Are there examples of the notification and escalation threshold being lowered over time?

GK: Dr. Spear describes how Paul O'Neill and his team at Alcoa reduced the rate of workplace injuries on their aluminum plant floors (astonishing environments full of heat, high pressure, and corrosive chemicals) from 2% to 0.07% per year, making the company one of the safest in the industry. Remarkably, once the injury rate dropped below a certain level, O'Neill asked employees to notify him even when something merely might have gone wrong.

A: Yes, there is. Our equivalent of a workplace accident is, of course, an outage that affects customers. And trust me, when there is a major customer-affecting failure, leadership gets notified. When an incident occurs, two things happen:

  1. We mobilize everyone needed to restore service, and they work on the problem directly and continuously until it is resolved (which is, of course, standard procedure).

  2. We also hold a weekly incident meeting with management. (On my App Engine team, the attendees were the two engineering leads, of which I was one, our boss, the support lead, and the product manager.) We review what we learned in the post-mortems, look at what needs to be done next, and make sure the problem has been properly addressed; if necessary, we decide whether to publish a public post-mortem or a blog post.

Some weeks there is nothing to report. But whenever a situation had to be brought back under control, the team always wants to review it so we have fewer problems and keep improving, including problems that "didn't affect customers" but did affect the team.

Many of these are the "near misses" most of you will have experienced: six safeguards are in place, all designed to keep users from being affected by a failure, and every one of them fails except the last.

On my team (Google App Engine), we probably had about one user-affecting public outage per year, but of course behind each one there were several near misses.

This is why we run disaster recovery testing (DiRT) exercises, as Kripa Krishnan has discussed here.

Google did a good job here and we learned a lot (which is why we kept three backups of production), but Amazon did even better; their work in this area was five years ahead of everyone else's. (Jason McHugh, an Amazon S3 architect who is now at Facebook, gave a QCon 2009 talk on fault management at Amazon.)

Q: At Alcoa, workplace incidents need to be reported to the CEO within 24 hours. Is there a similar timeline for an upward escalation of issues to leadership at Google?

A: On Google App Engine we had a very small team (about 100 engineers worldwide) and only two levels: the engineers who do the work and the managers. We would wake people up in the middle of the night whenever an incident affected customers, and roughly one in ten of those incidents was escalated to the company's leadership.

Q: How would you describe how swarming happens?

A: As in a Toyota plant, not everyone swarms on every problem. But culturally, we treat reliability and quality issues as priority zero.

This shows up in many ways, some of them less obvious and more subtle than responding to downtime.

When code is checked in that breaks the tests, there is no other work to do until they are fixed, and no further breakage is allowed to pile up on top of them.

Similarly, if someone is stuck on a problem and needs help, they expect you to drop everything and help. Why? Because that is how we prioritize; it's like the golden rule. Helping someone get unblocked helps everyone move forward.

Of course, they will do the same for you when you need help.

From a systems perspective, I think of it like a ratchet, the anti-rollback mechanism on a roller coaster: it keeps us from sliding backwards.

None of this is a formal rule or written process, but everyone knows that when something clearly abnormal like this is affecting users, we send out an alert, send some emails, and so on.

The message is usually “Hi everybody, I need your help” and we go and help.

The reason I think it always works is that, even without formal instructions or rules, everyone knows that our job is not just to “write code”, but to “run services”.

Even problems with global dependencies (such as load balancers or misconfigured global infrastructure) can often be mitigated in seconds, and incidents resolved within 5 to 10 minutes.

Capability 3: Spread new knowledge throughout the company

Dr. Spear writes:

High-velocity companies multiply the power of new knowledge by making it available throughout the company, not just to the discoverer. They share not only the solution to a problem but also the process of finding it: what was learned and how it was learned. While their competitors let problems, and their solutions, stay where they were found, the responsible people in high-velocity companies spread the problem and the discovery across the whole organization. That means people start their work with the accumulated experience of everyone else in the company. We will see a couple of examples of this multiplier effect.

Q: How does knowledge spread when problems occur? How do local discoveries get turned into global improvements?

A: Part of it, though not the largest part, is the post-mortem documentation. Google is probably as prone to mishaps as any other company, and whenever there is a high-profile outage, you can be sure that almost everyone in the company will read the post-mortem.

Perhaps the most powerful mechanism for preventing repeat failures is the single code base shared by all of Google. More importantly, because you can search the entire code base, it is easy to leverage someone else's experience. No matter how formal or consistent the documentation, it is better to look at what people actually do in practice: "look at the code."

There is a downside, though. Typically, the first person to use a service might pick an arbitrary configuration, and that configuration then spreads like wildfire through the company. Suddenly, for no good reason, a random setting like "37" is circulating everywhere.

As long as knowledge is easy to access and disseminate, it will spread everywhere, whether or not it happens to be the optimal setting.

Q: Besides the single source code base and blameless post-mortems, what other mechanisms turn local learning into global improvement? What else spreads knowledge?

A: One of the great things about the Google source code base is that you can find everything in it. The best answer to almost any question is "look at the code."

Second, there is excellent documentation that you can find just by searching. And there are internal user groups: just as with an external service, the team that writes service "foo" creates a mailing list called "foo-user," and you ask your question there. It's great that you can reach the developers directly, but in most cases it is another user who answers, just as happens with successful open-source projects in the industry.

Capability 4: Lead by developing others

Dr. Spear writes:

Managers in high-velocity companies recognize that a regular part of everyone's work is not only to deliver products and services but also to continuously improve the processes by which those products and services are delivered. They teach people how to make that continuous improvement part of their jobs and give them enough time and resources to do it. In this way, the company keeps improving its own reliability and adaptability. This is the fundamental difference between these companies and their less successful competitors. Rather than commanding, controlling, berating, threatening, or evaluating others through a litany of metrics, managers in high-velocity companies make sure their organizations become ever more self-diagnosing and self-improving: skilled at detecting problems, solving them, and multiplying the effect by spreading the solutions across the company.

GK: I also like the quote from David Marquet (author of "Turn the Ship Around!"): the mark of a true leader is how many leaders he or she develops. The former submarine commander produced more future leaders than any other submarine captain in history.

The gist of his work is this: Some leaders solve problems, but once they leave, the problems resurface because they fail to make the system work without them.

Q: How does leadership develop at Google?

A: Google does pretty much everything you would find in any healthy company. There are two career paths, the engineering track and the management track, and anyone with the word "manager" in their title is mainly responsible for enabling others: "making things possible" and encouraging others to lead.

I see my role as creating small teams where everyone matters. Each team is more like a symphony orchestra than a factory: everyone can play solo, but more importantly, they can all play together. We've all had the bad experience of team members yelling at each other or not listening to one another.

At Google, I think a leader's strongest influence is the cultural vision of what great engineering looks like. One of the big cultural norms is, "Everyone writes great tests; we don't want to be the team that writes crappy tests." There is also the shared sense that "we only hire people who contribute," which matters to me emotionally.

At Google, some of this is codified in the evaluation and promotion process, which may sound bad, as if people only do whatever work gets them promoted. But in practice the process is well regarded and almost universally seen as substantive: people advance because they contribute and are good at what they do. I've never heard of anyone being promoted because they cozied up to the right people or played politics.

The main criterion for senior and leadership positions is leadership itself: whether the person has had a significant impact that reaches well beyond their own team and their own individual work.

The Google App Engine service was launched seven years ago by an amazing group of engineers from the cluster management group who thought, "Hey, we have all this technology for building scalable systems. How about we build something other people can use?"

Being one of the original builders of App Engine carries enormous respect internally, much as being a founder of Facebook does in the wider industry.

Q: How do new managers learn how things are done? If leaders must develop other leaders, how do new or front-line managers learn what is expected of them?

A: At Google, you are given the job you are already doing, unlike most other companies, where you are given the job you hope to grow into.

That is, if you want to be a chief engineer, start doing the job of a chief engineer. And Google, like many large companies, has plenty of formal training resources.

But for the most part, the cultural norms about how work is done are so strong that the dominant effect is simply that those norms perpetuate themselves. It is a self-selecting process that reinforces both the cultural norms and the technical practices.

Of course, it also has to do with the style at the top. Google was founded by two geeky engineers, and the culture has grown up under the influence of that style.

If you're in a command-and-control company where leaders treat people badly, that message gets transmitted and reinforced throughout the company.

Conclusion

Again, I've learned more from Randy Shoup than I can easily summarize. If you're interested in learning more and applying it at your company, contact Randy directly through his LinkedIn page; he currently works as a consultant.

Uncovering The DevOps Improvement Principles At Google (Randy Shoup Interview)

This article was translated and compiled by OneAPM engineers. OneAPM is an emerging leader in application performance management, helping enterprise users and developers capture slow application code and SQL statements in real time. For more technical articles, visit OneAPM's official blog.

This article is for learning and communication purposes only and does not represent the views of the Turing community. For non-commercial reprint, please indicate the translator and source, and keep the original link of this article.