At the DevOps Enterprise Summit in London in June, Google customer reliability engineer Stephen Thorne gave a talk clarifying what SRE (Site Reliability Engineering) is, and pointed out why many enterprises misunderstand its basic premises and advantages.
Thorne sees a major misconception among companies about SRE: they conflate service level objectives (SLOs) with service level agreements (SLAs). SLOs focus on detecting problems early, whereas SLAs are typically used to determine financial compensation for failures that have already occurred. Rather than enforcing error budgets and requiring SRE teams to spend at least half of their effort on improving systems and tools, these companies keep people constantly on the run in production, a pattern often called “firefighting.”
Thorne adds that SLOs are the foundation for spotting problems early, ideally before customers feel their impact. A good SLO is tied to what the customer experiences (for example, service availability or response time), so it reflects whether the system’s behavior meets users’ needs. Resource-usage metrics (such as CPU utilization or network throughput) are worth monitoring, but they should not themselves be used as SLOs. As Thorne puts it, “if the customer is happy, then the SLO is happy.” Some typical SLOs at Google include:
- 99.9% monthly uptime (only 43 minutes of downtime per month).
- 99.99% of HTTP requests per month return “200 OK” successfully.
- 50% of HTTP requests return within 300 milliseconds.
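To make the arithmetic behind these targets concrete, here is a minimal Python sketch. It assumes a 30-day month, and the function names and sample data are hypothetical illustrations, not anything shown in Thorne's talk.

```python
def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over a window of `days`."""
    total_minutes = days * 24 * 60
    return (1.0 - slo) * total_minutes


def success_ratio(status_codes: list) -> float:
    """Fraction of requests that returned HTTP 200."""
    if not status_codes:
        raise ValueError("no requests to evaluate")
    return sum(1 for code in status_codes if code == 200) / len(status_codes)


def fraction_within(latencies_ms: list, threshold_ms: float) -> float:
    """Fraction of requests that completed within `threshold_ms`."""
    if not latencies_ms:
        raise ValueError("no requests to evaluate")
    return sum(1 for ms in latencies_ms if ms <= threshold_ms) / len(latencies_ms)


# 99.9% uptime over an assumed 30-day month leaves about 43 minutes of downtime.
print(round(allowed_downtime_minutes(0.999), 1))  # 43.2

# Hypothetical sample: 3 of 4 requests succeed, 2 of 4 finish within 300 ms.
print(success_ratio([200, 200, 500, 200]))        # 0.75
print(fraction_within([120, 280, 310, 900], 300)) # 0.5
```

Each of these measures is computed from what users actually experience (uptime, successful responses, latency) rather than from resource usage.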
SLAs usually come into play only when the customer is already dissatisfied with the service, so they do not actively improve system reliability. SLAs can also drive the wrong behavior. For example, if there is a two-hour SLA for fixing e-mail problems and a one-day SLA for fixing serious problems in the production system, following the SLAs to the letter means that e-mail problems get handled first, even though production problems should obviously take priority.
Thorne warns that defining an SLO is not enough. An error budget policy sets explicit operational rules (rather than monetary compensation) for the actions to take before the system approaches the SLO threshold, so that the SLO continues to be met. It also minimizes conflict between operations and development when the system fails to meet user requirements. Thorne notes that “the gap between perfect reliability and an SLO is an error budget.” A typical error budget policy at Google is to block new feature releases once an application has used up its error budget (e.g., the 43-minute outage budget for the month), or to create a dedicated sprint based on the corrective actions identified in a post-mortem analysis.
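The policy of blocking feature releases once the budget is spent can be sketched as a simple gate. This is only an illustration under stated assumptions: the `ErrorBudget` class, its fields, and the 30-day window are hypothetical, and a real policy would normally draw on monitoring data rather than a hand-entered downtime figure.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float             # availability target, e.g. 0.999
    window_minutes: float  # length of the SLO window, e.g. 30 * 24 * 60

    @property
    def budget_minutes(self) -> float:
        """The gap between perfect reliability and the SLO, in minutes."""
        return (1.0 - self.slo) * self.window_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        """Budget left after the downtime observed so far in the window."""
        return self.budget_minutes - downtime_minutes

    def feature_releases_allowed(self, downtime_minutes: float) -> bool:
        """Hypothetical policy: freeze new features once the budget is used up."""
        return self.remaining_minutes(downtime_minutes) > 0


budget = ErrorBudget(slo=0.999, window_minutes=30 * 24 * 60)
print(budget.feature_releases_allowed(downtime_minutes=20))  # True
print(budget.feature_releases_allowed(downtime_minutes=50))  # False
```

The rule is operational rather than financial: exceeding the budget changes what the team works on next, not what the customer is paid.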
Thorne stresses, however, that what works for Google won’t work for every organization.
The SLOs that SRE calls for should strike a balance between acceptable failures, necessary cost, and speed of delivery.
The exact SLOs and policies must be tailored to each company (rather than copied and pasted from Google), and should focus on continuously improving the customer experience rather than on setting lofty goals or harsh penalties that may backfire.
Thorne gave an example in his talk of a company that was trying to reduce the processing time of a recommendation system. On average, however, users did not see the recommendations until they returned six hours later. An appropriate SLO was therefore to process all recommendations within six hours, which freed three part-time engineers from working on the “problem” of slow response times.
Thorne brings up a third key point about SRE: SRE teams should be able to balance day-to-day (often unplanned) operational work, also known as “firefighting,” against planned work that reduces manual effort. At Google, this means that at least 50% of SRE time should go to project work, such as evaluating the architecture of new systems as early as possible to find resiliency anti-patterns and avoid more work later, improving monitoring, automating repetitive tasks, or coordinating corrective actions after failures.
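As a rough illustration of the 50% guideline, the following sketch computes the share of a team's logged hours that went to project work. The category names and the hours are hypothetical; how work is classified is an assumption each team would have to make for itself.

```python
def project_work_share(hours_by_category: dict, project_categories: set) -> float:
    """Share of total logged hours spent in the categories counted as project work."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    project = sum(
        hours for category, hours in hours_by_category.items()
        if category in project_categories
    )
    return project / total


# Hypothetical week for one SRE: 40 hours split across four categories.
week = {"on_call": 10, "tickets": 8, "automation": 14, "design_review": 8}
share = project_work_share(week, {"automation", "design_review"})

print(share)         # 0.55
print(share >= 0.5)  # True: at least half the time went to project work
```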
Thorne further identifies some SRE anti-patterns: for example, simply renaming the operations team to the SRE team, or simply hiring SRE engineers, without first putting SRE principles and mechanisms in place (SLOs, error budget policies, and balanced workloads).
Thorne believes there are five key steps to successful SRE implementation:
- Define customer-focused SLOs based on usage scenarios;
- Define a reasonable error budget policy;
- Hire SRE staff (internally or externally) and empower them with leadership support;
- Support the SREs in tuning the SLOs and enforcing the error budget policy;
- Assign responsibility for the reliability of mission-critical systems to the SRE team, and responsibility for other systems to the appropriate development teams.
Slides from Thorne’s talk: https://github.com/devopsenterprise/2018-London/blob/master/Tuesday/Breakout%20Sessions/Throne%2C%20Stephen%2C%20Getting %20Started%20with%20Site%20Reliability%20Engineering.pdf