Background
Implementing SLO-based alerting is an important part of building out SRE practice systematically; this article is translated from the Google team's method for alerting on SLOs.
For SRE background, see: Geek Time SRE Notes
Translated from the original: https://sre.google/workbook/alerting-on-slos/ by Steven Thurgood with Jess Frame, Anthony Lenton, Carmela Quinito, Anton Tolchanov, and Nejc Trdin
This article shows how to turn SLOs into alerting rules in engineering practice so that you can respond to significant events. Both our first SRE book and this book discuss implementing SLOs. We believe that an SLO that captures the platform's reliability gives on-call engineers a trustworthy signal to act on. Here we give specific guidance on turning those SLOs into alerting rules, so that you can respond to problems before you have consumed too much of your error budget.
Our examples present a series of increasingly complex alerting rules and examine the strengths and weaknesses of each. Although the examples use a simple request-driven service and Prometheus syntax, you can apply this approach in any alerting framework.
Alerting Considerations
To generate alerts from service level indicators (SLIs) and the error budget, you need a way to combine these two factors into a concrete rule. Your goal: be alerted whenever an incident consumes a large fraction of the error budget.
When evaluating an alerting strategy, consider the following attributes:
Precision
The proportion of detected events that were actually significant. Precision is 100% if every alert corresponds to a significant event. Note that alerting can become especially sensitive to unimportant events during low-traffic periods (discussed later under low-traffic services and error budget alerting).
Recall
The proportion of significant events that are detected. Recall is 100% if every significant event results in an alert.
Detection time
How long it takes to send an alert notification under various conditions. Long detection times can negatively impact the error budget.
Reset time
How long alerts keep firing after a problem has been resolved. Long reset times can lead to confusion or to problems being ignored.
Methods for alerting on significant events
Building alerting rules for your SLO can become quite complicated. Here we present six ways to configure alerts on significant events that give you good control over the four attributes above: precision, recall, detection time, and reset time. Each approach addresses a different problem, and some end up solving several problems at once. The first three approaches are not viable on their own, but they can be combined with the last three, which are viable alerting strategies; approach 6 is the most viable and is strongly recommended. The first approach is simple to implement but inadequate, while the best approach provides a complete solution that defends the SLO over both the long and the short term.
For the purposes of this discussion, "error budget" and "error rate" apply to all SLIs, not just those with "error" in their names. We recommend using SLIs that express the ratio of successful events to total events. The error budget gives the number of bad events allowed, and the error rate is the ratio of bad events to total events.
Method 1: Error rate ≥ SLO threshold
The simplest solution is to choose a small time window (say, 10 minutes) and alert when the error rate over that window exceeds the SLO.
For example, if the SLO is 99.9% over a 30-day period, alert when the error rate over the preceding 10 minutes is at or above 0.1%:
- alert: HighErrorRate
  expr: job:slo_errors_per_request:ratio_rate10m{job="myjob"} >= 0.001
note
This 10-minute average is computed with a Prometheus recording rule:

- record: job:slo_errors_per_request:ratio_rate10m
  expr: sum(rate(slo_errors[10m])) by (job) / sum(rate(slo_requests[10m])) by (job)

If you do not export slo_errors and slo_requests from your job, you can rename an existing metric to create the time series:

- record: slo_errors
  expr: http_errors
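For context, here is a minimal sketch of how the recording rule and the method 1 alert might sit together in a single Prometheus rules file. It is illustrative only: the group name and the severity label are assumptions, and it relies on the slo_errors and slo_requests metrics described above.

groups:
  - name: slo-alerts        # illustrative group name
    rules:
      # Recording rule: 10-minute error ratio per job
      - record: job:slo_errors_per_request:ratio_rate10m
        expr: sum(rate(slo_errors[10m])) by (job) / sum(rate(slo_requests[10m])) by (job)
      # Method 1 alert: fire when the 10-minute error rate reaches the SLO threshold
      - alert: HighErrorRate
        expr: job:slo_errors_per_request:ratio_rate10m{job="myjob"} >= 0.001
        labels:
          severity: page    # assumed label; route it however your setup expects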
When the error rate is exactly at the SLO threshold, the error budget consumed by the time the alert fires is:
alert window size / SLO reporting period
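To make this concrete with the numbers above (10-minute window, 30-day reporting period):

10 minutes / (30 * 24 * 60 minutes) ≈ 0.02% of the 30-day error budget

which is the figure cited in the disadvantages below.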
The following figure shows the relationship between error rate and detection time, assuming a 10-minute alert window and a 99.9% SLO:
This approach has the following advantages and disadvantages when a sudden high error rate occurs:
advantages
Good detection time: 0.6 seconds for a complete outage.
Any event that threatens the SLO triggers this alert, so recall is good.
disadvantages
Precision is low: the alert fires on many events that do not threaten the SLO. A 0.1% error rate for 10 minutes triggers an alert, yet each such alert consumes at most 0.02% of the monthly error budget.
To take an extreme example: you could receive up to 144 alerts per day (one per 10-minute window), act on none of them, and still meet the SLO.
Method 2: Increase the alert window
The problem with the previous approach is that the window is short. We can increase the window so that the alert fires only when a significant amount of error budget has been consumed.
To keep the rate of alerts manageable, the alert should fire when an event consumes 5% of the 30-day error budget, which corresponds to a 36-hour window:
- alert: HighErrorRate
expr: job:slo_errors_per_request:ratio_rate36h{job="myjob"} > 0.001
In this case, the detection time is:
((1 - SLO) / error rate) * alert window size
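For example, for a complete outage (100% error rate) under a 99.9% SLO:

((1 - 0.999) / 1) * 36 hours ≈ 0.036 hours ≈ 2 minutes 10 seconds

which is the detection time cited in the advantages below.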
With a large alert window, a sustained high error rate has the following advantages and disadvantages:
advantages
Detection time is still good: 2 minutes and 10 seconds for a complete outage.
Better precision than method 1: because the error rate must be sustained for longer, an alert is likely to represent a significant threat to the SLO.
disadvantages
Poor reset time: after a complete outage, the alert fires after 2 minutes and then keeps firing for the next 36 hours.
A larger window means many more data points to aggregate, which makes the computation expensive (memory and I/O costs).
The figure below shows a 36-hour alert window in which a high error rate occurs briefly and then drops quickly to a negligible level, yet the average error rate over the 36-hour period remains above the alert threshold:
Method 3: Increase the alert duration
Most monitoring systems let you add a duration parameter to an alert rule so that the alert fires only when the error rate stays above the threshold for that duration. This can look like a low-cost way to get a larger effective window:
- alert: HighErrorRate
  expr: job:slo_errors_per_request:ratio_rate1m{job="myjob"} > 0.001
  for: 1h
The advantages and disadvantages of adding a duration parameter to the alert rule are as follows:
advantages
Higher precision: requiring the error rate to be sustained for the specified duration means an alert is more likely to correspond to a significant event.
disadvantages
Poor recall and poor detection time: because the duration does not scale with the severity of the incident, a 100% outage alerts after one hour, the same detection time as a 0.2% outage. A one-hour 100% outage consumes 140% of the 30-day error budget (see the worked calculation below).
(For this reason, we do not recommend using durations as part of SLO-based alerting criteria.)
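To spell out the 140% figure:

30-day error budget at a 99.9% SLO = 0.1% * 30 days ≈ 43 minutes of full outage
60 minutes / 43 minutes ≈ 140% of the 30-day error budget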
The figure below shows a 5-minute alert window with a 10-minute duration: the service is completely down for 5 minutes out of every 10, yet the alert never fires, even though these outages consume 35% of the 30-day error budget.
(Each spike consumed nearly 12% of the 30-day error budget, but the alert never fired.)
Method 4: Burn rate alerts
To improve on the previous methods, you want an approach that has both good detection time and high precision. To that end, we recommend alerting on burn rate, which lets you shrink the alert window while keeping the amount of error budget spend constant.
Burn rate is how fast, relative to the SLO, the service is consuming its error budget. The figure below shows the relationship between burn rate and error budget.
(Figure: relationship between burn rate and error budget)
The example above assumes an SLO of 99.9% over a 30-day window. A burn rate of 1 (that is, a constant 0.1% error rate) means the error budget is exactly exhausted at the end of the window.
The following table shows the relationship between burn rate, error rate, and the time it takes to exhaust the error budget:
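For a 99.9% SLO over a 30-day window, some representative values are:

Burn rate | Error rate for a 99.9% SLO | Time to exhaust the budget
1         | 0.1%                       | 30 days
2         | 0.2%                       | 15 days
10        | 1%                         | 3 days
1000      | 100%                       | 43 minutes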
Suppose the alert window is fixed at one hour and you want to be notified when 5% of the 30-day error budget has been consumed; you can then derive the burn rate the alert should fire on.
For burn-rate-based alerts, the time required to trigger the alert is:
((1 - SLO) / error rate) * alert window size * burn rate
The error budget consumed by the time the alert fires is:
(burn rate * alert window size) / SLO reporting period
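Plugging in the numbers from this example (99.9% SLO, 30-day period = 720 hours, 1-hour window, 5% budget spend):

burn rate = (5% * 720 hours) / 1 hour = 36
detection time for a complete outage = ((1 - 0.999) / 1) * 1 hour * 36 ≈ 2.2 minutes
error budget consumed when the alert fires = (36 * 1 hour) / 720 hours = 5%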
With a burn rate of 36, that is, 5% of the 30-day error budget consumed in one hour, the alert rule becomes:
- alert: HighErrorRate
  expr: job:slo_errors_per_request:ratio_rate1h{job="myjob"} > 36 * 0.001
Advantages and disadvantages of burn rate alerts:
advantages
Good precision: the alert fires only when a large amount of error budget has been spent
A shorter alert window, which keeps the computational cost low
Good detection time
Better reset time: 58 minutes
disadvantages
Lower recall: a burn rate of 35 never triggers an alert, yet it exhausts the 30-day error budget in 20.5 hours (30 days / 35 ≈ 20.5 hours).
Method 5: Multiple burn rate alerts
Your alerting rules can use multiple burn rates and alert windows and fire whenever any one of the burn rates crosses its threshold. This keeps the benefits of burn rate alerting while ensuring that lower but still significant error rates are not missed.
It is also good practice to set up ticket-level notification for incidents that go unnoticed but still consume error budget (for example, one that consumes 10% of the budget in three days). Such an error rate catches significant events, but the budget burn rate is slow enough that no one needs to be paged immediately.
We recommend 2% budget consumption in one hour and 5% budget consumption in six hours as reasonable starting points for paging, and 10% budget consumption in three days as a good baseline for a ticket alert. The right numbers depend on the service and its baseline page load; for busier services, and depending on on-call responsibilities over weekends and holidays, you may want ticket alerts for the 6-hour window.
The following table shows the relationship between burn rates, alert windows, and the corresponding error budget consumed:
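(Values derived from the thresholds used in the configuration below.)

Error budget consumed | Alert window | Burn rate | Notification
2%                    | 1 hour       | 14.4      | page
5%                    | 6 hours      | 6         | page
10%                   | 3 days       | 1         | ticket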
The alert configuration looks something like this:
expr: (
job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
or
job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
)
severity: page

expr: job:slo_errors_per_request:ratio_rate3d{job="myjob"} > 0.001
severity: ticket
The following figure shows the relationship between detection time, alert type, and error rate:
Depending on how quickly the different alerts need a response, you can configure multiple burn rate alerts at different priorities. If an incident will consume the error budget within hours or a few days, actively sending a notification (a page) is appropriate; otherwise, a ticket-based alert handled on the next working day is more appropriate.
The advantages and disadvantages of using multiple burn rate alerts are listed below:
advantages
Adapts to incidents of different severity: an alert fires quickly when the error rate is high, and eventually fires when the error rate is low but sustained
Good precision, as with all approaches that alert on a fixed amount of budget spend
Good recall, thanks to the 3-day alert window
Lets you choose the most appropriate alert type for how quickly each threat to the SLO needs a response
disadvantages
More numbers, alert windows, and thresholds to manage and reason about, and more to compute
An even longer reset time, because of the 3-day window
To avoid multiple alerts firing at once when several conditions are true, you need alert suppression. For example, consuming 10% of the budget in 5 minutes also means consuming 5% within 6 hours and 2% within 1 hour; this would trigger three notifications unless the alerting system is smart enough to prevent it (one possible approach is sketched below)
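One way to handle this in the Prometheus ecosystem, sketched here as an assumption rather than part of the original text, is an Alertmanager inhibition rule that mutes the ticket-level notification for a job while a page-level alert for the same job is firing (this assumes the severity labels page and ticket used above):

inhibit_rules:
  # While a page-severity alert is firing for a job, suppress ticket-severity alerts for that job
  - source_matchers:
      - severity="page"
    target_matchers:
      - severity="ticket"
    equal: ['job']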
Method 6: Multi-window, multi-burn-rate alerts
We can iterate on method 5 so that we are notified only while the budget is still actively being consumed, which reduces the number of false positives. To do this, we add one more parameter: a shorter window used to check whether the error budget is still being consumed when the alert fires.
A good guideline is to make the short window 1/12 the length of the long window, as shown in the figure below. The figure shows both alert thresholds: after 10 minutes at a 15% error rate, the short-window average immediately exceeds its threshold, and the long-window average exceeds its threshold after 5 minutes, at which point the alert starts firing (both windows must be over the threshold). The short-window average drops back below the threshold 5 minutes after the errors stop, while the long-window average does not drop below it until 60 minutes after the errors stop.
For example, you could send a page-level alert when the error budget burn rate over both the previous hour and the previous 5 minutes exceeds 14.4. The alert still fires only once 2% of the error budget has been consumed, but it stops firing 5 minutes after the problem ends rather than an hour later, which is a much better reset time.
expr: (
        job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
      and
        job:slo_errors_per_request:ratio_rate5m{job="myjob"} > (14.4*0.001)
      )
    or
      (
        job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
      and
        job:slo_errors_per_request:ratio_rate30m{job="myjob"} > (6*0.001)
      )
severity: page

expr: (
        job:slo_errors_per_request:ratio_rate24h{job="myjob"} > (3*0.001)
      and
        job:slo_errors_per_request:ratio_rate2h{job="myjob"} > (3*0.001)
      )
    or
      (
        job:slo_errors_per_request:ratio_rate3d{job="myjob"} > 0.001
      and
        job:slo_errors_per_request:ratio_rate6h{job="myjob"} > 0.001
      )
severity: ticket
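Note that this configuration assumes a recorded error ratio exists for every window it references (5m, 30m, 1h, 2h, 6h, 24h, 3d). As a sketch, each one follows the same pattern as the recording rule shown earlier; for example, for the 5-minute window (assuming the same slo_errors and slo_requests metrics):

- record: job:slo_errors_per_request:ratio_rate5m
  expr: sum(rate(slo_errors[5m])) by (job) / sum(rate(slo_requests[5m])) by (job)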
We recommend the following parameters as starting values for the SLO alert configuration:
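(Parameters derived from the expressions in the configuration above.)

Severity | Long window | Short window | Burn rate | Error budget consumed
page     | 1 hour      | 5 minutes    | 14.4      | 2%
page     | 6 hours     | 30 minutes   | 6         | 5%
ticket   | 24 hours    | 2 hours      | 3         | 10%
ticket   | 3 days      | 6 hours      | 1         | 10%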
We have found that alerting on multiple burn rates over multiple windows is a powerful way to implement SLO-based alerting.
The pros and cons of this approach are shown below:
advantages
A flexible alerting framework: you can configure different types of alert notification based on team needs and incident severity
Good precision, as with all approaches that alert on a fixed amount of budget spend
Good recall, thanks to the 3-day alert window
disadvantages
A large number of parameters (windows, burn rates, thresholds) to specify, which can be hard to manage.