DevOps engineers inevitably have to deal with all kinds of alerts and incidents, so managing them efficiently and reliably has become an unavoidable topic in the DevOps workflow. This article introduces our incident workflow and some of the lessons we learned while putting it into practice.

Incident sources

Incidents have many sources, such as manual reports from the customer service team and alerts sent by the monitoring system. Most of our incidents come from the latter, and those alerts in turn come from white-box, black-box, and infrastructure monitoring.

White-box monitoring

So-called white-box monitoring focuses on metrics inside the application, such as requests per second. The goal is to nip problems in the bud before they turn into incidents that users can notice.

We use a combination of Prometheus and Alertmanager: Prometheus collects the applications' internal metrics through exporters, evaluates the alerting rule expressions, and sends any alerts that match the rules to Alertmanager.

 Node Exporter      <---+
 Kube-state-metrics <---|   +------------+
 Nginx exporter     <---+---+ Prometheus |
 Redis exporter     <---|   +-----+------+
 ...                <---+         |
                                  | Evaluate alert rules,
                                  | send alerts if any
                                  | condition hits
                                  v
                          +--------------+
                          | Alertmanager |
                          +--------------+
                                  |
                                  v
                               +-----+
                               |  ?  |
                               +-----+

We have already covered the use of Prometheus with Alertmanager in detail in previous articles, so we will not expand on it here.
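Still, as a minimal sketch of the white-box side, an application can expose its internal metrics to Prometheus with the official prometheus_client library. The metric names and port below are made up for illustration; the alerting rules that Prometheus evaluates on top of these metrics live in its own rule files.

    # A minimal white-box metrics endpoint using the official Prometheus Python client.
    # The metric names and port are illustrative only.
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests handled by the app")
    LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

    def handle_request() -> None:
        """Pretend to handle a request and record white-box metrics."""
        REQUESTS.inc()
        with LATENCY.time():
            time.sleep(random.uniform(0.01, 0.1))  # simulated work

    if __name__ == "__main__":
        start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics
        while True:
            handle_request()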

Black-box monitoring

Black-box monitoring focuses more on “ongoing events”, such as high application response times. These symptoms match the user’s perspective and what the user can actually observe, so whenever possible we want problems reported as soon as they occur, rather than waiting for user feedback.

We used to use StatusCake as a probe that periodically sent requests to our services: if a request timed out, the connection was refused, or the response did not contain the expected content, an alert would be triggered. However, we once found that some of the checks had unexpectedly stopped running for more than an hour. During that time no downtime alert could have been raised for the affected services, and StatusCake did not inform us of the problem in time. We later confirmed with StatusCake support that it was a bug on their side:

Hi there,

We’ve since resolved this issue and we have credited a half month’s credit to your account to be deducted from the next subscription cost. The issue at the time was related to an error on a single testing server.

My apologies for the inconvenience here and won’t see that happen again!

Kind regards,

StatusCake Team

We could not be sure whether similar incidents had gone unnoticed before, and given the poor experience overall, we decided to drop StatusCake completely and switch to Pingdom.

                 +---------+
                 | Pingdom |
                 +---+--+--+
                     |  |
Continuously sending |  | Send alerts if
requests to probe    |  | the services are
the black box        |  | considered down
              +------+  |
              |         |
       +------v---+    +v----+
       | Services |    |  ?  |
       +----------+    +-----+

We created a large number of checks, covering every project, every environment, and every individual component as thoroughly as possible, and let Pingdom keep track of the status of each service for us.
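Conceptually, each of these checks does something like the sketch below: send a request from the outside and treat a timeout, a refused connection, a bad status code, or missing expected content as “down”. This is only an illustration of the idea, not Pingdom’s implementation; the URL and expected string are placeholders.

    # Conceptual black-box probe: periodically request a URL from outside
    # and flag the service as down on timeout, connection error, bad status,
    # or missing expected content. Names and URLs below are placeholders.
    import time
    import requests

    CHECKS = [
        {"name": "api-production", "url": "https://api.example.com/health", "expect": "ok"},
    ]

    def probe(check: dict) -> bool:
        """Return True if the service looks healthy from the outside."""
        try:
            resp = requests.get(check["url"], timeout=10)
        except requests.RequestException:  # timeout, refused connection, DNS failure...
            return False
        return resp.ok and check["expect"] in resp.text

    if __name__ == "__main__":
        while True:
            for check in CHECKS:
                if not probe(check):
                    print(f"ALERT: {check['name']} looks down")  # a real check would page, not print
            time.sleep(60)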

For monitoring scheduled tasks we chose Cronitor, which we have already introduced in detail in a previous article.
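The pattern is essentially a heartbeat around the scheduled job, as in the sketch below; the ping URL and the "state" parameter are placeholders to show the idea, not Cronitor’s actual telemetry endpoint.

    # Heartbeat pattern around a cron job: report "run", then "complete" or "fail".
    # The ping URL and parameters are placeholders, not a real Cronitor endpoint.
    import requests

    PING_URL = "https://example.com/ping/nightly-backup"

    def run_backup() -> None:
        ...  # the actual scheduled work goes here

    def main() -> None:
        requests.get(PING_URL, params={"state": "run"}, timeout=5)
        try:
            run_backup()
        except Exception:
            requests.get(PING_URL, params={"state": "fail"}, timeout=5)
            raise
        requests.get(PING_URL, params={"state": "complete"}, timeout=5)

    if __name__ == "__main__":
        main()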

We use Terraform to manage all of the monitoring items described above as code. We also strongly recommend using a third-party service for black-box monitoring, rather than building your own on top of cloud providers such as AWS or Azure. StatusCake, for example, states clearly in its documentation that it works with a large number of partner vendors and never uses AWS or Azure nodes, precisely to avoid a single point of failure. Black-box monitoring is a critical part of the incident workflow, and we need to do our best to keep it from failing together with a large-scale cloud provider outage. That is very, very unlikely, but not unheard of: when AWS S3 went down, the AWS status dashboard could not be updated because the warning icons were stored on the very service that was down.

Infrastructure monitoring

In addition to white-box monitoring of applications and black-box monitoring of services, infrastructure monitoring is also an important part. Our main cloud provider is AWS, and we use managed services such as RDS, SQS, and ElastiCache, so we also need to keep an eye on their metrics and alarms. For this part we adopted AWS's own Amazon CloudWatch.

Since infrastructure alarms are relatively rare, and to keep this article at a reasonable length, we will not cover them in detail here.
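For completeness, this is roughly what a single such alarm looks like when created programmatically with boto3; the region, instance identifier, threshold, and SNS topic ARN are placeholders, and the alarm’s notification target would be wired into the incident workflow described below.

    # Sketch: a CloudWatch alarm on RDS CPU utilization, created with boto3.
    # All names, thresholds, and ARNs below are placeholders.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="rds-production-high-cpu",
        AlarmDescription="RDS CPU above 80% for 10 minutes",
        Namespace="AWS/RDS",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "production-db"}],
        Statistic="Average",
        Period=300,               # seconds per datapoint
        EvaluationPeriods=2,      # two consecutive 5-minute periods
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # e.g. an SNS topic
    )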

Incident notification

After an incident occurs, information about it needs to reach DevOps. We currently define two main notification channels as the “exits” of an incident.

Slack

As our company's internal instant messaging tool, Slack also carries a large number of message notifications, including those from GitLab and Cronitor. We created a separate channel dedicated to alerts, called #ops-alerts.

The DevOps team set the channel's notification preferences to notify on every new message, and Preferences -> Notifications -> My Keywords can be used as well, to make sure notifications are delivered to our iPhones and iMacs in a timely manner.

On-call

The on-call arrangement is not very common in China. Simply put, it means being ready to answer the “alert notification call” at any time.

While DevOps engineers pay as much attention to Slack notifications as they can, there are still times when an alert may not reach them promptly, for example:

  • Non-working hours,
  • Messages that are simply missed,
  • Being busy handling other issues.

An urgent alert, such as production being down, needs to be handled immediately. We therefore set up an on-call rotation in which DevOps engineers take turns being on duty, and according to the on-call schedule, higher-priority incidents are delivered to the person on duty through text-to-speech (TTS) phone calls.

In addition, the DevOps team adds the alerting phone number to the iPhone contact list and turns on Emergency Bypass for it, so there is no need to worry about accidentally leaving the mute switch on or missing the call in Do Not Disturb mode.

Connecting sources to notifications and planning the workflow

Opsgenie is Atlassian's incident management product. It offers a wide range of integrations, from monitoring systems such as AWS CloudWatch and Pingdom to messaging tools such as Slack and RingCentral. This allows it to receive alerts from almost any source, create incidents, and notify the person in charge through a variety of channels.

For example, most of the products mentioned above, from Alertmanager and Pingdom to CloudWatch and Slack, are connected to Opsgenie through these integrations.

In addition, after receiving an alert notification, the engineer should acknowledge it immediately.

If the alert is not acknowledged within a certain amount of time, it escalates and the notification is sent to someone at a higher level.
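Acknowledging is usually a single tap in the Opsgenie app or in Slack, but the same action can also be performed against Opsgenie's Alert REST API. A minimal sketch, with the API key and alert ID as placeholders:

    # Sketch: acknowledging an Opsgenie alert through the Alert REST API.
    # The API key and alert ID are placeholders.
    import requests

    OPSGENIE_API_KEY = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
    ALERT_ID = "example-alert-id"

    resp = requests.post(
        f"https://api.opsgenie.com/v2/alerts/{ALERT_ID}/acknowledge",
        headers={"Authorization": f"GenieKey {OPSGENIE_API_KEY}"},
        json={"user": "devops@example.com", "note": "Looking into it"},
    )
    resp.raise_for_status()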

Opsgenie also comes with a clear dashboard. It provides detailed incident information once DevOps is notified, and since alerts from multiple sources are aggregated on the same page, it also gives DevOps a macro view of the whole system at that moment, which helps in identifying the root cause of the problem.

We also assign different priorities, such as P1 and P3, to alerts based on their source and tags (the two Watchdog alarms in the dashboard screenshot above, for example, are P3). P1 incidents are notified through the on-call phone call, while the other priorities are notified via Slack messages.

+--------------+  +---------+  +------------+
| Alertmanager |  | Pingdom |  | CloudWatch |
+------+-------+  +---+-----+  +---------+--+
       |              |                  |
       |              |                  |
       +-----------v  v  v---------------+
                +-----+------+
                |            |
                |  Opsgenie  |
                |            |
                +----+--+----+
                     |  |
                   P1|  |P3
               Alerts|  |Alerts
                     |  |
     +---------+     |  |     +-------+
     | On-call | <---+  +---> | Slack |
     +---------+              +-------+
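The integrations deliver alerts into Opsgenie automatically, but the priority field is easiest to see through the Alert API itself. A sketch with a placeholder API key, message, and tags: an alert created with priority P1 triggers the on-call phone call, while a P3 alert only ends up in Slack.

    # Sketch: creating an Opsgenie alert with an explicit priority.
    # The API key, message, and tags are placeholders; in practice the
    # Alertmanager/Pingdom/CloudWatch integrations create these alerts for us.
    import requests

    OPSGENIE_API_KEY = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

    resp = requests.post(
        "https://api.opsgenie.com/v2/alerts",
        headers={"Authorization": f"GenieKey {OPSGENIE_API_KEY}"},
        json={
            "message": "Production API is down",
            "priority": "P1",                 # P1 -> on-call TTS call, P3 -> Slack only
            "tags": ["production", "black-box"],
        },
    )
    resp.raise_for_status()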

Conclusion

That's our incident workflow. Feel free to share your own experience with us.


Please follow our WeChat official account “RightCapital”.