Background
We have been using Cronitor to monitor our scheduled tasks, and Terraform to "codify" that monitoring, for quite a while now. Unlike a long-running service, monitoring a scheduled task is usually concerned not with whether the process keeps running and serving requests properly, but with:
- Did it start at the scheduled time?
- Did it succeed or fail?
- Were there any errors during execution?
- Did the execution time spike or drop compared with usual?
In ancient times
A few years ago, we hooked into the termination event of every scheduled task and sent a success-or-failure notification to a dedicated Slack channel.
An engineer on the team therefore had to check the Slack notifications periodically to confirm that the cron jobs had behaved normally. Moreover, if a task never ran, or died before the termination hook could fire, no notification reached Slack at all, so engineers also had to count the messages in the channel; when the count fell short of expectations, they identified the failed tasks by elimination, crossing off the ones that had reported success.
Initially this worked, because the number of scheduled tasks was very small. As the number of tasks grew, the cost of relying on visual inspection and the probability of human error soared, and engineers grew fed up with the repetitive grind.
Besides, this workflow is antithetical to DevOps thinking.
The introduction of Cronitor
So we introduced Cronitor, whose core logic went something like this:
- Create a monitor for each scheduled task via the UI or API, specifying when the cron job is expected to start and the tolerable grace period.
- When a task starts, finishes, or hits an exception, it reports the event to Cronitor with an HTTP request to the /ping API.
- If Cronitor does not receive the expected start, finish, or fail ping at the expected time, the task is considered to have failed; the other failure cases follow the same logic.
- Failures and errors are forwarded to Slack, a webhook, and so on.
Because Cronitor only sends failure notifications, the engineers' job changes from sifting through a flood of messages to find the failures, to simply dealing with the failures that are reported, which greatly reduces the human burden.
Cronitor supports integrations with Slack, PagerDuty, Opsgenie, VictorOps, and more; we configured Slack.
In addition, Cronitor's web UI shows the status of every cron job at a glance.
One tip: there is also Healthchecks.io, a service similar to Cronitor.
Assigning responsibility to individuals
At this stage we had reduced the labor cost, but it still took an engineer to follow the Slack messages and route each error to the right person. As the company grew, so did the number of scheduled tasks and the size of the development team, and the person reading the messages sometimes didn't know whom an error should go to. We therefore needed to assign an owner to every scheduled task.
Because Cronitor has no built-in concept of an owner, we built our own solution on top of Cronitor's tags and webhooks.
We attach tags of the form Owner:* to our scheduled tasks. At the same time, we replaced the original Slack integration with a webhook that sends information about failed tasks to an in-house tool called the Cronitor Failure Dispatcher, which forwards the failure notification to the appropriate Slack channel. Before sending the notification, the tool reads the task's Owner tag and mentions the designated person in the message, for example @Anny.Wang.
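Jumping ahead a little to the Terraform syntax introduced later in this post, the combination of an owner tag and a dispatcher webhook looks roughly like the sketch below; the owner name, webhook URL, and schedule are made up for illustration.

resource "cronitor_heartbeat_monitor" "nightly_report" {
  name = "Nightly Report"

  notifications {
    # Failed-task information goes to the Cronitor Failure Dispatcher
    # instead of straight to Slack
    webhooks = ["https://example.com/cronitor-failure-dispatcher"]
  }

  tags = [
    # The dispatcher reads this tag to decide whom to mention in Slack
    "Owner:Anny.Wang",
  ]

  rule {
    # Every day at 2:00 AM
    value         = "0 2 * * *"
    grace_seconds = 300
  }
}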
The new problem
The workflow for monitoring the execution results of scheduled tasks was now fully automated, but the monitors themselves still had to be created by hand in the web UI. That seems manageable with a handful of cron jobs, but most of our projects run in four environments: development, staging, UAT, and production. For an already large number of scheduled tasks, we therefore had to create four monitors each. This means:
- A heavy workload and plenty of room for human error (for example, the same monitor being configured inconsistently across environments). In particular, monitors created later tend to drift slightly from those created earlier, and that drift is hard to verify.
- No traceability. The web UI has no audit log, so when a configuration changes it is hard to find out who changed it and why.
- Global changes (such as adding an environment or updating the configuration of every monitor) are all but impossible.
The introduction of Terraform
Inspired by the idea of Infrastructure as Code, we wondered: Could we implement Monitors as Code? The answer is yes.
We already had plenty of hands-on Terraform experience, having codified our infrastructure with it for a long time. Terraform is designed to let users declare almost any kind of "resource": a provider defines what properties a resource has, you describe the desired resources in code, and Terraform manages them for you by computing the diff between the declared state and the actual state and performing whatever create, update, or delete actions are needed.
A provider is the piece that talks to the actual backing service, whether that means calling HTTP APIs, modifying database records, or even editing files.
Since there was no official Cronitor Terraform provider, nor an equivalent from the community, we wrote one and open-sourced it: github.com/nauxliu/ter… . Here is a simple example declaring a cronitor_heartbeat_monitor:
terraform {
  backend "local" {
    path = "terraform.tfstate"
  }
}

variable "cronitor_api_key" {
  description = "The API key of Cronitor."
}

provider "cronitor" {
  api_key = var.cronitor_api_key
}

resource "cronitor_heartbeat_monitor" "monitor" {
  name = "Test Monitor"

  notifications {
    webhooks = ["https://example.com/"]
  }

  tags = [
    "foo",
    "bar",
  ]

  rule {
    # Every day at 8:00 AM
    value         = "0 8 * * *"
    grace_seconds = 60
  }
}
Tip: we recommend crontab.guru, a small tool that translates a cron expression into human-readable language.
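With the declaration above, running terraform plan shows the diff between the monitor defined in code and whatever currently exists in Cronitor, and terraform apply performs the API calls needed to reconcile the two; removing the resource block and applying again deletes the monitor.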
To commit the above to version control safely, we decoupled the Cronitor API key into a Terraform variable. You can then either store the key in a terraform.tfvars file or set it through an environment variable, which is convenient for CI/CD.
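For example, a local terraform.tfvars file (kept out of Git) can contain nothing but the key, while CI supplies the same value through Terraform's standard TF_VAR_cronitor_api_key environment variable; the value below is obviously a placeholder.

# terraform.tfvars (not committed to the repository)
cronitor_api_key = "your-cronitor-api-key"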
Of course, the real code is not this simple. We package the monitors each project needs into Terraform modules for reuse, map each environment to a Terraform workspace, and reference the modules from the workspaces as needed. This keeps things flexible even when some projects do not exist in certain environments, as sketched below.
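The sketch below shows one way to express the idea, with a hypothetical project, module path, and variable name; including modules conditionally like this requires Terraform 0.13 or later, which allows count on module blocks, and our actual layout is more involved.

# Each environment is a Terraform workspace: development, staging, uat, production.
# A project's monitors live in a reusable module and are referenced per workspace.
module "billing_monitors" {
  source = "./modules/billing"

  # Skip this project in environments where it does not exist
  count = contains(["staging", "production"], terraform.workspace) ? 1 : 0

  # Let the module name and tag its monitors per environment
  environment = terraform.workspace
}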
Finally, we put all of this code in a Git repository and wired it up to GitLab CI, fully automating the path from code commit to applied changes.
Final workflow
                           +--------------+
                           | Users Commit |
                           +------+-------+
                                  |
                                  v
                           +------+---------+
                           | Git Repository |
                           +------+---------+
                                  |
                                  |   CI/CD   +-----------+
                                  | <-------> | Terraform |
                                  |           +-----------+
                                  v
+-----------+   HTTP    +---------+---------+
| Cron jobs +---------> | Cronitor Monitors |
+-----------+  Request  +---------+---------+
                                  |
                                  | Webhook
                                  v
                    +-------------+---------------+
                    | Cronitor Failure Dispatcher |
                    +-----+--------+--------+-----+
                          |        |        |
                          v        v        v
                       User A   User B   User C
You are welcome to follow our WeChat official account, "RightCapital".