Introduction: Alarms are an everyday need for any company. Beyond infrastructure monitoring alarms (CPU, memory, disk, and so on) for operations and maintenance, some companies also need alarms on application metrics (such as QPS and RT) and business metrics (such as GMV and daily active users).

Author | Huang Xiaomeng

01 Background

Alarms are an everyday need for any company. Beyond infrastructure monitoring alarms (CPU, memory, disk, and so on) for operations and maintenance, some companies also need alarms on application metrics (such as QPS and RT) and business metrics (such as GMV and daily active users).

Early in a business's life, the infrastructure is small and applications take a single form, so these needs tend to be handled in a fairly crude, direct way. As the business grows, and especially once it reaches a scale of millions or billions, the number of monitoring metrics rises exponentially, which poses a huge challenge for the alarm system. Keeping alarms effective and timely at this volume becomes a top priority of IT governance. This article starts from the volume of monitoring metrics and explains the challenges the alarm system meets at each stage.

02 Schematic diagram of a routine alarm process

As shown in the figure below, a conventional alarm process consists of several core steps, including the concurrency check, completeness check, data fill, and threshold judgment. To keep alarms timely, the whole process is generally triggered at second-level granularity, as follows:

Among them, the alarm background task processing system is the focus of our discussion. Its core steps are described as follows:

  1. Concurrency check: Checks whether the current alarm rule is already being executed by another process or node. This prevents a rule from being executed repeatedly when a run takes too long, or from being preempted by another task node.
  2. Completeness check: Obtains the completeness time of the data source behind the current alarm rule, that is, the time up to which the latest data has been reported. Data collection and reporting are delayed, so a check that runs before the data is complete can easily miss alarms or raise false ones.
  3. Data query: Fetches the data for the rule from the monitoring data store, such as a log service that holds collected logs (for example, Elasticsearch) or a basic monitoring metric store (for example, Zabbix or Prometheus).
  4. Data fill: The policy an alarm task applies when data points are missing. There are three options: fill with 0, fill with a value, and no fill. For example, for an alarm that fires when a business metric drops to zero, we prefer to fill with 0; for an alarm on average CPU usage exceeding 80%, we prefer not to fill.
  5. Threshold judgment: Decides whether an alarm needs to be triggered, based on the data obtained and the alarm condition.
  6. Alarm: Sends the alarm information to the configured recipients by SMS, notification, or email for follow-up handling.
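The six steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the implementation of any particular alarm system: the names `AlarmRule`, `run_rule`, `fill_missing`, and the in-memory `running` set (standing in for a shared lock store) are all hypothetical, and only the fill-with-0 and no-fill policies are modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

# Hypothetical policy names; the third policy from the text is not modeled.
FILL_ZERO = "fill_zero"
NO_FILL = "no_fill"


@dataclass
class AlarmRule:
    rule_id: str
    fill_policy: str                           # FILL_ZERO or NO_FILL
    window: List[int]                          # timestamps the rule expects data for
    condition: Callable[[List[float]], bool]   # True means "raise an alarm"


# Stand-in for a lock store shared between processes or nodes (assumption).
running: Set[str] = set()


def fill_missing(points: Dict[int, float], window: List[int], policy: str) -> List[float]:
    """Apply the missing-point policy to timestamps that have no data."""
    if policy == FILL_ZERO:
        return [points.get(ts, 0.0) for ts in window]
    return [points[ts] for ts in window if ts in points]


def run_rule(
    rule: AlarmRule,
    complete_until: int,
    query: Callable[[AlarmRule], Dict[int, float]],
    notify: Callable[[str], None],
) -> str:
    # 1. Concurrency check: skip if this rule is already being evaluated.
    if rule.rule_id in running:
        return "skipped: already running"
    running.add(rule.rule_id)
    try:
        # 2. Completeness check: wait until the source has reported the whole window.
        if complete_until < max(rule.window):
            return "skipped: data incomplete"
        # 3. Data query.
        points = query(rule)
        # 4. Data fill for missing points.
        values = fill_missing(points, rule.window, rule.fill_policy)
        # 5. Threshold judgment.
        if rule.condition(values):
            # 6. Alarm.
            notify(rule.rule_id)
            return "alarm"
        return "ok"
    finally:
        running.discard(rule.rule_id)
```

With a CPU rule such as `AlarmRule("cpu", NO_FILL, [0, 60, 120], lambda v: bool(v) and sum(v) / len(v) > 80)` and one missing point, the no-fill policy averages only the two reported values. Filling the gap with 0 would pull the average down and could hide the alarm, which is why the text prefers not to fill in that case.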

03 In-process scheduling scheme

When the business is just starting out, there are usually few alarm tasks. At this stage the usual implementation runs them on an in-process thread pool. The architecture diagram is as follows:

The “background task processing system” shown above can be brought up quickly on a single machine for small-scale scenarios. As business volume keeps rising, however, a single machine hits a resource bottleneck. The instinctive reaction is to scale out the task processing system and let different servers handle different alarm rules. But as alarm rules keep increasing, the growing load can cause a server to restart or hang suddenly. High availability, idempotent task execution, failover, and other distributed-systems problems then become complex to solve.
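Both stages described in this section can be sketched in a few lines. The first function runs one evaluation round on an in-process thread pool. The second splits rules across servers by a stable hash, and the third shows the gap this leaves when a server dies. All function names, the hash-modulo assignment, and the worker count are assumptions made for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Set


def check_round(rule_ids: List[str], check: Callable[[str], bool], workers: int = 4) -> Dict[str, bool]:
    """Single machine: evaluate every rule once on an in-process thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(rule_ids, pool.map(check, rule_ids)))


def assign_rules(rule_ids: List[str], servers: List[str]) -> Dict[str, List[str]]:
    """Scale-out: give each rule to exactly one server using a stable hash of its ID."""
    assignment: Dict[str, List[str]] = {server: [] for server in servers}
    for rule_id in rule_ids:
        index = zlib.crc32(rule_id.encode("utf-8")) % len(servers)
        assignment[servers[index]].append(rule_id)
    return assignment


def orphaned_rules(assignment: Dict[str, List[str]], alive: Set[str]) -> List[str]:
    """Rules that nobody evaluates if dead servers are not replaced or rebalanced."""
    return [
        rule_id
        for server, rule_ids in assignment.items()
        if server not in alive
        for rule_id in rule_ids
    ]
```

A static assignment like this has no answer for a server that restarts or hangs: every rule returned by `orphaned_rules` silently stops being checked until something reassigns it. That gap is the failover problem the paragraph above refers to.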

04 Distributed Scheduling Solution

Once the number of tasks reaches the tens of thousands, the goal is a lightweight distributed solution. The basic idea of a distributed scheduling scheme is to schedule tasks through a separate task scheduling center while the alarm backend only executes them. Separating task scheduling from task execution lets both layers scale horizontally to increase capacity. In this implementation, each alarm rule generates one scheduled task, so that alarm rules are executed in a load-balanced way. Many open source products are available, such as Quartz, XXL-JOB, and Elastic-Job. Take Quartz as an example, as shown below:

As the figure above shows, every Quartz server loads the full set of tasks. When a task's trigger time arrives, all servers compete for a lock through the database, and the server that wins the lock triggers the task on the alarm center.

This architecture solves distributed scheduling and idempotent task execution, and the execution layer can scale horizontally, so it runs stably when the number of tasks is low.

However, the architecture diagram above also shows that Quartz scheduling relies mainly on polling and locking the database, so the throughput of the whole system is tied closely to the database's specifications and performance. According to testing, when the number of tasks triggered at a one-minute scheduling frequency reaches 10,000, there is an obvious scheduling delay.
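The idea of several servers competing through the database so that only one fires a task can be illustrated as follows. This sketch does not reproduce Quartz's actual locking. It swaps in a simpler technique, a unique-key insert into a hypothetical `fired` table in an in-memory SQLite database, so that the first server to record a given (task, fire time) pair wins.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create the shared store; the table name and columns are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fired ("
        " task_id TEXT, fire_time INTEGER, server TEXT,"
        " PRIMARY KEY (task_id, fire_time))"
    )
    return conn


def try_fire(conn: sqlite3.Connection, task_id: str, fire_time: int, server: str) -> bool:
    """Return True only for the one server that claims this trigger first."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO fired (task_id, fire_time, server) VALUES (?, ?, ?)",
                (task_id, fire_time, server),
            )
        return True
    except sqlite3.IntegrityError:
        # Another server already claimed this (task, fire time) pair.
        return False


if __name__ == "__main__":
    store = open_store()
    for name in ["server-A", "server-B", "server-C"]:
        won = try_fire(store, "rule-1", 1000, name)
        print(name, "fires the task" if won else "backs off")
```

Every trigger costs at least one database write per competing server, which is why the load on the database grows with both the number of tasks and the number of scheduler nodes.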
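A back-of-the-envelope calculation shows why a database lock becomes the limit at this volume. It assumes that lock acquisitions are fully serialized, which is a simplification rather than a measurement from the test mentioned above. The 10 ms lock time in the usage note is likewise a hypothetical figure.

```python
def lock_budget_ms(triggers_per_minute: int) -> float:
    """Average time available per trigger if all triggers take the same lock in turn."""
    return 60_000 / triggers_per_minute


def backlog(triggers_per_minute: int, lock_ms: float, seconds: int) -> float:
    """Triggers still waiting after `seconds`, given a fixed cost per lock acquisition."""
    arrived = triggers_per_minute * seconds / 60
    served = seconds * 1000 / lock_ms
    return max(0.0, arrived - served)
```

At 10,000 triggers per minute, `lock_budget_ms(10_000)` gives 6.0 ms per trigger. If each acquisition took 10 ms, `backlog(10_000, 10, 60)` gives 4000.0 triggers still waiting after one minute, and that backlog keeps growing in every following minute.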

05 Scheduling scheme based on SchedulerX 2.0

1. Advantages of SchedulerX 2.0

SchedulerX 2.0 is a commercial distributed task scheduling platform developed by Alibaba. Compared with open source task scheduling systems, it has several advantages:

  • Support for massive numbers of tasks
  • Self-developed lightweight distributed batch model
  • Visual task orchestration
  • Commercial-grade alarms
  • Visual log service

SchedulerX 2.0 architecture diagram

Compared with common schemes, SchedulerX 2.0 distributes tasks across different servers, so no lock has to be acquired each time a task is scheduled. Scheduling does not interact with the database, which removes that performance bottleneck.

2. High availability

One of the most common issues in distributed systems is high availability. What happens if a Server in SchedulerX 2.0 dies?

As shown in the preceding figure, each application is kept in three replicas. A ZooKeeper (Zk) lock designates one active server and two standby servers. If a server fails, failover takes place and another server takes over its scheduling tasks.
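The active/standby arrangement can be illustrated with a toy lock. This is a sketch of the general pattern only: `InMemoryLock` is a hypothetical stand-in, not the ZooKeeper API, and nothing here reflects how SchedulerX 2.0 implements failover internally.

```python
from typing import Dict, List, Optional


class InMemoryLock:
    """Hypothetical stand-in for a lock held in a coordination service."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None

    def try_acquire(self, server: str) -> bool:
        """Take the lock if it is free; report whether `server` holds it."""
        if self.holder is None:
            self.holder = server
        return self.holder == server

    def expire(self, server: str) -> None:
        """Simulate the holder's session ending when that server dies."""
        if self.holder == server:
            self.holder = None


def elect(servers: List[str], lock: InMemoryLock) -> Dict[str, str]:
    """Every live server tries the lock; the holder is active, the rest stand by."""
    return {s: "active" if lock.try_acquire(s) else "standby" for s in servers}


if __name__ == "__main__":
    lock = InMemoryLock()
    print(elect(["s1", "s2", "s3"], lock))  # s1 takes the lock first
    lock.expire("s1")                       # s1 dies
    print(elect(["s2", "s3"], lock))        # one of the standbys takes over
```

Because the standbys keep retrying the same lock, promotion needs no manual step: as soon as the failed holder's claim disappears, the next attempt by a standby succeeds.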

3. Commercial-grade alarms

SchedulerX 2.0 currently supports three alarm channels: DingTalk, SMS, and email:

Alarms can be raised for task failure, task timeout, and no available machine:

Taking the DingTalk alarm as an example, alarms are received in real time:

06 Summary

SchedulerX 2.0 supports the businesses of all business groups in Alibaba Group and has been through the test of Double 11 many times. More than 1,000 enterprises currently use it on the public cloud, and we have ample experience with massive task volumes and high availability. SchedulerX 2.0 is currently one of the best solutions for very large-scale task scheduling.

This article is the original content of Aliyun and shall not be reproduced without permission.