Thoughts on Operations Fault Management (with an open-source fault management system)

Fault detection, fault prevention, fault handling, fault self-healing, and eliminating manual fault handling

Why do fault management?

A fault is an event in the production environment in which service unavailability, instability, or performance degradation degrades the user experience or makes functions unavailable. Murphy’s Law tells us:

Nothing is ever as simple as it seems
Everything takes longer than you think
Anything that can go wrong will go wrong
If you worry that something might happen, it becomes more likely to happen

No matter how small the probability of a failure is, if it can occur, it will. Hine’s law gives a similar warning: behind every serious accident there are 29 minor accidents, 300 near misses, and 1,000 latent hazards. In other words, a serious accident is the accumulation of many small problems, which turn into a qualitative change and a serious problem once they reach a certain level. Therefore, to meet the SLA, to detect and locate faults early, to avoid secondary faults, to resolve problems such as unclear responsibility boundaries and unclear ownership of improvements, and even to let faults heal themselves, we need a standardized set of fault management principles to follow, so that the impact on projects is reduced.

Fault Management Objectives

Reduce faults and improve troubleshooting efficiency
Enhance the stability of online products and improve SLA
Consolidate operations problems into a knowledge base
Improve fault detection and monitoring
Provide a basis for fault self-healing

Fault classification standard

To measure the scope and degree of impact, a unified grading standard should be agreed jointly with the PM, product, and development teams, so that nobody shirks responsibility or shrugs the fault off in later post-mortem reviews. Fault levels are generally based on MTBF (mean time between failures; the longer the interval between faults, the higher the reliability), MTTR (mean time to recovery; the shorter it is, the smaller the impact), MTTF (mean time to failure, that is, the average time the system runs normally before a fault occurs; the longer it is, the higher the reliability), and similar metrics. In our game operations, faults are graded by the number of players affected and the duration of the fault:

S indicates a fault that affects players
T indicates a fault that does not affect players
1, 2, 3 indicate severity, from highest to lowest

Player impact	Fault scope	Fault level
Affects players	Core business (login, game, payment) is unavailable and >= 15% of online users are affected; the affected cluster is unavailable for more than 4 h	S1 level fault
Affects players	Core business (login, game, payment) is unavailable and >= 5% but < 15% of online users are affected; 1 h < unavailable time of the affected cluster < 4 h	S2 level fault
Affects players	Peripheral game features are unavailable and users are slightly affected; unavailable time of the affected game cluster <= 1 h	S3 level fault
Does not affect players	Peripheral game features are unavailable, but normal use by users is not affected	T1 level fault
Does not affect players	Peripheral game features are available, but some node servers have failed	T2 level fault
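
The grading table can be turned into a small function so that every fault is graded the same way. The following is only a sketch under stated assumptions: the `Fault` fields and the `classify` function are hypothetical names, the user-ratio and duration criteria are treated as alternatives (whichever is more severe wins), and the handling of boundary values such as exactly 15% of users or exactly 4 hours is an assumption of this sketch, not part of the standard above.

```python
from dataclasses import dataclass


@dataclass
class Fault:
    """Hypothetical description of a fault; field names are illustrative."""
    core_unavailable: bool        # login, game, or payment is down
    peripheral_unavailable: bool  # a peripheral game feature is down
    players_affected: bool        # True -> S levels, False -> T levels
    affected_user_ratio: float    # fraction of online users affected (0.0 to 1.0)
    cluster_down_hours: float     # how long the affected cluster was unavailable


def classify(fault: Fault) -> str:
    """Return S1/S2/S3/T1/T2 following the grading table."""
    if not fault.players_affected:
        # T levels: players do not notice the fault
        return "T1" if fault.peripheral_unavailable else "T2"

    # Assumption: 15% of users or 4 hours of downtime counts as S1
    if (fault.core_unavailable and fault.affected_user_ratio >= 0.15) \
            or fault.cluster_down_hours >= 4:
        return "S1"
    if (fault.core_unavailable and fault.affected_user_ratio >= 0.05) \
            or fault.cluster_down_hours > 1:
        return "S2"
    # Everything else that players notice is graded S3
    return "S3"


if __name__ == "__main__":
    login_outage = Fault(True, False, True, 0.20, 0.5)
    print(classify(login_outage))
```

Keeping the thresholds in one function means a change to the standard is made in one place, and the agreed level can be stored with each fault record.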

Fault Management Process

After a fault is confirmed through player feedback, monitoring alarms, or planned changes (such as a version update that requires downtime), notify the project quality assurance group
Operations makes a preliminary assessment of the fault's symptoms, scope, and cause, and tells development and the DBAs whether they need to step in
Determine the handling priority based on the impact of the fault
Locate and fix the fault
After recovery, if the fault was major, development, operations, the DBAs, and other involved parties review and analyze the fault together
Draw up an improvement plan covering the monitoring and emergency measures that need to be improved
Record the troubleshooting process and improvement measures in the FMS fault management system
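
Once each fault is recorded with its start and recovery times, the MTTR, MTBF, and MTTF metrics mentioned earlier can be computed from the records. The sketch below is illustrative only: `FaultRecord` and its fields are hypothetical and do not reflect the actual FMS data model, MTBF is taken here as the mean interval between the starts of consecutive faults, and MTTF as the mean fault-free time between one recovery and the next fault.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class FaultRecord:
    """Hypothetical fault record; not the real FMS schema."""
    title: str
    level: str            # for example "S1" or "T2"
    started: datetime     # when the fault began
    recovered: datetime   # when service was restored
    improvements: str = ""


def mttr(records: List[FaultRecord]) -> timedelta:
    """Mean time to recovery: average of (recovered - started)."""
    total = sum((r.recovered - r.started for r in records), timedelta())
    return total / len(records)


def mtbf(records: List[FaultRecord]) -> timedelta:
    """Mean interval between consecutive fault starts (needs >= 2 records)."""
    ordered = sorted(records, key=lambda r: r.started)
    gaps = [b.started - a.started for a, b in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


def mttf(records: List[FaultRecord]) -> timedelta:
    """Mean fault-free time between a recovery and the next fault (needs >= 2 records)."""
    ordered = sorted(records, key=lambda r: r.started)
    gaps = [b.started - a.recovered for a, b in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


if __name__ == "__main__":
    history = [
        FaultRecord("login outage", "S1", datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 2)),
        FaultRecord("payment delay", "S2", datetime(2024, 1, 11, 0), datetime(2024, 1, 11, 1)),
    ]
    print(mttr(history), mtbf(history), mttf(history))
```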

Fault analysis report template:

Fault self-healing

Abstract the detection scripts for unknown faults. When a secondary fault alarm is raised, Zabbix remotely executes the corresponding handling logic. The practice of Blue Whale is a useful reference: self-healing is offered as a package to be consumed.
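
The idea of mapping alarms to handling logic can be sketched as a small dispatcher. This is a minimal illustration, not how Zabbix or Blue Whale implement self-healing: the alarm keys, the `handler` decorator, and the `SelfHealer` class are all hypothetical, the handlers only return a description of the action instead of executing anything, and "secondary" is interpreted here as the second and later occurrences of the same alarm on the same host. In practice such a script might be invoked by a monitoring system's remote command with the alarm key and host as arguments.

```python
from collections import Counter
from typing import Callable, Dict, Optional

# Hypothetical registry: alarm key -> remediation function
HANDLERS: Dict[str, Callable[[str], str]] = {}


def handler(alarm_key: str):
    """Decorator that registers a remediation function for an alarm key."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        HANDLERS[alarm_key] = func
        return func
    return register


@handler("nginx.down")
def restart_nginx(host: str) -> str:
    # A real handler would run the remediation; this one only describes it
    return f"restart nginx on {host}"


class SelfHealer:
    """Runs a registered handler only when the same alarm recurs."""

    def __init__(self) -> None:
        self.seen: Counter = Counter()

    def on_alarm(self, alarm_key: str, host: str) -> Optional[str]:
        self.seen[(alarm_key, host)] += 1
        if self.seen[(alarm_key, host)] < 2:
            return None  # first occurrence: left to a human to diagnose
        func = HANDLERS.get(alarm_key)
        return func(host) if func else None  # no handler: no automatic action


if __name__ == "__main__":
    healer = SelfHealer()
    print(healer.on_alarm("nginx.down", "web1"))
    print(healer.on_alarm("nginx.down", "web1"))
```

Registering handlers by alarm key keeps each remediation small and independent, which is close in spirit to consuming self-healing as a package.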

FMS fault management system

1. Function modules

The FMS fault management system was developed following the fault management approach described above. Its functions are as follows:

2. Screenshots

3. The FMS project

https://github.com/geekwolf/fms (suggestions are welcome; please open an issue)