When we encounter a problem in production, the first priority is not to locate its cause immediately, but to restore the system to a normal state and minimize the damage.

Here are some ways to prevent or resolve production problems (including but not limited to):

Automated testing

Run automated tests periodically

Daily inspection

Inspections can be performed manually or by the system itself

On-call duty

Make sure the right people are available to deal with problems promptly

Fault self-healing system

For a known problem, the manual recovery steps can be scripted so that the corresponding failure triggers them automatically and the system heals itself
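
One way to picture scripted self-healing is a small registry that maps each known failure type to the action an operator would otherwise perform by hand. The sketch below is only an illustration: the failure name `disk_full`, the `REMEDIES` registry, and the `clean_temp_files` action are all hypothetical placeholders, not part of any real tool.

```python
from typing import Callable, Dict

# Hypothetical registry: failure type -> scripted remedy.
REMEDIES: Dict[str, Callable[[], str]] = {}


def remedy(failure_type: str):
    """Register a function as the scripted remedy for one failure type."""
    def register(func: Callable[[], str]) -> Callable[[], str]:
        REMEDIES[failure_type] = func
        return func
    return register


@remedy("disk_full")
def clean_temp_files() -> str:
    # Placeholder for the steps an operator would normally run by hand.
    return "temp files cleaned"


def heal(failure_type: str) -> str:
    """Run the remedy for a detected failure, or escalate if none exists."""
    action = REMEDIES.get(failure_type)
    if action is None:
        return "no remedy registered, escalate to on-call"
    return action()
```

A monitoring loop would call `heal(...)` with the failure type it detected; anything without a registered remedy still goes to a person.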


When capacity is insufficient, scale out by adding resources


Some bugs are intermittent, and a restart can quickly restore normal service

Rollback

Roll back to the last stable version
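
Rolling back presupposes that the last stable version is recorded somewhere. A minimal sketch of that bookkeeping, assuming versions are plain strings and that "stable" is a mark applied after a release has proven itself (the `ReleaseHistory` class is hypothetical):

```python
from typing import List


class ReleaseHistory:
    """Track deployed versions and remember which ones were marked stable."""

    def __init__(self) -> None:
        self._stable: List[str] = []
        self.current: str = ""

    def deploy(self, version: str) -> None:
        self.current = version

    def mark_stable(self) -> None:
        # Record the running version as a safe rollback target.
        self._stable.append(self.current)

    def rollback(self) -> str:
        """Switch back to the most recent version marked stable."""
        if not self._stable:
            raise RuntimeError("no stable version recorded")
        self.current = self._stable[-1]
        return self.current
```

In practice the actual switch would be done by the deployment tooling; the point here is only that a rollback target must be recorded before it is needed.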


If the database is inaccessible, serve data from the cache instead
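
A sketch of this fallback, assuming the database read is a function that raises `ConnectionError` when the database cannot be reached. The `CachedReader` class and its `load` parameter are hypothetical, and the cached value may be stale:

```python
from typing import Callable, Dict, Optional


class CachedReader:
    """Serve the last known value when the primary store is unreachable."""

    def __init__(self, load: Callable[[str], str]) -> None:
        self._load = load                  # hypothetical database read function
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        try:
            value = self._load(key)
        except ConnectionError:
            # Database is down: fall back to possibly stale cached data.
            return self._cache.get(key)
        self._cache[key] = value           # refresh on every successful read
        return value
```

Keys that were never read successfully return `None` during an outage, so callers still need a sensible default.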


If too many calls to an external service fail, stop calling it for the time being
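
This behaviour is commonly implemented as a circuit breaker. The following is a minimal sketch, not a production implementation: the thresholds `max_failures` and `reset_after` are arbitrary assumptions, and the injectable `clock` exists only to make the example testable.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is refused because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("external service temporarily skipped")
            # Cool-down elapsed: allow one trial call; one failure reopens.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0                 # any success resets the count
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of waiting on a service that is already struggling.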

Rate limiting

Limit how often systems can call one another
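
Limiting access frequency can be sketched with a token bucket, which is one common technique among several. The `rate` and `capacity` parameters and the injectable `clock` are assumptions made for illustration:

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `capacity` calls in a burst, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A caller checks `allow()` before each request and rejects or queues the request when it returns `False`.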


Minimize the impact


Switch to a backup to keep providing service
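
In code, switching to a backup can be as simple as trying providers in order of preference. A minimal sketch, assuming each provider is a zero-argument callable that raises on failure (`call_with_failover` is a hypothetical helper):

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            return provider()
        except Exception as exc:
            last_error = exc               # remember why this provider failed
    raise RuntimeError("all providers failed") from last_error
```

Listing the primary first and the backup second keeps normal traffic on the primary and only touches the backup when the primary fails.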

Alerting

Raise an alarm when an exception occurs
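
As an illustration, a decorator can send a notification whenever the wrapped function raises, then let the exception propagate. The `notify` callback is an assumption: in a real system it might post to a pager or chat channel, while here it is any function that accepts a message string.

```python
import functools
from typing import Callable


def alarm_on_exception(notify: Callable[[str], None]):
    """Send a notification when the wrapped function raises, then re-raise."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                notify(f"{func.__name__} failed: {exc!r}")
                raise                      # the alarm does not swallow the error
        return wrapper
    return decorator
```

Re-raising keeps the failure visible to the caller; the alarm only adds a notification on top.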


Enhance system observability


Summarize the problem and how to prevent it in the future


Conduct adequate testing before release, including unit testing and integration testing