When a problem occurs in production, the first priority is not to pinpoint its root cause but to restore the system to a normal state and minimize the damage.

Here are some ways to prevent or resolve production problems (including but not limited to):

Automated testing

Perform automated tests periodically

Daily inspection

Checks can be performed manually or by the system itself

On-call duty

Make sure the right people are available to deal with problems promptly

Fault self-healing system

For a known type of problem, the manual recovery steps can be scripted so that detecting the failure triggers them automatically and the system heals itself
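
As a rough illustration of scripting a manual fix, the sketch below repeats a remediation step until a health check passes. The names `self_heal`, `check`, and `restart`, and the in-memory `state` dictionary, are hypothetical stand-ins for a real health probe and a real recovery command.

```python
from typing import Callable


def self_heal(is_healthy: Callable[[], bool],
              remediate: Callable[[], None],
              max_attempts: int = 3) -> bool:
    """Run the scripted fix until the health check passes or attempts run out."""
    for _ in range(max_attempts):
        if is_healthy():
            return True
        remediate()  # the step an operator would otherwise perform by hand
    return is_healthy()


# Hypothetical example: a "service" that recovers after one restart.
state = {"up": False, "restarts": 0}


def check() -> bool:
    return state["up"]


def restart() -> None:
    state["restarts"] += 1
    state["up"] = True


healed = self_heal(check, restart)
```

Capping the number of attempts keeps a broken remediation script from looping forever; when the cap is reached, a person still has to step in.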

Capacity expansion

Add resources when the existing capacity is insufficient

Restart

Some bugs occur only intermittently; a restart can quickly restore normal service

Rollback

Go back to the last stable version

Degradation

Fall back to reduced functionality; for example, if the database is inaccessible, serve data from the cache
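
The database-to-cache fallback can be sketched as follows. This is a minimal example that assumes the database client raises `ConnectionError` when it is unreachable; `CachedReader`, `load_user`, and the `db` flag are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Optional


class CachedReader:
    """Reads through to the database and falls back to cached values on failure."""

    def __init__(self, load_from_db: Callable[[str], str]) -> None:
        self._load_from_db = load_from_db
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        try:
            value = self._load_from_db(key)
        except ConnectionError:
            # Degraded mode: serve the last known value, which may be stale.
            return self._cache.get(key)
        self._cache[key] = value  # refresh the cache on every successful read
        return value


# Hypothetical database whose availability can be switched off.
db = {"available": True}


def load_user(key: str) -> str:
    if not db["available"]:
        raise ConnectionError("database unreachable")
    return f"user-{key}"


reader = CachedReader(load_user)
fresh = reader.get("42")      # read from the database and cached
db["available"] = False
stale = reader.get("42")      # database is down, served from the cache
missing = reader.get("7")     # never cached, so nothing can be served
```

The trade-off is that callers may receive stale data, or nothing at all for keys that were never cached, so this suits reads that tolerate staleness.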

Circuit breaking

If too many calls to an external service fail, stop calling it for a while
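
Temporarily stopping calls to a failing external service is commonly implemented as a circuit breaker. The sketch below is one minimal variant, not a production implementation: the class names, the default thresholds, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call skipped")
            # The timeout has elapsed, so let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0       # a success closes the circuit again
        self._opened_at = None
        return result


# Hypothetical usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0,
                         clock=lambda: now[0])


def flaky_service() -> str:
    raise ConnectionError("external service is down")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_service)
    except CircuitOpenError:
        outcomes.append("skipped")
    except ConnectionError:
        outcomes.append("failed")
```

After two consecutive failures the third call is skipped without touching the service. Once `reset_timeout` has passed, a single trial call decides whether the circuit closes or opens again.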

Rate limiting

Limit how frequently systems can call each other
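
One common way to limit access frequency is a token bucket, sketched below. The `TokenBucket` class, its parameters, and the fake clock are assumptions for illustration; real systems often enforce this at a gateway or proxy instead.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts of up to `capacity` calls, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill according to the time elapsed, never above capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# Hypothetical usage with a fake clock so the example is deterministic.
now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2.0, clock=lambda: now[0])
burst = [bucket.allow() for _ in range(3)]  # third call exceeds the burst size
now[0] += 1.0                               # one second later, one token is back
after_wait = bucket.allow()
```

Rejected callers would typically receive an error or be asked to retry later. This sketch is not thread-safe; a lock around `allow` would be needed for concurrent use.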

Isolation

Isolate the faulty part to minimize its impact on the rest of the system
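
One possible way to contain the impact is a bulkhead, which caps how many concurrent calls a single dependency may consume so that a slow dependency cannot exhaust every worker. The text above does not name a specific technique, so treat this as one illustrative option; `Bulkhead` and `BulkheadFullError` are hypothetical names.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(RuntimeError):
    """Raised when the compartment has no free slot."""


class Bulkhead:
    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func: Callable[[], T]) -> T:
        # Reject immediately instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this dependency")
        try:
            return func()
        finally:
            self._slots.release()


# Hypothetical usage: with one slot, a call made while another is in
# progress is rejected rather than left waiting.
bulkhead = Bulkhead(max_concurrent=1)


def nested_call() -> str:
    try:
        bulkhead.run(lambda: "inner")
    except BulkheadFullError:
        return "rejected"
    return "accepted"


result = bulkhead.run(nested_call)
```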

Backup/failover

Switch to a backup instance so that service continues
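
Switching to a backup can be sketched as trying each backend in order until one answers. This assumes a failed backend raises `ConnectionError`; `call_with_failover`, `primary`, and `backup` are hypothetical names.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Return the first successful result, trying backends in priority order."""
    last_error: Optional[Exception] = None
    for backend in backends:
        try:
            return backend()
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next backend
    raise RuntimeError("all backends failed") from last_error


# Hypothetical primary that is down and a backup that works.
def primary() -> str:
    raise ConnectionError("primary is down")


def backup() -> str:
    return "served by backup"


answer = call_with_failover([primary, backup])
```

This only helps if the backup is kept in sync and exercised regularly; a backup that has never served traffic may fail when it is finally needed.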

Alerting

Raise an alert when an exception occurs
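
A minimal sketch of raising an alert on an exception: a wrapper that reports the failure through a notification callback and then re-raises it. The `notify` callback stands in for whatever channel is actually used (email, chat, paging); `run_with_alert` and `broken_job` are hypothetical names.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_alert(task: Callable[[], T], notify: Callable[[str], None]) -> T:
    """Run a task; on any exception, send an alert and re-raise."""
    try:
        return task()
    except Exception as exc:
        name = getattr(task, "__name__", "task")
        notify(f"{name} failed: {exc!r}")
        raise  # alerting must not hide the original failure


# Hypothetical usage: alerts are collected in a list instead of being sent.
alerts: List[str] = []


def broken_job() -> None:
    raise ValueError("bad input")


try:
    run_with_alert(broken_job, alerts.append)
except ValueError:
    pass
```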

Monitoring

Enhance system observability

Analysis

Summarize what went wrong and how to prevent it in the future

Testing

Conduct adequate testing prior to release, including unit testing and integration testing