When a production incident occurs, the first priority is not to pinpoint the root cause, but to restore the system to a normal state and minimize the damage.
Here are some ways to prevent or resolve production incidents (including but not limited to):
Automated testing
Perform automated tests periodically
Daily inspection
Checks can be performed manually or automatically by the system itself
On-call duty
The appropriate people are available to deal with problems promptly
Fault self-healing system
For a known type of problem, the manual recovery steps can be scripted so that the corresponding failure triggers them automatically and the system heals itself
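As a minimal sketch of scripting a human's recovery steps, the following maps an alert type to a recovery function. The alert fields (`type`, `service`, `host`), the `RUNBOOK` table, and both actions are hypothetical placeholders that only return a description of what they would do; they are not taken from any particular system.

```python
def restart_service(alert):
    # Placeholder: a real action would call the process manager or orchestrator.
    return f"restart {alert['service']}"


def clean_temp_files(alert):
    # Placeholder: a real action would free disk space on the affected host.
    return f"clean temp files on {alert['host']}"


# Hypothetical runbook: alert type -> scripted version of the manual fix.
RUNBOOK = {
    "service_down": restart_service,
    "disk_full": clean_temp_files,
}


def heal(alert, runbook=RUNBOOK):
    """Run the scripted fix for a known alert type; return None if unknown."""
    action = runbook.get(alert["type"])
    if action is None:
        return None  # no script for this failure: leave it to the on-call person
    return action(alert)
```

Unknown alert types deliberately fall through to a person, so only failures with a well-understood fix are handled automatically.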
Capacity expansion
Add capacity when the existing capacity is insufficient
Restart
Some bugs occur only intermittently; a restart can quickly restore normal service
Rollback
Roll back to the last stable version
Degradation
If the database is inaccessible, serve from the cache instead
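The fallback to the cache can be sketched as follows. This is an illustration under stated assumptions: `DatabaseUnavailable`, `read_with_fallback`, the `query_db` callable, and the dict-like `cache` are hypothetical names, and a real cache would also need an expiry policy.

```python
class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) database client when it cannot be reached."""


def read_with_fallback(key, query_db, cache):
    """Return (value, degraded) for key, preferring the database."""
    try:
        value = query_db(key)
    except DatabaseUnavailable:
        # Degraded path: serve the last cached value, which may be stale.
        if key in cache:
            return cache[key], True
        raise  # nothing cached, so the failure cannot be hidden
    cache[key] = value  # refresh the cache on every successful read
    return value, False
```

Returning the `degraded` flag lets the caller decide whether stale data is acceptable for that request.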
Circuit breaking
If too many calls to an external service fail, stop calling it for a while
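One common way to implement this is a circuit breaker. The sketch below is a simplified version: the class name, the `max_failures` and `reset_after` parameters, and their default values are assumptions chosen for illustration, and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("external service calls are suspended")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open the circuit on too many failures, or reopen after a failed trial.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each outbound call as `breaker.call(fetch, url)` (where `fetch` is whatever client function is in use) means that, once the circuit opens, callers fail fast instead of waiting on a service that is already struggling.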
Rate limiting
Limit the frequency of calls between systems
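A token bucket is one common way to limit call frequency; the source does not name a specific algorithm, so this choice is an assumption. In this sketch, `rate` is the number of tokens refilled per second and `capacity` is the largest allowed burst; both names are illustrative, and the class is not thread-safe.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True and consume a token if a call is allowed right now."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time elapsed, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller checks `bucket.allow()` before each request and rejects or queues the request when it returns `False`.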
Isolation
Contain a failure so that its impact is minimized
Backup/failover
Switch to a backup to keep providing service
Alerting
An alert is raised when an exception occurs
Monitoring
Enhance system observability
Analysis
Summarize the problems and work out how to prevent them in the future
Testing
Conduct adequate testing prior to release, including unit testing and integration testing