What to do when you receive an online problem
First rule for handling online problems: ensure the availability of online services and minimize the impact of the fault before anything else.
Handling process for mobile personnel:
- Announce in the problem group as soon as possible that you are following up on the problem
- Evaluate the scope of impact, confirm the severity of the problem, consider whether to roll back or follow the degradation process, and report the results to the R&D owner
Current monitoring platforms: the Prajna real-time data monitoring platform, the CAT real-time error log platform, and the GALAXY log platform.
Prajna shows data in real time. Ideally each indicator has an associated field value, so you can check the abnormal data of the current day.
CAT shows the error data of the current day and lets you compare it with the same period on the previous day and on the same day last week to see whether errors are abnormal.
GALAXY collects the data of the previous day. You can pull the same-period data from the previous day and from the same day last week to approximate the data of the current day.
- Analyze and locate the problem and update the progress in the problem group
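The comparison against the previous day and the same day last week can be sketched as a simple baseline check. This is a minimal illustration only: the `ErrorCounts` shape, the `ratio_threshold` of 2.0, and the `min_errors` floor are all assumptions, not values taken from Prajna, CAT, or GALAXY.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    """Error counts for the same time window on three days (hypothetical shape)."""
    today: int
    previous_day: int
    same_day_last_week: int


def is_abnormal(counts: ErrorCounts,
                ratio_threshold: float = 2.0,
                min_errors: int = 10) -> bool:
    """Flag today's count when it exceeds both baselines by ratio_threshold.

    Both thresholds are illustrative assumptions and should be tuned per metric.
    """
    if counts.today < min_errors:
        return False  # too few errors to draw a conclusion
    # Use the larger of the two baselines; floor at 1 to avoid a zero baseline.
    baseline = max(counts.previous_day, counts.same_day_last_week, 1)
    return counts.today >= ratio_threshold * baseline
```

Taking the larger of the two baselines keeps a normal weekly peak (for example, a weekend) from being reported as an anomaly just because the previous day was quiet.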
Analyzing and locating problems:
The following operations need to be synchronized to the problem group:
How data can be used to help locate problems
1. Check the number and content of each error log on the project panel of Prajna or CAT
2. If the user's token is available, retrieve that user's records from the log system (Prajna, CAT, or the back-end logs) to reconstruct the process and scenario in which the problem occurred
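Reconstructing a user's session from a token can be sketched as follows, assuming the logs have been exported as JSON lines. The field names `token`, `timestamp`, and `message` are hypothetical; the real export format of Prajna, CAT, or the back-end logs may differ.

```python
import json
from typing import Iterable


def user_timeline(log_lines: Iterable[str], token: str) -> list:
    """Return the log entries for one user token, ordered by timestamp.

    Assumes one JSON object per line with hypothetical fields
    'token' and 'timestamp'.
    """
    entries = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(entry, dict) and entry.get("token") == token:
            entries.append(entry)
    # Chronological order makes the sequence of user actions readable.
    return sorted(entries, key=lambda e: e.get("timestamp", 0))
```

Reading the resulting entries in order shows which requests the user made before the error appeared.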
Problem handling process for the R&D owner:
- Decide whether to roll back or degrade based on the scope of impact reported by mobile personnel, and inform product and QA
- Assign personnel to analyze and locate the problem, and synchronize the staffing to the problem group
- Carry out the fault escalation process. If the designated troubleshooter cannot locate the fault within 10 minutes, have all troubleshooters work on it. If all of them cannot locate it within another 10 minutes, the R&D owner reports the current situation to upper management
- After the problem is located and resolved, prepare to bring the release back online.
  A. If the rollback scheme was adopted, have QA regression-test the bugfix branch, set a date for resubmission, and follow the normal release process.
  B. If the degradation process was adopted, notify QA to run regression tests after online recovery, and have the developer keep observing abnormal data on the monitoring system for half an hour. If the data is stable, the problem is confirmed as solved.
- After the online problem is solved, arrange for the relevant personnel to complete a case study and follow-up.
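The escalation rule above can be expressed as a small helper. This sketch assumes the second 10-minute window starts when all troubleshooters join, so the report to upper management falls at 20 minutes in total. That reading, and the names `Escalation` and `escalation_level`, are illustrative assumptions.

```python
from enum import Enum


class Escalation(Enum):
    DESIGNATED_TROUBLESHOOTER = "designated troubleshooter investigates"
    ALL_TROUBLESHOOTERS = "all troubleshooters investigate"
    UPPER_MANAGEMENT = "R&D owner reports to upper management"


def escalation_level(minutes_since_assignment: float,
                     step_minutes: float = 10) -> Escalation:
    """Map time spent without locating the fault to an escalation level."""
    if minutes_since_assignment < step_minutes:
        return Escalation.DESIGNATED_TROUBLESHOOTER
    if minutes_since_assignment < 2 * step_minutes:
        return Escalation.ALL_TROUBLESHOOTERS
    return Escalation.UPPER_MANAGEMENT
```

For example, `escalation_level(12)` returns `Escalation.ALL_TROUBLESHOOTERS`.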
Troubleshooting common mini program problems
The login prompt “service error” appears on the navigation page, payment confirmation page, or shopping cart page. Check whether the HTTP status code of the login request is 403. If it is 403, the login request was blocked by risk control. Otherwise, the account center returned an exception.
The payment prompt “This transaction is abnormal, please try again later” appears. The user has been flagged by WeChat risk control.
Cache problem: data that should no longer appear is still shown. If the data is still there when a new page is opened within 5 minutes of closing the previous one, it is likely a WeChat cache problem.
The payment prompt “service error” appears. This is likely a health check problem. Ask the back-end team to check the related logs.
The message “WeChat login failed” is displayed. Access to the account center has timed out.
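The symptoms listed above can be kept as a small lookup so that the first check is at hand during an incident. This is only a sketch: the stage names and prompt fragments are hypothetical English renderings, and the real prompt strings shown in the mini program may differ.

```python
from typing import Optional

# (stage, lowercase fragment of the prompt, first thing to check)
TRIAGE_HINTS = [
    ("login", "service error",
     "Check whether the login request returned HTTP 403 (risk control)."),
    ("payment", "this transaction is abnormal",
     "The user is likely flagged by WeChat risk control."),
    ("payment", "service error",
     "Likely a health check problem; ask the back end to check related logs."),
    ("login", "wechat login failed",
     "Access to the account center likely timed out."),
]


def triage_hint(stage: str, prompt: str) -> Optional[str]:
    """Return the first matching hint for a stage and prompt, or None."""
    text = prompt.lower()
    for hint_stage, fragment, hint in TRIAGE_HINTS:
        if hint_stage == stage and fragment in text:
            return hint
    return None
```

The stage is part of the key because the same “service error” prompt points to different causes at login and at payment.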
Communication methods during troubleshooting
When troubleshooting with a third party, communicate information promptly and completely. In addition to describing the symptoms, you must also give the third party your analysis data so that they can get up to speed on the problem as quickly as possible.
(In the case shown in the figure, our analysis data was not passed directly to the third party, nor was it communicated at the first opportunity.)
If the third party does not reply, call them directly so the online problem can be solved more quickly. This is necessary for both the business and the third party, so do not hesitate.
If the third party's contact person cannot give adequate feedback, contact the team leader promptly to find someone familiar with the business as soon as possible.