Author’s brief introduction
The author of "The Way of Massive Operations and Operational Planning". On massive operations and operational planning, I think the industry has no precise definition. If an Internet architect's ability can be measured by how tall a skyscraper they can design, then operations focuses more on the quality, efficiency, and cost of Internet services, along with failures, bottlenecks, users' patience, complaints, and similar issues.
In upcoming installments, with quality, efficiency, and cost at the core, I will share my experience covering operations planning, management, processes and specifications, systems and platforms, monitoring, alerting, security, optimization, assessment, and related cases. The content is roughly as follows.
Editor's note: A good policy is practical and enforceable, not something that exists only on paper. Every company's situation is different, and a policy needs to be revised regularly to fit the company's own circumstances. The following article is a policy template for reference only, and it must be adapted before it is used.
Main text
Internet products provide 7x24-hour service, but service outages caused by human operating errors and program bugs are major factors affecting continuous operation. To improve the operational quality of business products and to standardize service and fault response across business lines, it is necessary to formulate and publish a "Fault Classification and Penalty Specification".
Fault classification standard
In operations, any service failure not caused by force majeure is classified as a "fault". For every fault, the fault level, the responsible person, and the handling result must be determined. The following defines and explains each fault level. Because a fault may affect several aspects at once, the overall fault level is the most severe level found in any single aspect, as shown below.
Fault level table
Fault category | Level | Fault description |
Business availability | Level 1 fault | Services are interrupted for more than 8 hours |
Business availability | Level 2 fault | Services are interrupted for 2 to 8 hours |
Business availability | Level 3 fault | Services are interrupted for 1 to 2 hours and core services are unavailable |
Business availability | Level 4 fault | Services are interrupted for less than 1 hour and core service functions are affected |
Business availability | Level 5 fault | Services are interrupted for less than 1 hour and minor service functions are unavailable |
Business security | Level 1 fault | System intrusion: the core business is compromised, core user data is breached, or system files are maliciously tampered with, making it easy for the intrusion to spread. Page tampering: the portal homepage is illegally altered with seriously harmful content. CGI vulnerability: the vulnerability has been widely discussed and spread among users and used to harm the company's brand, or has caused direct economic loss |
Business security | Level 2 fault | System intrusion: the core business is compromised without endangering important data; the intrusion could spread, but no other machines have been found compromised. Page tampering: business pages are illegally altered or defaced as a prank. CGI vulnerability: discovered by outsiders but has not yet caused a major crisis or economic loss |
Business security | Level 3 fault | System intrusion: core services expose high-risk ports or system vulnerabilities. CGI vulnerability: a core system vulnerability discovered internally that has not caused a major crisis or economic loss |
Business security | Level 4 fault | System intrusion: non-core services expose high-risk ports or system vulnerabilities. CGI vulnerability: an ordinary system vulnerability discovered internally that has not yet caused a major crisis or economic loss |
Business security | Level 5 fault | Hidden danger: vulnerabilities exist but have had no significant consequences |
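The availability rows of the table and the "most severe level wins" rule can be sketched in code. This is only an illustrative sketch, not part of the specification: the names `FaultLevel`, `availability_level`, and `overall_level` are hypothetical, and the handling of exact boundaries (exactly 1, 2, or 8 hours) and of 1 to 2 hour outages that do not affect core services is an assumption, because the table does not state it.

```python
from enum import IntEnum
from typing import Iterable


class FaultLevel(IntEnum):
    """Level 1 (primary) is the most severe, level 5 the least."""
    LEVEL_1 = 1
    LEVEL_2 = 2
    LEVEL_3 = 3
    LEVEL_4 = 4
    LEVEL_5 = 5


def availability_level(outage_hours: float, core_affected: bool) -> FaultLevel:
    """Map an outage to an availability fault level.

    Assumptions (not stated in the table): exactly 8 hours and exactly
    2 hours count as level 2, exactly 1 hour counts as level 3, and a
    1 to 2 hour outage is level 3 even if only minor functions are hit.
    """
    if outage_hours < 0:
        raise ValueError("outage_hours must be non-negative")
    if outage_hours > 8:
        return FaultLevel.LEVEL_1
    if outage_hours >= 2:
        return FaultLevel.LEVEL_2
    if outage_hours >= 1:
        return FaultLevel.LEVEL_3
    # Under 1 hour: core impact is level 4, minor impact is level 5.
    return FaultLevel.LEVEL_4 if core_affected else FaultLevel.LEVEL_5


def overall_level(levels: Iterable[FaultLevel]) -> FaultLevel:
    """Overall level is the most severe one, i.e. the smallest number."""
    return min(levels)
```

For example, a 30-minute outage of a core function that also involved a level 2 security fault would be rated `overall_level([availability_level(0.5, True), FaultLevel.LEVEL_2])`, which is level 2.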
Fault reward and punishment system
The fault handling evaluation is a comprehensive assessment of how a fault was handled, based on the responsible person's response, handling process, final result, and other factors. The department adjusts the fault penalty level according to this evaluation. This rating applies only to penalty levels determined within the department and is not governed by the company's penalty regulations. If the downgrade criteria below are met, the penalty level may be lowered as appropriate, at the department leader's discretion. The criteria for downgrading and upgrading are shown below.
Fault level adjustment table
Assessment item | Downgrade criteria | Upgrade criteria |
Response time | Responded immediately, including fault notification, handling, and follow-up work | The responsible person still failed to handle the fault promptly after repeated urging by relevant personnel |
Preparedness | Adequate preventive mechanisms were in place against the cause of the fault | Known problems or basic mistakes were not prevented or avoided |
Handling attitude and ability | Handled the fault as quickly as possible and actively cooperated with others in troubleshooting; actively sought solutions and resources for technical problems | Did not take the fault seriously, was negligent or perfunctory; or lacked the skills needed to troubleshoot the fault |
Handling result | The system returned to normal operation in the shortest possible time and the impact of the fault was minimized | The fault was not fully resolved; or untimely or improper handling expanded its impact (scope, amount, complaints, negative public opinion, etc.) |
Follow-up | Summarized the cause of the fault and defined preventive and avoidance measures for similar faults | Refused to summarize the cause of the fault (other than force majeure) or to define preventive and avoidance measures |
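As a rough illustration of how the five assessment items above could be recorded for review, the sketch below tallies which items met the downgrade or upgrade criteria. It deliberately does not compute a new level, since the text leaves that decision to the department leader. The item keys, the outcome labels, and the function `summarize_assessment` are all hypothetical names chosen for this example.

```python
from collections import Counter

# Hypothetical keys paraphrasing the five rows of the adjustment table.
ITEMS = (
    "response time",
    "preparedness",
    "handling attitude and ability",
    "handling result",
    "follow-up",
)
OUTCOMES = {"downgrade", "upgrade", "neutral"}


def summarize_assessment(assessment: dict) -> Counter:
    """Count how many items met the downgrade or upgrade criteria.

    Items missing from `assessment` are treated as "neutral" (an
    assumption). The result is input for the department leader's
    decision, not a decision in itself.
    """
    unknown_items = set(assessment) - set(ITEMS)
    if unknown_items:
        raise ValueError(f"unknown assessment items: {sorted(unknown_items)}")
    bad_outcomes = set(assessment.values()) - OUTCOMES
    if bad_outcomes:
        raise ValueError(f"unknown outcomes: {sorted(bad_outcomes)}")
    return Counter(assessment.get(item, "neutral") for item in ITEMS)
```

A reviewer might call `summarize_assessment({"response time": "downgrade", "follow-up": "upgrade"})` and pass the resulting counts, together with the fault report, to the department leader.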
For operational faults at every level, if the main cause is human negligence or error, the individuals and teams involved are penalized according to the standards below. Any operational fault must be reported to the relevant leaders or handling personnel; delaying or concealing a fault will be severely punished. Fault levels and their penalties are shown below.
Fault penalty table
Level | Personal penalty |
Level 1 fault | The company-level fault penalty applies (company-wide notice, up to dismissal) |
Level 2 fault | The company-level fault penalty applies (company-wide notice, up to dismissal) |
Level 3 fault | Notice of criticism circulated to all product lines and related groups, plus a fine of 2,000 yuan |
Level 4 fault | Notice of criticism circulated to all product lines and related groups, plus a fine of 1,000 yuan |
Level 5 fault | Notice of criticism circulated to all product lines and related groups |
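The penalty table above is a plain lookup from level to penalty, which a team might encode roughly as follows. This is a minimal sketch: `PENALTIES`, `FINES_YUAN`, and `penalty_for` are hypothetical names, and the penalty wording is a paraphrase of the table rather than official text.

```python
from typing import Optional, Tuple

# Paraphrased from the penalty table; levels 1 and 2 defer to company rules.
PENALTIES = {
    1: "company-level fault penalty applies (company-wide notice, up to dismissal)",
    2: "company-level fault penalty applies (company-wide notice, up to dismissal)",
    3: "notice of criticism to all product lines and related groups, plus fine",
    4: "notice of criticism to all product lines and related groups, plus fine",
    5: "notice of criticism to all product lines and related groups",
}

# Only levels 3 and 4 have a fixed fine stated in the table.
FINES_YUAN = {3: 2000, 4: 1000}


def penalty_for(level: int) -> Tuple[str, Optional[int]]:
    """Return (penalty description, fixed fine in yuan or None).

    None means the table states no fixed fine for that level.
    """
    if level not in PENALTIES:
        raise ValueError(f"fault level must be 1 to 5, got {level}")
    return PENALTIES[level], FINES_YUAN.get(level)
```

Keeping the fines in a separate mapping makes it easy to revise the amounts when the policy is adapted, without touching the descriptions.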