About the author

Wang Yang is an infrastructure architect in the information technology department of a fund company. He previously worked in the financial cloud department of Ant Financial and in the IT departments of commercial banks. Areas of expertise: planning and building cloud computing IaaS and PaaS platforms; high-availability, high-performance, and disaster-recovery design for infrastructure; containers (Docker) and microservices.

Not long ago, I shared my experience with enterprise automated operation and maintenance (O&M) and compared the available tools. Much of that material contrasted the practices of first-tier Internet companies with those of traditional industries, based on my own practical experience: How can automated O&M be built as a coherent whole? How can it be understood and constructed from a methodological perspective? The article prompted many reactions and reflections from readers.

This article summarizes a series of specific questions about automated O&M raised by O&M enthusiasts, together with the results of the discussion, for your reference.

I. Risks of the automated O&M platform

Question 1: How can the risks of automated O&M be controlled?

First, every automated function module ultimately comes down to code, so the code behind each automated O&M function must be tested, and its development should follow a project management process.

Second, operations that delete or modify resources need a double check and a rollback scheme, including a plan for operations that cannot be rolled back (in this respect automation is no different from manual operation).

Third, use a gray release strategy: apply the change to a small scope first and verify that the automated results match expectations. If they do, continue; if not, roll back.

Fourth, coordinate with monitoring, so that the monitoring system detects problematic operations promptly and raises alarms.

Fifth, manage permissions: strict control is required over who can operate the automated O&M platform.

Sixth, systems connected through APIs need an authentication mechanism.
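The gray (canary-style) strategy and the rollback scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name gray_rollout and the apply, verify, and rollback callbacks are hypothetical placeholders for whatever the platform actually does, not part of any real tool.

```python
from typing import Callable, Iterable, List


def gray_rollout(
    hosts: Iterable[str],
    apply: Callable[[str], None],
    verify: Callable[[str], bool],
    rollback: Callable[[str], None],
    canary_count: int = 1,
) -> List[str]:
    """Apply a change to a few gray (canary) hosts first, then to the rest.

    Returns the hosts that were changed and verified. If any host fails
    verification, every host changed so far is rolled back (newest first)
    and a RuntimeError is raised. All callbacks are hypothetical.
    """
    host_list = list(hosts)
    changed: List[str] = []

    def run_batch(batch: List[str]) -> None:
        for host in batch:
            apply(host)
            changed.append(host)
            if not verify(host):
                # Result differs from expectation: undo everything done so far.
                for done in reversed(changed):
                    rollback(done)
                raise RuntimeError(f"verification failed on {host}")

    run_batch(host_list[:canary_count])  # gray stage: small scope first
    run_batch(host_list[canary_count:])  # continue only if the gray stage passed
    return changed
```

A failure in the gray stage leaves the remaining hosts untouched, which is the point of verifying on a small scope before continuing.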

Question 2: How can security and permissions be controlled on the automated O&M platform?

In my view, attention should be paid to the following aspects:

  • For web page operations, control access by assigning AD domain roles;

  • For interface calls, there needs to be a corresponding permission module;

  • For the O&M platform itself, prevent the platform from deleting or modifying production resources without authorization;

  • Run security scans on the platform periodically to find vulnerabilities.
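As a rough illustration of a permission module that also guards production resources, consider the sketch below. The role names, the action names, and the approved flag are all assumptions made for the example; a real platform would take its roles from the AD domain and its approvals from a workflow.

```python
# Hypothetical role-to-action mapping; real roles would come from AD.
ROLE_ACTIONS = {
    "viewer": {"read"},
    "operator": {"read", "modify"},
    "admin": {"read", "modify", "delete"},
}

# Actions that change or destroy resources.
DESTRUCTIVE_ACTIONS = {"modify", "delete"}


def is_allowed(role: str, action: str, environment: str, approved: bool = False) -> bool:
    """Return True if the role may perform the action in the environment.

    Destructive actions on production additionally require an explicit
    approval, so the platform cannot change production without authorization.
    """
    if action not in ROLE_ACTIONS.get(role, set()):
        return False
    if environment == "production" and action in DESTRUCTIVE_ACTIONS:
        return approved
    return True
```

The same check can sit behind both the web pages and the API, so the two entry points cannot drift apart.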

II. Planning the automated O&M platform

Question 1: How should the construction of automated O&M be planned?

There is no fixed answer to this question. The steps depend on the specific context, and the ultimate goal is full end-to-end delivery. Generally speaking, the work can be divided into the following stages:

  • Solve the most urgent pain points (usually the biggest pain points of the O&M team, or a long-standing backlog of problems from other teams);

  • Collect automated O&M requirements from the other IT teams (development and testing) and schedule them for resolution internally;

  • Once the problems in the first two stages are solved, connect the individual points in series to eliminate the manual work between them;

  • With the automated O&M chain initially formed, check for and fill gaps, forming a positive feedback loop.

Question 2: How should standards be formulated when building automated O&M?

Standardization has to fit the specific situation of the company. Generally speaking, the following aspects need to be standardized (for reference):

  • Server pod standardization: how many machines go in a pod and how they are connected;

  • Physical machine models: whether compute-intensive, memory-intensive, or I/O-intensive, the models from different manufacturers need to be consolidated into a few standard models;

  • Operating system standardization, including operating system version, kernel parameters, drive letters and paths, etc.;

  • Software installation standardization, including software version, installation path, log path, log rotation, parameter tuning, etc.;

  • Software deployment standardization: two nodes of the same software cluster must not be deployed on the same physical machine or in the same cabinet, to avoid host-level and rack-level faults.
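The deployment rule in the last item is easy to check mechanically once placements are recorded. The sketch below is a minimal example under assumptions: the Placement record and its field names are invented for illustration, and the input is taken to describe the nodes of a single cluster.

```python
from typing import Dict, List, NamedTuple


class Placement(NamedTuple):
    """Where one node of a cluster runs (hypothetical record layout)."""
    node: str
    host: str
    rack: str


def placement_violations(placements: List[Placement]) -> List[str]:
    """List every host or rack that carries more than one node."""
    violations: List[str] = []
    for label in ("host", "rack"):
        groups: Dict[str, List[str]] = {}
        for placement in placements:
            groups.setdefault(getattr(placement, label), []).append(placement.node)
        for location, nodes in groups.items():
            if len(nodes) > 1:
                violations.append(f"{label} {location}: {', '.join(nodes)}")
    return violations
```

An empty result means the placement respects both the host-level and the rack-level rule.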

Question 3: In a real O&M environment, how should we develop a complete automated O&M management plan to support automated O&M work?

To formulate an automated O&M plan, consider the following aspects:

  • Make the purpose of the plan clear; this is the guiding principle for developing it;

  • Define the roles that the plan serves;

  • Define the levers available to each role in the automated O&M process;

  • Identify the security issues that need attention during implementation, such as fine-grained permissions, call authentication, and operation auditing;

  • Survey colleagues to further understand their O&M needs;

  • State clearly in the plan how many stages the construction of the platform is divided into, and distribute the requirements across those stages;

  • Define how the plan will be implemented as a platform: in-house development, outsourcing, or secondary development on top of an outsourced product;

  • Specify the positive feedback process for the platform once it is in use.

Question 4: In how many stages should automated O&M be built, and how should it be planned?

There is no fixed answer to this question. The steps depend on the specific context, and the ultimate goal is full end-to-end delivery. Generally speaking, the work can be divided into the following stages:

  • Address the most pressing pain points;

  • Collect automated O&M requirements from the other groups in the IT department (development and testing teams);

  • Once the problems in the first two stages are solved, connect the individual points in series to eliminate the manual work between them;

  • With the automated O&M chain initially formed, check for and fill gaps.

III. CMDB data collection

Question 1: How can automatic discovery be implemented when building a CMDB?

CMDB automatic discovery generally relies on the following methods:

  • Calling the API of the software being collected, such as VMware or EMC storage;

  • Obtaining configuration information through a protocol (public or private), such as SNMP;

  • Executing commands on the host and processing the results, for example to capture information about the middleware running on the host;

  • Executing commands provided by the middleware itself.

Automatic discovery is generally implemented by combining these methods.
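To make the "execute a command and process the result" methods concrete, here is a minimal sketch that turns the text returned by the Redis INFO command into a CMDB record. It assumes the output has already been captured as a string; the function names and the shape of the configuration item are hypothetical, and how the command is run on the host is left out.

```python
from typing import Any, Dict


def parse_info_output(raw: str) -> Dict[str, str]:
    """Parse 'key:value' lines, skipping blank lines and '# Section' headers."""
    facts: Dict[str, str] = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            facts[key] = value
    return facts


def to_config_item(hostname: str, facts: Dict[str, str]) -> Dict[str, Any]:
    """Build a configuration item (hypothetical schema) from parsed facts."""
    return {
        "ci_type": "redis",
        "host": hostname,
        "version": facts.get("redis_version", "unknown"),
        "port": int(facts.get("tcp_port", "0")),
    }


sample = "# Server\r\nredis_version:6.2.6\r\ntcp_port:6379\r\n\r\n# Clients\r\nconnected_clients:3\r\n"
item = to_config_item("cache-01", parse_info_output(sample))
```

The same parse-then-map pattern applies to API responses and SNMP results; only the collection step changes.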

Question 2: How should a CMDB be chosen for automatic data collection when building automated O&M?

This question is rather broad. On data collection specifically, two aspects must be considered in order to populate the CMDB comprehensively. The first is the automatic collection capability of the CMDB collection tool itself. The second is that some data has to be entered manually through a process, such as the name of the business system and the people responsible for its operation, development, and testing; automatic collection tools cannot gather this data, so it must be maintained by hand.

If you need to build a CMDB system, there are three approaches:

  • Fully in-house development, which requires a team with strong development ability and people who understand ITIL processes; automatic collection capability is slow to build;

  • Purchasing a commercial CMDB product: the advantages are a fast launch and strong automatic collection; the disadvantage is that some requirements may not be met out of the box and need custom development;

  • Secondary development on an open source product, such as IOP: the advantage is that a basic usable framework already exists, but automatic discovery still has to be implemented yourself.

Question 3: How can CMDB data be kept both real-time and consistent?

  • Real-time: how current the CMDB data is depends on the automatic collection capability of the CMDB tools.

  • Consistency: consistency requires process control and regular data audits, both of which can build on the capabilities of the CMDB platform.
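A regular data audit can be as simple as comparing what the CMDB records with what automatic discovery currently sees. The following sketch assumes both sides are available as dictionaries keyed by configuration item ID; the function name and the report keys are made up for illustration.

```python
from typing import Any, Dict

Record = Dict[str, Any]


def audit_cmdb(recorded: Dict[str, Record], discovered: Dict[str, Record]) -> Dict[str, Any]:
    """Compare CMDB records with freshly discovered data.

    Only fields present in the discovered data are compared, so manually
    maintained fields (for example the owner) do not count as drift.
    """
    unrecorded = sorted(set(discovered) - set(recorded))
    undiscovered = sorted(set(recorded) - set(discovered))
    drifted: Dict[str, Dict[str, Any]] = {}
    for ci_id in sorted(set(recorded) & set(discovered)):
        diffs = {
            field: (recorded[ci_id].get(field), value)
            for field, value in discovered[ci_id].items()
            if recorded[ci_id].get(field) != value
        }
        if diffs:
            drifted[ci_id] = diffs
    return {"unrecorded": unrecorded, "undiscovered": undiscovered, "drifted": drifted}
```

Each category would then feed a process step: register the unrecorded items, investigate the undiscovered ones, and correct the drifted fields.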

IV. Selection of O&M tools

Question 1: What factors should be considered when selecting automated O&M tools?

When selecting automated O&M tools, I believe the following aspects should be considered:

  • The maturity of the tool, that is, how widely it is used in the industry. Both commercial and open source tools can be evaluated from this perspective;

  • Whether the tool's functions meet the O&M requirements;

  • For open source tools, whether the tool's technology stack matches the skills of the company's staff;

  • Whether the tool is well supported in terms of security;

  • The tool's impact on host performance while it works, and in particular the load on the server side of the tool platform under high concurrency, which should be tested;

  • Whether the chosen tool can cover the technology stack the company runs.

Question 2: How should O&M tools be planned and integrated when building automated O&M?

At present, most companies do have the problem of scattered, overlapping O&M tools. In my opinion, the main cause is the lack of an overall plan in the early stage: each organization acted independently, with no overall management.

So what can be done about the status quo? In my opinion, the following steps should be taken:

  • Set up a governance team that includes the owner of each existing system, headed by a leader;

  • Have each system owner explain why the system was originally built, which problems it solves today, and which problems remain unsolved;

  • Based on the discussion in the second step, merge the systems that can be merged; for systems that cannot be merged but have overlapping functions, connect their data and present a unified output;

  • Require that any new system be planned by the governance team, to avoid a repeat of the same situation.

Question 3: How should automated O&M products be chosen?

Automated O&M covers a wide range of areas, including resource self-service, monitoring, task scheduling, and application release. The following points need to be considered when choosing products:

  • Sort out your own pain points, that is, the most urgent problems to be solved at present;

  • Planning: decide what you want to have achieved in three years;

  • The maturity of the candidate platform (how many reference cases it has in the industry);

  • The openness of the platform: whether it supports secondary development or functional extension;

  • Whether the technical framework of the platform is a mainstream one;

  • Run a trial to test how well the product fits your own environment.

V. Other

Question 1: What is the relationship between AIOps and automated O&M?

AIOps is a part of automated O&M that has emerged with the popularity of AI in recent years. Automation covers every aspect of O&M work, whereas AIOps only applies AI technology to the existing Ops platform, generally together with big data technology.

Question 2: Can advanced technologies such as cloud computing and big data be combined to make automated O&M more efficient and intelligent?

Combining with cloud computing allows the service capacity of the automated O&M platform to expand rapidly. Combining big data with artificial intelligence lets the platform provide more powerful functions; this is the AIOps that many people have begun to pay attention to.

The risks need to be checked manually. For example, when certain operations are triggered automatically on the basis of big data and AI, a manual double check is required when the technology is first adopted, and operations should be classified by priority and importance. Operations of low priority and low importance can then be processed automatically.

Question 3: How does the focus of O&M differ between traditional enterprises and Internet enterprises?

The differences between traditional industries and the Internet industry in O&M are as follows:

  • O&M as code: O&M in traditional industries still relies mostly on manually operated platforms or even purely manual work, while Internet companies carry out O&M mostly through code and avoid manual operation. This is why Internet companies require development skills of their O&M staff;

  • Points versus lines: traditional enterprises have purchased many O&M platforms at different times, but each platform is independent and discrete. Internet O&M platforms are mostly linear: they are connected in series and can achieve end-to-end delivery;

  • Different requirements for personnel: at every level of O&M, Internet companies require some development ability or an in-depth, code-level understanding of the underlying principles, while traditional industries put more emphasis on operational skills.

Question 4: How can the automated O&M platform get closer to the business and promptly identify business risks that have occurred or are about to occur?

First, collect the business side's automated O&M requirements and meet them through the platform.

Second, monitor the business systems. On that basis, agree on risk indicators with the business side, quantify them, and configure them in the monitoring system of the automated O&M platform. Using the platform's 7×24-hour monitoring capability, send an alarm by text message, WeChat, or email whenever an indicator reaches its alarm threshold.

Finally, the configuration of risk indicators can be gradually improved by combining big data analysis and AI, forming a positive feedback loop suited to each business system.
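The quantified risk indicators and alarm thresholds described in the second step can be pictured with a short sketch. The RiskIndicator class, the indicator names, and the notify callback are assumptions for illustration; in practice the notification would go out through the text message, WeChat, or email channels of the monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class RiskIndicator:
    """A quantified business risk indicator (hypothetical structure)."""
    name: str
    threshold: float


def check_indicators(
    indicators: Iterable[RiskIndicator],
    readings: Dict[str, float],
    notify: Callable[[str], None],
) -> List[str]:
    """Send one alarm per indicator whose reading reaches its threshold.

    Returns the names of the breached indicators. Indicators with no
    reading are skipped rather than treated as breached.
    """
    breached: List[str] = []
    for indicator in indicators:
        value = readings.get(indicator.name)
        if value is not None and value >= indicator.threshold:
            notify(f"{indicator.name}={value} reached threshold {indicator.threshold}")
            breached.append(indicator.name)
    return breached
```

Run on a schedule, a check like this gives the business side an early signal instead of a report after the fact.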

Question 5: What is the difference between traditional IT operations and automated O&M?

Semi-automated O&M arises because its solutions each address a single point: they turn the manual operation at that point into a scripted or platform-based automated action, but the results remain discrete points rather than lines or planes. True automated O&M achieves end-to-end automated delivery, automating the full chain from development to testing to operations and eliminating manual operation.

For example, to create a Redis middleware instance, the semi-automated approach is:

  • Apply for a machine on a virtualization platform;

  • Have the network team assign an IP address (manual);

  • Initialize the machine with a separate script (run manually);

  • Install Redis with the installation script (run manually);

  • Notify the applicant by email or manually.

In the automated approach, the applicant submits a request to create Redis, the automation platform does everything, and the platform then calls the mail interface to notify the applicant.
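The difference can be pictured as chaining the same steps into one pipeline, so that a single request runs them all without a person in between. The sketch below is purely illustrative: every step is a stub with a placeholder value, and the function names are hypothetical rather than calls into any real virtualization, network, or mail system.

```python
from typing import Callable, Dict, List

Context = Dict[str, str]
Step = Callable[[Context], None]


# Each stub stands in for a call to a real system (all hypothetical).
def request_vm(ctx: Context) -> None:
    ctx["vm"] = "vm-for-" + ctx["applicant"]


def assign_ip(ctx: Context) -> None:
    ctx["ip"] = "10.0.0.10"  # placeholder address


def initialize_os(ctx: Context) -> None:
    ctx["os"] = "initialized"


def install_redis(ctx: Context) -> None:
    ctx["redis"] = "installed"


def notify_applicant(ctx: Context) -> None:
    ctx["notified"] = ctx["applicant"]


PIPELINE: List[Step] = [request_vm, assign_ip, initialize_os, install_redis, notify_applicant]


def create_redis(applicant: str) -> Context:
    """Run every step in order; no manual hand-off between steps."""
    ctx: Context = {"applicant": applicant}
    for step in PIPELINE:
        step(ctx)
    return ctx
```

In the semi-automated approach each of these functions may already exist as a script; what is missing is the loop that connects them.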

Question 6: How should the boundary of in-house development for automated O&M be defined? Can it be autonomous and controllable while making full use of, and improving, employees' abilities?

There are two approaches to autonomy and control: one is fully in-house development; the other is secondary development based on a purchased automated O&M platform.

In the first case, the company's staff must have a certain level of development ability. The advantage is that the platform can be fully tailored to local needs; the disadvantages are the high demands on personnel and the slow pace at which the platform takes shape.

In the second case, the technology stack of the purchased platform must match the skills of the company's development or O&M staff, and the platform must either be open source or provide rich interfaces for secondary development. The advantage is that it can quickly meet at least 80% of the requirements; the disadvantages are that the team must understand the existing code and that flexibility is limited.

I hope the 18 questions above on enterprise automated O&M solutions are helpful to you.

Author: Wang Yang

Source: TalkWithTrend Subscription number (ID: TalkWithTrend)