Dingan Liang (Daliang), Director of Operation and Maintenance Technology, guest DevOps lecturer at Fudan University. With years of working experience in operation and maintenance (O&M), O&M development, and DevOps, I was responsible for the O&M planning and management of Qzone, photo albums, and other SNG social platform businesses, and went through the whole process of making SNG operation and maintenance standardized, automated, and intelligent. Head of Tencent Cloud.

1

Clickbait once again! This article mainly introduces the design ideas behind an operation and maintenance CMDB. A properly designed CMDB brings unexpected gains in O&M efficiency, such as alarm convergence and fault self-healing.

In designing the O&M automation platform, we have always advocated "reducing the number of O&M objects": abstract the O&M objects, model them, and record them as configuration managed in the CMDB. O&M tools then consume the data in the CMDB, and automated O&M processes maintain the properties and state of every O&M object in the CMDB through interfaces. This is the configuration basis for building an automated O&M system.

So what does a CMDB mainly record? In Cloud Weaving's practice of building a CMDB around business architecture, the configuration data can be divided into several categories (which can be extended as more configuration requirements arise):

• Business objects: business tree, architecture layer, etc.

• Hardware objects: hosts, network devices, etc.

• Application objects: software packages, configuration files, and scripts

• Custom objects: change information, password database, etc.

Each O&M object contains its own properties and state configuration, such as:

• Business tree: business hierarchy, responsible person, importance level, etc.

• Host: IP address, host name, operating system, uplink switch port, etc.

• Software packages: versions, deployment instances, processes, ports, and clearing policies;

• Change information: time, IP address, operation content, and operator.
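The objects and properties listed above could be modeled roughly as follows. This is a minimal sketch, not the author's actual schema: the class names, field names, and sample values (`BusinessNode`, `social/album/web`, `10.0.0.1`, and so on) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SoftwarePackage:
    name: str
    version: str
    process: str                      # process expected to be running
    port: int                         # port the process listens on
    clearing_policy: List[str] = field(default_factory=list)  # paths to clean


@dataclass
class Host:
    ip: str
    hostname: str
    operating_system: str
    uplink_switch_port: str


@dataclass
class BusinessNode:
    path: str                         # position in the business tree
    person_in_charge: str
    importance: int
    packages: List[SoftwarePackage] = field(default_factory=list)
    hosts: List[Host] = field(default_factory=list)


class Cmdb:
    """In-memory stand-in for a CMDB; a real one would sit behind an API."""

    def __init__(self) -> None:
        self._nodes: Dict[str, BusinessNode] = {}

    def register(self, node: BusinessNode) -> None:
        self._nodes[node.path] = node

    def node_for_ip(self, ip: str) -> Optional[BusinessNode]:
        # Relationship lookup: which business node does this host belong to?
        for node in self._nodes.values():
            if any(host.ip == ip for host in node.hosts):
                return node
        return None


# Hypothetical sample data.
cmdb = Cmdb()
cmdb.register(BusinessNode(
    path="social/album/web",
    person_in_charge="Liang",
    importance=1,
    packages=[SoftwarePackage("nginx", "1.9", process="nginx", port=80)],
    hosts=[Host("10.0.0.1", "web-01", "Linux", "sw-a/12")],
))
```

Keeping the host-to-business relationship in one place is what lets tools ask "whose machine is this?" without each tool maintaining its own list.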

Rules or relationships need to be established between O&M objects, as shown in the following figure:

2

Anyway, how does the design of the CMDB relate to alarm convergence and fault self-healing? Let’s classify the basic alarms commonly encountered by O&M.

• Capacity alarms: CPU and traffic indicator alarms.

• Process port alarm: Process does not exist or zombie process.

• Ping and crash alarms: Ping detection, agent reporting timeout;

• Disk alarm: The disk capacity is full or the disk is read-only.

It’s time for a CMDB system designed for business architecture to come into play. Let’s look at a simple example:


• Operating status: "Operating" means alarms should be raised normally. This field can also take the values faulty, testing, and waiting for application, each of which can correspond to different O&M operations.

• Person in charge: Liang is the person in charge of operation and maintenance for the business, and the equipment under the business automatically inherits this person-in-charge information;

• Architecture layer: Logical SPP indicates a standard development framework, that is, a standardized service that comes with the framework's monitoring data;

• Software package: nginx-1.9 gives the software and its version, and the record can be extended with the installation path, start and stop operations, etc.;

• Port: This field can be used for port monitoring and for O&M operations such as deployment detection and fault self-healing.

• IP cluster: specifies the deployment IP addresses of the service module, based on the CMDB configuration information or rules.

• Together, these fields support standardized O&M management functions such as automatic inspection, fault self-healing, architecture planning, and capacity consistency.
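As a rough illustration of how such a record can drive alarm handling, the sketch below stores the example as a plain dictionary and decides who, if anyone, should receive an alarm for a given IP. The key names, the port number, the IP addresses, and the rule that non-operating modules suppress alarms are assumptions for the sake of the example, not details from the original system.

```python
from typing import Optional

# Hypothetical CMDB record for one service module.
MODULE = {
    "operating_status": "operating",   # or: faulty, testing, waiting_for_application
    "person_in_charge": "Liang",
    "architecture_layer": "logical SPP",
    "software_package": "nginx-1.9",
    "port": 80,                        # assumed value
    "ip_cluster": ["10.0.0.1", "10.0.0.2"],  # assumed values
}


def alarm_target(module: dict, ip: str) -> Optional[str]:
    """Return who should be notified for an alarm on `ip`, or None to suppress.

    The host has no owner of its own: it inherits the person in charge
    from the module it is deployed under.
    """
    if ip not in module["ip_cluster"]:
        return None                    # host does not belong to this module
    if module["operating_status"] != "operating":
        return None                    # assumed rule: only operating modules alarm
    return module["person_in_charge"]
```

For example, `alarm_target(MODULE, "10.0.0.1")` returns `"Liang"`, while the same call on a module whose status is `"testing"` returns `None`.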

When a large number of basic alarms break out, O&M rules combined with CMDB configuration records give us automation capabilities in the following scenarios:

• Capacity threshold removal: The capacity of an IP cluster is aggregated into a single indicator along the service dimension, and Metis analyzes that single KPI intelligently and accurately, so no thresholds need to be managed.

• Indicator prediction: alarm on the hourly growth slope of an indicator, which is one way to implement threshold removal;

• Capacity consistency: If capacity is inconsistent across the hosts of an IP cluster, the cluster is considered abnormal and needs to be repaired during routine operation and maintenance.

• Accurate fault notification: The alarm is pushed precisely to the person in charge of the service's operation and maintenance, with no need to maintain a person in charge for each IP;

• Process or port alarm self-healing: When ps detects that a process or port does not exist, the process is automatically started based on the software package rules to rectify the fault.

• Ping/crash alarm self-healing: Based on the CMDB architecture layer and operating status information, the load balancing interface is automatically invoked to remove the stateless service from rotation and the host is restarted. After the fault is rectified, the host is added back to the load balancing service.

• Disk alarm self-healing: Based on the host data and log management rules, the clearing policy is executed automatically on a disk capacity alarm. If a disk is read-only, a restart is triggered automatically.

• Active inspection: Based on CMDB configuration information, such as software package version, load of each IP address, and running processes and ports, build custom active inspection tools that generate reports every day. This makes O&M fire prevention routine and reduces the chance of firefighting caused by inconsistent production environments.
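Several of the scenarios above can be sketched in a few lines once the CMDB lookup exists. The code below is an illustrative outline only: the `CMDB` dictionary, the alarm kind names (`ping_timeout`, `process_missing`, `disk_full`), the 30% tolerance, and the action strings are all hypothetical, and the actions are returned as a plan rather than executed.

```python
from collections import defaultdict
from statistics import median

# Hypothetical CMDB view: IP -> record of the module the host belongs to.
CMDB = {
    ip: {"module": "album-web", "owner": "Liang",
         "stateless": True, "package": "nginx-1.9"}
    for ip in ("10.0.0.1", "10.0.0.2", "10.0.0.3")
}


def converge(alarms):
    """Merge raw (ip, kind) alarms into one notification per (module, kind)."""
    groups = defaultdict(list)
    for ip, kind in alarms:
        groups[(CMDB[ip]["module"], kind)].append(ip)
    return [
        {"module": module, "kind": kind, "ips": sorted(ips),
         "notify": CMDB[ips[0]]["owner"]}   # owner inherited from the module
        for (module, kind), ips in groups.items()
    ]


def inconsistent_hosts(load_by_ip, tolerance=0.3):
    """Hosts whose load deviates from the cluster median by more than tolerance."""
    mid = median(load_by_ip.values())
    return sorted(ip for ip, load in load_by_ip.items()
                  if abs(load - mid) > tolerance * mid)


def plan_self_healing(ip, kind):
    """Return the ordered actions an automation would run for this alarm."""
    record = CMDB[ip]
    if kind == "process_missing":
        return [f"start {record['package']} on {ip}"]
    if kind == "ping_timeout" and record["stateless"]:
        return [f"remove {ip} from load balancer",
                f"reboot {ip}",
                f"add {ip} back to load balancer"]
    if kind == "disk_full":
        return [f"run clearing policy on {ip}"]
    return [f"notify {record['owner']}"]    # fall back to a human
```

With this shape, two ping alarms from the same module collapse into a single notification for Liang, and a host whose load sits far from the cluster median is flagged without any per-indicator threshold being configured.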

3

As the O&M capabilities achieved in the simple CMDB case above show, the effect on alarm convergence and fault self-healing is significant. Here, the author wishes to convey the following design ideas for a CMDB:

• The CMDB is the basic data configuration center of the operation and maintenance system and plays a key role in the architecture of the O&M platform;

• Avoid a large, all-encompassing CMDB design. Recording a limited amount of information already helps a lot, so start small;

• Automatic discovery is not a panacea. The management of O&M objects needs a two-pronged approach;

• The key to object management in the CMDB is ensuring that the information is accurate and consistent with production;

• The CMDB should provide a unified interface service, so that automated tools or processes maintain the configuration information;

• Do not insist on a single or formal CMDB. Configuration data can keep being put to use as long as the table structure can be extended;

• The concept of a generalized CMDB helps you better understand the CMDB and how to implement it.

A simple CMDB design can carry a large number of standardized O&M rules. Beyond that, in Tencent Cloud's O&M experience, making use of the data in the CMDB can exert even greater power in the era of AIOps.

The move from O&M engineering to O&M product manager gave me the opportunity to look back on and summarize, from a new perspective, the O&M systems I had built in the past. When communicating with many enterprise customers, I found a typical problem: traditional O&M thinking tends to "treat the head when the head aches and the foot when the foot hurts", addressing symptoms one at a time. Looking back on Tencent's O&M technology practice over the past 10 years, the building of the Tencent Cloud O&M system was, intentionally or not, led by the evolution of business operation and maintenance. Next time, I hope to have the opportunity to discuss with you the practice of CMDB in more O&M scenarios.


The author of this article, Liang Daliang, and Nie Xin, director of Tencent's social network operation department, will both attend the Tencent operation and maintenance special event to share their practice of O&M automation and intelligence.

“Every time you share, think about what your audience is getting out of it, because their time is precious.”

(From the speakers)

So the following two talks are bound to be substantive and full of sincerity. We will be waiting for you at GOPS Shanghai.
