Abstract: This paper attempts to conceptually and logically deduce the possible application scenarios of database intelligent monitoring system based on user role.

This article is shared by Master Lu from The Application Scenario Analysis of GaussDB(DWS) Database Intelligent Monitoring System in Huawei Cloud community.

Similar to the project approval mode of Internet products, when we define and design a new product, we first need to analyze the needs of users, summarize and synthesize the needs of users, and define our product positioning, functions, business logic, user interface and so on. Therefore, in order to design a good database intelligent monitoring system, we need to do demand analysis of the target users of the database monitoring system, collect user demands, dig the potential needs of users, and draw the portrait of typical users. Finally, the implementation architecture of the database monitoring system is designed, and the various requirements of typical users are incorporated into the design architecture pipeline of the product.

Users of the intelligent database monitoring system

In actual application scenarios, users of a database monitoring system may have many different roles. Different companies may have more segmented or aggregated user roles due to different organizational structures. But in general, it can be summarized into the following three types of users:

  • APP DEV

  • Operation & Maintenance Engineer (SRE)

  • Database Administrator (DBA)

Application development Engineer role: Mainly responsible for developing business SQL in cloud applications, responsible for the function and performance of cloud services. At the same time, you need to ensure that the WRITTEN SQL is high quality and efficient, and does not cause additional resource consumption and time consumption to the cluster. Therefore, application development engineers need to be able to monitor the newly developed SQL and understand the execution efficiency and resource consumption of the new query statement.

O&m Engineer Role: Responsible for ensuring the long-term and stable running of the database cluster. The database system needs to be evaluated in terms of resource consumption and system load. You must be able to configure alarm scenarios for the database, view real-time or predicted database alarm information, and report discovered problems to the database administrator for further handling. In general, the operations engineer role will monitor a large number of database clusters, and he will not do a very in-depth analysis of each cluster, but more of a problem finder role.

Database administrator role: Locates root causes of database problems and provides solutions. A database administrator must be an expert in the database field and be familiar with all aspects of the database. He can analyze database monitoring data from multiple dimensions, locate database faults, and provide solutions.

It should be noted that the above three roles do not refer to positions in the actual production environment, but are typical role symbols summarized for the convenience of analyzing user needs. In a real production environment, there may be a scenario where all three roles are the same person, or where the SRE position has both the SRE and DBA roles. We divide users into three roles here, mainly for the convenience of requirements analysis and the construction of the corresponding portrait, so as to further lock the tools required by the corresponding role. Finally, presents to everyone a clear idea of database monitoring system development concept context.

Intelligent database monitoring system tool and its application scenarios

Through the above abstraction and combing, we find that the three roles correspond to different requirements in the process of database monitoring operation and maintenance, and different requirements will inevitably lead to different tools or different focuses of the same tool. Let’s look at each of the three roles and detail the tools they will use:

Application development roles, who only care if their SQL is efficient, uses the cluster’s various optimization features, and consumes too much of the cluster’s resources? Therefore, he needs a tool that allows him to evaluate the execution efficiency of the SQL he writes, namely the WebSQL tool, which allows users to simply connect to the database and execute SQL statements. WebSQL can return the execution results and execution plans of SQL statements to help application development roles understand the execution efficiency of their SQL statements. At the same time, the user’s SQL statements are not simply executed in a single statement, but need to be executed across the entire job flow. A baseline to measure execution time and resource consumption in the job flow becomes important. Therefore, we need a query monitor that can record execution time and resource consumption for a feature’s SQL, and calculate the maximum, minimum, and average values as a baseline to further help users evaluate the execution efficiency of their SQL. At the user site, where a user’s job is bound to a workload queue for execution due to resource isolation requirements, data such as the allocation of resources that work is placed in the queue and the load level of the workload queue becomes important. Will the workload queue be overloaded with new SQL statements? The application development role needs to have an intuitive understanding of whether the resources currently allocated to the queue are appropriate before the application goes online.

System Operation and Maintenance roles (SRE), who are concerned about the long-term stability of the large number of database systems in the cloud, based on this need, we intend to provide three aspects of the tools to solve the problem.

The health index index is a composite index supported by two indicators, resource consumption index and database system load index. And these two indexes are supported by the atomic index and the extension index of the next layer. The calculation of cluster health index needs to design a set of corresponding mathematical model, based on which we can quantify the system health index, so that the system administrator can quickly find the problem database from the hundreds of databases on the cloud very easily.

In addition to passive indicators such as health indicators, which need to be viewed by the system administrator in person, the DMS further provides comprehensive alarm capabilities. The DMS provides the alarm capabilities of the database from three levels: (1) On the DMS-Agent, it analyzes the logs of the operating system and database on the DMS-Agent node in real time by means of log analysis. When a threat keyword is found, the alarm is triggered immediately and reported to the alarm platform through the corresponding channel. (2) On the DMS server, because THE DMS has all the monitoring data of the database cluster, we can design corresponding alarm rules by means of data analysis and database expertise, periodically check the database cluster, and directly trigger alarms when problems are found; (3) The database cluster indicator data collected by the DMS can be used as the threshold alarm indicator. All database cluster indicators are connected to CES and used for threshold alarms. The preceding three types of alarms must be configured and displayed on the front-end page of the DMS.

There is a natural connection between artificial intelligence and cloud computing. When the database is in the cloud, AIOps, the intersection node of artificial intelligence and database operation and maintenance, will naturally appear. Because the DMS has all the monitoring data of the data cluster, it uses the historical monitoring data to determine the working mode of the cluster and recommends the optimal configuration parameters. Forecast the growth trend of database disk space and inform users of capacity expansion or operation and maintenance requirements in advance. All this is possible thanks to artificial intelligence.

The role of database administrator (DBA), database administrator has always been the steward of the database, in the traditional data center, they are responsible for the performance optimization of the database, also responsible for the long-term stable operation of the database, sometimes even to help application development engineers optimize SQL. However, in the cloud era, the division of work of database administrator will become more refined, application development and system administrators share a part of the work of database management, making the role of database administrator more pure. As a database domain expert, the database administrator is responsible for locating the root causes of database problems and providing solutions. The two roles of system administrator and database administrator finally form the task loop of finding problems, analyzing problems and solving problems. Therefore, in the cloud, an SRE position often includes the responsibilities of the SRE+DBA roles.

A DBA is a database expert and a master at using database tools to locate various database problems. To locate the root cause of the problem, he will need two types of tools: fault analysis tools and fault self-healing tools. The fault analysis tool provides different visualized forms of monitoring data to help the database administrator quickly locate the root cause of a problem. Fault self-healing tools solidify the database administrator’s past experience in locating and solving problems. There will be more self-healing tools in the future as we learn more about how DBAs work.

Another important role of the database administrator is to provide troubleshooting solutions, which is a very important part of the operation and maintenance system. No matter how good a fault locating tool is, if there is no solution to the problem, it will not really help the user. Therefore, we need to build a professional root cause solution search engine, helping users also help us to accelerate the process of solving problems, ease the work intensity of the front line customer support staff.

This paper is the second of three articles to introduce the core concepts of the design of the database monitoring operation and maintenance system on the cloud. It tries to deduce the possible application scenarios of the database intelligent monitoring system based on the user role from the concept and logic. With this basic framework in place, the work and tools we need to do next are clear. May our expectation become an indication as soon as possible, so that the cloud database operation and maintenance work become more easy and intelligent.

Want to know more information about GuassDB(DWS), welcome to wechat search “GaussDB DWS” follow the wechat public account, and share with you the latest and most complete PB series warehouse black technology, background can also obtain a lot of learning materials oh ~

Click follow to learn about the fresh technologies of Huawei Cloud