Abstract: When GaussDB runs in a large-scale cluster, the performance of some nodes may deteriorate severely over time. These nodes can still provide services, but they respond slowly: the same request takes much longer to process than on other, normal nodes, which drags down the performance of the whole cluster. Such nodes are called "sub-healthy nodes", or "slow nodes".

This article is shared from the Huawei Cloud community post "GaussDB(DWS) New Technology: Cluster Slow Nodes Have No Place to Hide", by Cauliflower.

Background

Introduction to slow instances

When GaussDB runs in a large-scale cluster, the performance of some nodes may deteriorate over time. These nodes can still provide services, but they respond slowly: the same request takes much longer to process than on other, normal nodes, which drags down the performance of the whole cluster. Such nodes are called "sub-healthy nodes", or "slow nodes".

Accurately identifying sub-healthy nodes in a large-scale cluster is a difficult problem. After in-depth observation and analysis, we found that a slow node response may also be caused by workload skew, for example when some data is accessed far more often than the rest, or when the workload of certain nodes increases dramatically because some services require more operations. Nodes in this situation are not slow nodes with degraded performance. If a slow node were judged by response time alone, misjudgment would be easy, and frequent false alarms would degrade the user experience and increase the troubleshooting burden on engineers. This feature therefore has to identify slow nodes quickly and accurately while fully accounting for workload skew.

Slow instance acquisition mechanism

The wait state view pg_thread_wait_status in GaussDB shows the instantaneous state of each thread on each instance. Through in-depth analysis of the monitoring principle and of data from production clusters, we found that this view can effectively reveal nodes with slow responses: on average, the slower a node responds, the more waiting events accumulate on it, and the two are positively correlated. The responsiveness of a node can therefore be measured by counting the waiting events of each node in this view. Production experience shows that if most of the waiting DNs in the wait state view belong to the same node, that node is very likely a slow node.
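
To make this concrete, the following is a minimal sketch of such a count, run against pg_thread_wait_status. The exact text of the wait_status column (and whether a cluster-wide variant of the view should be queried instead) varies across GaussDB versions, so the "wait node:" pattern below is an assumption for illustration, not the feature's actual query.

select
      -- extract the name of the DN being waited on (wait_status format is version-dependent)
      substring(wait_status from 'wait node: *([^,)]+)') as waited_dn,
      count(*) as wait_count
from
      pg_thread_wait_status
where
      wait_status like 'wait node:%'
group by
      1
order by
      wait_count desc;

A node whose waited_dn count dominates the result across several consecutive samples is a candidate slow node, per the heuristic described above.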

Take the query results of the wait state view on an ICBC cluster, shown in Figure 1, as an example: in consecutive queries, node 890 had the highest number of waiting events in the whole cluster, accounting for more than 50% of all waits. A final hardware inspection confirmed that node 890 did have a hardware fault that caused it to run slowly.

The data nodes (Datanodes, DNs) of GaussDB work in primary/standby mode. If all DNs on a physical machine are standbys, that machine does not serve requests directly, so its waiting events cannot be observed through the pg_thread_wait_status view. However, a primary DN must synchronize logs to its standby DN; if such an all-standby physical machine happens to be a slow node, log synchronization is slow, which in turn degrades the performance of the primary DNs. Although the number of waits shown by pg_thread_wait_status on the hosts of those primary DNs increases, the count on any single physical node may not reach the threshold, because the primary DNs corresponding to the standby machine are spread across multiple physical nodes. In this case, the waits attributed to the standby DNs of those primary DNs must be aggregated further, to indirectly obtain a waiting count for the "all-standby" physical machine.
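
As an illustration of this indirect calculation, the sketch below attributes waits on primary DNs to the physical host of the corresponding standby DN, so that an all-standby machine can accumulate a waiting count of its own. It assumes that the pgxc_node catalog exposes the standby host in a column such as node_host1; column names and the exact topology source differ between versions, so treat this only as a sketch of the idea.

select
      n.node_host1 as standby_host,            -- physical host of the standby DN (column name assumed)
      count(*) as indirect_wait_count
from
      pg_thread_wait_status s
join
      pgxc_node n
on
      s.wait_status like 'wait node: ' || n.node_name || '%'
where
      n.node_type = 'D'                        -- data nodes only
group by
      n.node_host1
order by
      indirect_wait_count desc;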

Figure 1.1 Identifying slow nodes through the wait state view

Due to primary/standby switchovers and data skew, the load on the nodes of a cluster is unbalanced, and high load may itself cause a node to respond slowly. To avoid false positives, the interference of load skew with sub-healthy node identification must be eliminated. To simplify the judgment, we further assume that load skew and a fault do not occur on the same node at the same time. In other words, if a node carries a high load, its slow response is considered normal and it is not reported as a slow node.

The load level of a node can be measured by its data access and computation, which can be quantified by the number of I/O blocks, the CPU time consumed, and the network traffic. GaussDB internally collects statistics on block I/O, CPU time, and network traffic and exposes them in various views. By polling these views periodically, you can obtain the workload of each node per unit of time, and thus its load level in each period.
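
The sketch below shows how such periodic samples might be turned into a per-interval load level and a skew flag. It uses a hypothetical staging table node_load_snapshot, assumed to be filled by the detection script from the statistics views mentioned above; the plain sum of the three deltas and the "twice the cluster average" threshold are placeholders for whatever weighting and thresholds the real feature applies.

-- node_load_snapshot is hypothetical: (host_name, sample_time, blocks_io, cpu_time, net_bytes)
-- holding cumulative counters sampled periodically from the statistics views
with delta as (
      select
            host_name,
            sample_time,
            blocks_io - lag(blocks_io) over w as io_delta,
            cpu_time  - lag(cpu_time)  over w as cpu_delta,
            net_bytes - lag(net_bytes) over w as net_delta
      from
            node_load_snapshot
      window w as (partition by host_name order by sample_time)
)
select
      host_name,
      io_delta, cpu_delta, net_delta,
      -- a node whose combined load is well above the cluster average is treated as load-skewed
      -- and is not reported as a slow node even if its wait count is high
      (io_delta + cpu_delta + net_delta) >
            2 * avg(io_delta + cpu_delta + net_delta) over () as load_skewed
from
      delta
where
      sample_time = (select max(sample_time) from delta);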

As with the number of waiting events, the load level of an "all-standby" physical machine must also be derived indirectly, from the load of its corresponding primary DNs. A standby DN rarely participates in computation and mainly receives logs, so how busy the standby machine is can be estimated from the I/O volume and network traffic of its primary DNs. Because this indirect calculation is comparatively inaccurate, it is applied only to "all-standby" nodes; as long as at least one primary DN exists on a physical machine, that machine's own primary-DN data is used directly.

Monitor slow instances

Scenario analysis

Based on an analysis of customer requirements, the requirement can be defined as displaying, as a time series and in detail, when and how often slow nodes occur in the cluster. It covers the following main functions:

A. Enable or disable slow instance detection

B. Set slow instance detection parameters

C. Collect slow instance information and report it to the DMS database

D. Display the slow instance trigger counts and slow instance names as a time series on the page

Overall architecture design of the solution

Figure 2.1 Overall architecture diagram of slow instance monitoring

The overall solution adds slow instance detection as a new end-to-end capability, invoking the related health check module to produce slow instance data. The end-to-end flow is as follows:

(1) DMS-Agent invokes the health check script and passes in the collection configuration parameters that have been delivered. This is invoked mainly at initial startup and when the process is restarted after an interruption.

(2) The health check script collects slow node data and saves it to the user database.

(3) DMS-Agent obtains the relevant data by querying that database.

(4) DMS-Agent reports the data according to the reporting frequency delivered by DMS-Collection.

(5) DMS-Collection receives the data reported by DMS-Agent and writes it into the DMS database.

(6) DMS-Monitoring queries the DMS database to obtain the slow node data, which is displayed on the front-end page of DWS-Console.

(7) DWS-Console delivers the start/stop configuration to DMS-Agent through DMS-Monitoring, and DMS-Monitoring persists the start/stop information to the DMS database. DWS-Console likewise persists the monitoring parameter configuration to the DMS database through DMS-Monitoring.

(8) DMS-Monitoring delivers the modified configuration to DMS-Agent through gRPC; DMS-Agent then collects and reports data based on the new reporting frequency and configuration.

(9) After the configuration is updated, DMS-Agent restarts the health check script based on the new configuration.

Slow instance data interpretation and data processing

Slow instance table structure design

drop table if exists dms_mtc_cluster_slow_inst cascade;
create table dms_mtc_cluster_slow_inst (
      ctime bigint not null,              -- collection time
      virtual_cluster_id int not null,    -- virtual cluster ID
      check_time bigint,                  -- check time
      host_id int,                        -- host ID
      host_name varchar(128),             -- host name
      inst_id varchar(64),                -- instance ID
      inst_name varchar(128),             -- instance name
      primary key (virtual_cluster_id, inst_id, check_time)
);

Slow instance data processing

Data insertion: DMS-Collection receives the data reported by DMS-Agent and stores it in the database

insert into
       dms_mtc_cluster_slow_inst (ctime, virtual_cluster_id, check_time, host_id, host_name, inst_id, inst_name)
values
      (#{sinst.ctime}, #{sinst.virtual_cluster_id}, #{sinst.check_time}, #{sinst.host_id}, #{sinst.host_name}, #{sinst.inst_id}, #{sinst.inst_name});

Data query

DMS-Monitoring queries the time series and number of slow instances

select
      check_time, count(*) as slow_inst_num
from
      dms_mtc_cluster_slow_inst
where
      virtual_cluster_id =
            (select
                    virtual_cluster_id
             from
                    dms_meta_cluster
             where
                    cluster_id = #{clusterId})
and
      check_time >= #{from}
and
      check_time <= #{to}
group by
      check_time;

DMS-Monitoring queries slow instance detail data

select
      t1.check_time, t1.host_name, t1.inst_name, t2.trigger_times
from
             (select
                    check_time, host_name, inst_name
             from
                    dms_mtc_cluster_slow_inst
             where
                    virtual_cluster_id = (
                           select
                                  virtual_cluster_id
                           from
                                  dms_meta_cluster
                           where
                                  cluster_id = #{clusterId})
             and
                    check_time = #{checkTime}) as t1
      inner join
             (select
                    inst_name, count(*) as trigger_times
             from
                    dms_mtc_cluster_slow_inst
             where
                    virtual_cluster_id = (select
                           virtual_cluster_id
                    from
                           dms_meta_cluster
                    where
                           cluster_id = #{clusterId})
             and
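                     -- 86400000 ms = 24 hours: count how often each instance was flagged in the last day (check_time assumed to be in milliseconds)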
                    check_time >= (#{checkTime} - 86400000)
             and
                    check_time <= #{checkTime}
             group by
                    inst_name) as t2
      on
             t1.inst_name = t2.inst_name; 

For more information about GaussDB(DWS), search for "GaussDB DWS" on WeChat and follow the official account to get the latest and most complete PB-scale data warehouse technology content; you can also obtain many learning materials by messaging the account.

Click Follow to be the first to learn about Huawei Cloud's latest technologies.