Abstract: Live network in the use of dynamic load management, often appear a lot of waitin CCN situation, we will think that hung live or what, very anxious, but wait CCN is actually a state of waiting for resources, in this summary of a CCN problem processing blog, CCN problems can be dealt with through this post.

This document is shared in the Huawei cloud community. When GaussDB(DWS) waits in CCN Queue, how can I Locate and process the FAULT quickly? By Malick.

preface

When dynamic load management is used in live network, there are often many wait in CCN situations, people will think that hung or what to do, very anxious, but wait CCN is actually a state of waiting for resources, in this summary of a CCN problem handling blog, CCN problems can be handled through this post.

Background knowledge

1. Which is CCN:

Connect the environment,

Source environment variable

source/opt/huawei/Bigdata/mppdb/.mppdbgs_profile

Perform:

cm_ctl query -Cv | grep Cen -A 4

The results are as follows:

5003 is the CCN of the cluster.

What is CCN: CCN is a cluster concurrency control brain. All complex jobs will apply for resources to CCN, and only the statements that apply for resources can be delivered. Complex statements are uniformly recorded in CCN.

2. View Explanation:

  • pg_stat_get_workload_struct_info();

  • Totalsize represents the total memory that CCN can allocate. Totalsize is the maximum dynamic memory. Freesize_limit is the maximum memory available for CCN allocation, which is 80% of the maximum dynamic memory. Freesize represents the current free memory.

  • You only need to pay attention to the centralWaiting /running number in the diagram (global can be ignored, it belongs to another data structure, and centralwaiting is duplicate information). . Each line represents a statement. Running indicates that the statement is running, waiting indicates that the statement is queuing. QueryId indicates the thread id of the statement, corresponding to lwTID in pg/ PGXC_THREAD_WAIT_STATUS and processid in pg_sessiION_wlmstat.

  • pg_session_wlmstat/pgxc_session_wlmstat();

Step 1 determine the problem scenario

  • Connect CCN to query the following statements to determine the problem scenario:

The first step is to query pgxC_stat_activity to determine whether a large number of statements are in wait CCN. Or statements for a resource pool are in wait CCN.

  • Query pg/ PGxC_session_wlmstat to check whether all complex statements are queued. Or statements in the same queue are queued.

The first step is to connect the CCN node and query

select * from pg_stat_get_workload_struct_info();

Second, query pgxc_session_wlmstat();

select threadid,processid,usename,attribute,status,enqueue,statement_mem,active_points,control_group,resource_pool,substring(qu ery,position(‘explain’in query),20) as subquery from pg_session_wlmstat order bystatus,attribute,usename,subquery,resource_pool;

Determine which solution to use based on the following scenarios:

1) If some statements are in the Running state in the Workload view, and the Running statements occupy a large amount of memory and freesize, and a large number of statements are in waiting state, scenario 1 of problem processing can be basically determined.

2) If the workload view has a running state, but the PGxC_STAT_activity or PG_session_wlmstat view has only a waiting state, and the WORKLOAD view has a running state, Two or more statements have the same qID. queryId value. So basically decide to go problem solving scenario two.

3) If all statements are in waiting state and no statements are in running state, scenario 3 is basically processed.

Processing scenario a large memory statement caused problems

The first step is to find statements in the Workload view that consume too much memory.

As shown in the figure above, the total available memory is 1638MB. One statement currently running occupies 1048MB, and the remaining memory freesize=590MB

At this point, the estimated size of other statements is 600MB, so there is no memory to deliver, until the end of the 1048 statement, memory release will be normal.

The second step is to find the PID of the statement according to its corresponding qid.queryId. The picture above is 9145

Select coorname, pid, usename, substr (query, 0, 30) frompgxc_stat_activity a, pgxc_thread_wait_status b where Amy polumbo id = b.t id and b.lwtid= $qid.query_id;

Step 3 Check the large memory statement according to PID and CN. It can be restored after the memory is freed.

Troubleshoot hash or other statement residues in scenario 2

Step 1 Confirm the concurrent configuration on the problematic resource pool:

select * from pg_resource_pool;

Step 2 If only the resource pool concurrency limit is reached, for example, if the resource pool concurrency is set to 10, the remaining number of running statements is 10. Because the resource pool concurrency limit is reached and all statements are in the waiting state, adjust the queue concurrency to -1. After the limit is not set, the statements waiting for concurrent statements can be delivered.

The following uses son_pool as an example:

alter resource pool son_pool with(active_statements=-1);

Step 3 Clean up the problem statement (the connection is constantly open, the thread is not released, residual information will not be automatically cleaned up)

Note: The processed statement is processed according to whether the processed statement still exists. If the processed statement does not exist, the processed statement is processed. If the processed statement has been processed, the thread is not released. Residue will not be automatically cleaned.

  • Determination of problem statements:

The duplicated qid. QueryId statement in the Workload view is a problem statement. The problem thread repeats two statements, perhaps one of which is normal and the other is residual. It could all be problematic, but at the end of the day only one active statement is actually queued or executed.

2) Clean up the problem statement method, based on the duplicate qid. QueryId mentioned in 1) above, find the problem statement:

Select coorname, pid, usename, substr (query, 0, 30) frompgxc_stat_activity a, pgxc_thread_wait_status b where Amy polumbo id = b.t id and b.lwtid= $qid.query_id;

Third, pg_terminate_BACKEND (PID) is used to check and kill residual statements based on PID and CN. Restore after releasing concurrency and memory resources.

Troubleshoot the long jump lock problem in scenario 3

The first step is to identify the problem

Call stack

gstack $ccn_pid >ccnStack.log

grep grep pthread_mutex_lockccnStack.log

If the result is similar to the following, confirm the problem

The second step is emergency treatment

Treatment methods:

kill -9 ccn_pid

Click to follow, the first time to learn about Huawei cloud fresh technology ~