Hi, I’m a post-sales engineer on the Pivotal Greenplum team, helping our business users around the world solve database problems. In practice, database administrators often ask us: “My SQL has been running for a long time without returning results. How should I investigate?” or “What should I do if a SQL statement in the database is stuck?” Today we will analyze and summarize this topic.

In general, common causes of problems such as SQL getting stuck or running for a long time are as follows:

  1. The SQL is blocked by a deadlock or lock wait
  2. The SQL is waiting because its resource queue or resource group has no free resources
  3. The SQL is stuck because of a problem on a segment or host

For causes 1 and 2, in Greenplum we can troubleshoot by querying the pg_locks view and by checking the status of the resource queues/groups through gp_toolkit or GPCC.
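To make this concrete, here is a minimal SQL sketch for checks 1 and 2, assuming the pg_locks, pg_stat_activity and gp_toolkit.gp_resqueue_status views shipped with Greenplum (column names such as procpid and mppsessionid may vary slightly between versions):

-- Sessions that are waiting for a lock, together with the lock they need
SELECT l.pid, l.mppsessionid AS sess_id, l.locktype, l.relation::regclass AS relation,
       l.mode, l.granted, a.usename, a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;

-- Resource queue status: are statements queued because no slots or memory are free?
SELECT * FROM gp_toolkit.gp_resqueue_status;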

Cause 3, on the other hand, is often much harder to troubleshoot. Below we describe in detail how to investigate SQL that does not return results for a long time, once deadlocks and resource queue/group issues have been ruled out.

First, we need to determine the session ID (also known as the connection ID) of the SQL. For example, if a TRUNCATE statement in our database has not returned a result for a long time, we can find its session ID with the following query:

gpadmin=# SELECT * from pg_stat_activity where upper(current_query) like '%TRUNCATE%';
 datid | datname | procpid | sess_id | usesysid | usename |                                current_query                                 | waiting |          query_start          |         backend_start         | client_addr | client_port | application_name |          xact_start           | waiting_reason | rsgid | rsgname | rsgqueueduration
-------+---------+---------+---------+----------+---------+-------------------------------------------------------------------------------+---------+-------------------------------+-------------------------------+-------------+-------------+------------------+-------------------------------+----------------+-------+---------+------------------
 16384 | gpadmin |   15103 |     310 |       10 | gpadmin | TRUNCATE aoco;                                                                | t       | 2019-11-12 00:24:56.017908-05 | 2019-11-12 00:24:47.147582-05 |             |          -1 | psql             | 2019-11-12 00:24:56.017908-05 | lock           |     0 | unknown |
 16384 | gpadmin |   15218 |   91014 |       10 | gpadmin | SELECT * from pg_stat_activity where upper(current_query) like '%TRUNCATE%';  | f       | 2019-11-12 00:28:06.973462-05 | 2019-11-12 00:24:58.490902-05 |             |          -1 | psql             | 2019-11-12 00:28:06.973462-05 |                |     0 | unknown |
(2 rows)

The sess_id shown in the output above is 310. Next, we use the gpssh tool to check the corresponding processes on every segment host:

$ gpssh -f <hostfile> 'ps -ef | grep con<sess_id> | grep -v idle'

Note: con<sess_id> is the session ID prefixed with "con"; for sess_id 310 the pattern is con310.

Here we exclude idle processes, because an idle process has already finished processing the slice assigned to it and is of no further interest. From the output of the gpssh command above, we get a general picture of how all the processes belonging to this SQL are running across the Greenplum cluster.

  1. If only one segment process is not idle, the most common cause is data skew: either the table referenced by the SQL is unevenly distributed, or the join key used by the SQL produces an uneven distribution during processing (see the skew-check sketch after this list).
  2. If the non-idle processes are confined to a single host, we should focus on whether that host has a hardware problem, most commonly with its disks. Is the disk read/write utilization constantly at 100%? Is other SQL also stuck on that host?
  3. If the SQL processes are evenly distributed and busy across all hosts, it is likely that the amount of data the SQL has to process has grown, so it simply takes longer. The worst possibility is that the SQL is slow because the Greenplum optimizer has chosen a poor plan for it.
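As mentioned in item 1 above, a quick way to check for data skew is to count the rows stored on each segment for the suspect table. A minimal sketch, where my_table stands in for the table referenced by the slow SQL (gp_toolkit also provides ready-made skew views such as gp_toolkit.gp_skew_coefficients, though they can be expensive to query on large databases):

-- A large imbalance in the per-segment row counts indicates data skew
SELECT gp_segment_id, count(*) AS row_count
FROM my_table
GROUP BY gp_segment_id
ORDER BY row_count DESC;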

In any of these cases, strace and pstack help us understand what the SQL is doing at the process level, so we recommend installing these two tools on all hosts. (In addition, for the disk problem in case 2, we recommend the iotop tool to identify which processes are consuming the most disk I/O.)
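A minimal iotop example (the -o flag of the standard Linux iotop limits the output to processes and threads that are actually doing I/O):

$ iotop -o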

strace is a tool that dynamically traces a process's system calls (the interaction between the process and the operating system kernel). You can use strace -p <pid> directly, or use the following command:

strace -T -f -ff -y -yy -p <pid>

In our experience, this combination of strace options produces more complete and readable output: -T shows the time spent in each system call, -f follows child processes, and -y/-yy resolve file descriptor arguments to file paths and socket details. strace traces a process dynamically over time, while pstack prints the call stack of a process at a single point in time. The call stack tells you exactly what a process is doing at that moment. For example:

[root@bdlgpscprd07 gpadmin]# pstack 6495
Thread 2 (Thread 0x7f442920e700 (LWP 6504)):
#0  0x00007f4424e2d043 in poll () from /lib64/libc.so.6
#1  0x0000000000bc1ac4 in rxThreadFunc ()
#2  0x00007f442588f9d1 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f4424e3686d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f44293d2720 (LWP 6495)):
#0  0x00007f4424e2f2e3 in select () from /lib64/libc.so.6
#1  0x00000000008e1cfa in sendSequenceRequest ()
#2  0x00000000006ee69d in nextval_oid ()
#3  0x0000000000759606 in ExecMakeFunctionResultNoSets ()
#4  0x000000000075e662 in ExecProject ()
#5  0x0000000000784727 in ExecHashJoin ()
#6  0x0000000000757d64 in ExecProcNode ()
#7  0x000000000078f249 in SubqueryNext ()
#8  0x0000000000762d5b in ExecScan ()
#9  0x0000000000757c09 in ExecProcNode ()
#10 0x0000000000798b0a in ExecMotion ()
#11 0x0000000000757b66 in ExecProcNode ()
#12 0x000000000074d5bc in ExecutePlan ()
#13 0x000000000074dd76 in ExecutorRun ()
#14 0x000000000099d634 in ProcessQuery ()
#15 0x00000000009a0050 in PortalRun ()
#16 0x0000000000995dc0 in exec_mpp_query ()
#17 0x0000000000999c95 in PostgresMain ()
#18 0x00000000008f6e3e in ServerLoop ()
#19 0x00000000008f9ae0 in PostmasterMain ()
#20 0x00000000007fb13f in main ()

We can see from the call stack above that the process is executing a hash join (ExecHashJoin). If we see that all processes are slow at this point, we can investigate whether there is a problem with the corresponding hash join. (A more specific analysis may require looking at the source code.)
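If processing skew on the hash join is suspected, it can also help to look at how the join key values are distributed. A hedged sketch, using a hypothetical table orders joined on customer_id (substitute the real table and join column from your SQL):

-- A handful of very frequent key values can force one segment to do most of the hash join work
SELECT customer_id, count(*) AS cnt
FROM orders
GROUP BY customer_id
ORDER BY cnt DESC
LIMIT 10;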

For pstack, we can use a simple shell loop that prints the call stack every 20 seconds, which gives a better picture of how the process is progressing over time:

$ while true; do date; pstack <pid>; sleep 20; done;

After using strace and pstack we should have a rough idea of where the problem lies: for example, disk operations are slow, or most processes are waiting for data on the network, and so on. If strace and pstack still do not give enough insight into the problem, users can contact Pivotal technical support for help. (Note: if the problem can be reproduced in your environment, that greatly helps Pivotal technical support locate the cause.) So how can we help Pivotal support gather the relevant information?

Pivotal technical support has developed the gpmt analyze_session tool to help users gather information about stuck SQL.
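A hedged invocation sketch, where 310 is the session ID found earlier (the exact option names may differ between gpmt releases, so check the gpmt help output before running it):

$ gpmt analyze_session -session_id 310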

GPMT is an excellent log collection tool and we highly recommend installing it in your environment. In addition, if the problem can be reproduced, we recommend using minirepro to collect the DDL and statistics for the SQL in question. We covered this collection procedure in a previous article; see the link for details.
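A hedged minirepro sketch, assuming the problem SQL has been saved to query.sql and we connect to the gpadmin database with default connection settings (check the minirepro documentation for the exact options in your release):

$ minirepro gpadmin -q query.sql -f minirepro_dump.sql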

That’s all for today’s sharing. Thank you very much for your attention, and please keep following our article series. If you have any questions, leave us a comment; you can also stay in touch with us through Pivotal technical support!


For more technical know-how on Greenplum, please visit the Greenplum Chinese community website.