What to Collect to Resolve a "Hung" Query?

Article ID: 295171

Products

VMware Tanzu Greenplum

Issue/Introduction

It is often necessary to terminate a query that is 'hung': a session that has been running for an excessively long period of time, or one that is still active but is no longer making progress.

Before terminating the processes associated with the query, use the following steps to collect the materials that are, in most cases, needed to determine why the query hung.

Note: These steps are now automated by the analyze_session option of the gpmt utility (versions prior to 7.x) and the analyze_session option of the gpsupport utility (versions 7.x and above).
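For reference, a typical invocation might look like the following. The flag names shown here are assumptions and can vary by version, so confirm them with gpmt --help or gpsupport --help before running:

  # Flag names are assumptions; verify with --help for your version.
  gpmt analyze_session -session_id SESS_ID       # versions prior to 7.x
  gpsupport analyze_session -session_id SESS_ID  # versions 7.x and above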

Resolution

What to collect

1. Basic DB Queries

  • List of the running queries (this also provides the session ID, SESS_ID, used in the later steps; see the example after this list):
    select * from pg_stat_activity;
  • List of locks on the coordinator (master) and all segments:
    select gp_segment_id, * from pg_locks;
    select * from gp_toolkit.gp_locks_on_relation;
    select * from gp_toolkit.gp_locks_on_resqueue;
  • Dump current status of resource queues:
    select * from gp_toolkit.gp_resq_activity;
    select * from gp_toolkit.gp_resqueue_status;
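To find the SESS_ID value used in the remaining steps, query the Greenplum-specific sess_id column of pg_stat_activity. A minimal sketch run through psql; column names vary by version (older releases use procpid and current_query instead of pid and query), so adjust as needed:

  # Identify the hung session's sess_id; adjust column names for your version.
  psql -c "select sess_id, usename, query_start, query from pg_stat_activity order by query_start;"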

2. Basic process information

On all hosts (coordinator and segments), collect basic process information. In the commands below, SESS_ID is the session ID of the hung session (the sess_id value from pg_stat_activity). Collect the following output:

  • List of all processes running on all segments:
    gpssh -f hostfile 'ps -elfy'
  • List of all postgres processes running on all nodes:
    gpssh -f hostfile 'ps -flyCpostgres'
  • List of all processes for the 'hung' session:
    gpssh -f hostfile 'ps -flyCpostgres | grep conSESS_ID'
  • Verify whether any of the processes are stuck in IO wait (state D, the first column of ps -fly output). This may indicate issues with either the OS or the underlying hardware:
    gpssh -f hostfile 'ps -flyCpostgres | grep conSESS_ID | grep ^D'
  • List any processes for the query in question that are in the running state (a combined one-pass version of this check and the IO wait check is sketched after the note below):
    gpssh -f hostfile 'ps -flyCpostgres | grep conSESS_ID | grep ^R'

Important: If any processes are consistently in IO wait, stop at this step and check for an OS or hardware issue (e.g., an XFS file system bug, a lost NFS mount, a faulty RAID controller). Processes stuck in IO wait are often the cause of the 'hung' query, and the underlying issue must be addressed before going any further.
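As a convenience, the IO wait and running-state checks above can be combined into a single pass once a concrete session ID is known. A minimal sketch, assuming a hypothetical session ID of 1234 and the same hostfile:

  SESS_ID=1234   # hypothetical; use the sess_id found in step 1
  # D = stuck in IO wait, R = running
  gpssh -f hostfile "ps -flyCpostgres | grep con${SESS_ID} | egrep '^[DR]'"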

3. What to Collect from Processes

Based on the above collections, collect information on the processes themselves. There are some caveats to be aware of:

  • Do NOT attempt to collect information on any process that is in IO wait (see the IO wait check in step 2 above); the collection tools themselves will hang.
  • For software-only installs the following utilities will need to be installed prior to proceeding:
    • gdb
    • pstack
    • strace
    • lsof

Collect the following information on each segment host and the coordinator host (a consolidated per-host script is sketched after this list).

  • pstack output for all session processes:
    ps -flyCpostgres | grep conSESS_ID | grep -v ^D | awk '{print $3}' | while read pid; do pstack $pid > `hostname`-pstack.$pid; done
  • lsof output for all session processes:
    ps -flyCpostgres | grep conSESS_ID | grep -v ^D | awk '{print $3}' | while read pid; do /usr/sbin/lsof -p $pid > `hostname`-lsof.$pid; done
  • For any process that is running, collect some additional information:
    • A core file for each running process:
      ps -flyCpostgres | grep conSESS_ID | grep -v ^D | awk '{print $3}' | while read pid; do gcore -o `hostname`-gcore.$pid $pid; done
    • Generate a command file to collect the strace output from each of the running processes:
      ps -flyCpostgres | grep conSESS_ID | grep ^R | awk '{print $3}' | xargs -n1 -I '{}' echo '(strace -p {} -o strace -ff )  & sleep 15 ; kill $!' > run_strace.sh
    • Run the above command file and wait for it to complete:
      bash ./run_strace.sh
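The per-process collections above can also be wrapped into a single script run on each host. A minimal sketch, assuming a hypothetical session ID of 1234 and an output directory under /tmp:

  #!/bin/bash
  # Sketch only: SESS_ID and OUT are assumptions; adjust for your environment.
  SESS_ID=1234
  OUT=/tmp/hung_query_$(hostname)
  mkdir -p "$OUT"
  # Skip processes in IO wait (state D); pstack and gcore would hang on them.
  ps -flyCpostgres | grep "con${SESS_ID}" | grep -v '^D' | awk '{print $3}' |
  while read -r pid; do
      pstack "$pid"            > "$OUT/pstack.$pid"   # stack trace
      /usr/sbin/lsof -p "$pid" > "$OUT/lsof.$pid"     # open file handles
      gcore -o "$OUT/gcore" "$pid"                    # core file (gcore.<pid>)
  done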
