Often it is necessary to terminate a query that is 'hung' or where a session is either running for an excessively long period of time or the session, though active is no longer progressing.
Before terminating the processes associated with the query the following steps can be used to collect materials needed in most cases to provide a reason why it occurred.
Note: These steps are now automated in the analyze_session of the gpmt utility for versions prior to 7.x and the analyze_session of gpsupport utility in versions 7.x and above.
What to collect
1. Basic DB Queries
select * from pg_stat_activity;
select gp_segment_id, * from pg_locks; select * from gp_toolkit.gp_locks_on_relation; select * from gp_toolkit.gp_locks_on_resqueue;
select * from gp_toolkit.gp_resq_activity; select * from gp_toolkit.gp_resqueue_status
2. Basic process information
On all hosts (coordinator & segment), collect basic process information. Given that SESS_ID is the session ID of the hung session, collect the following output:
gpssh -f hostfile 'ps -elfy'
gpssh -f hostfile 'ps -flyCpostgres'
gpssh -f hostfile 'ps -flyCpostgres | grep conSESS_ID'
gpssh -f hostfile 'ps -flyCpostgres | grep conSESS_ID | grep ^D'
gpssh -f hostfile 'ps -flyCpostgres | grep conSESS_ID | grep ^R'
Important: If there are processes that are consistently in IO wait, please halt at this step and verify if there is any OS or hardware related issue (e.g. XFS file system bug, lost NFS mount, faulty RAID controller). Processes stuck in IO wait are often the cause of the 'hung' query and the underlying issue will need to be addressed before going any further.
3. What to Collect from Processes
Based on the above collections, collect information on the processes themselves. There are some caveats to be aware of:
Collect the following information on each segment host and the coordinator host.
ps -flyCpostgres | grep conSESS_ID | grep -v ^D | awk '{print $3}' | while read pid; do pstack $pid > `hostname`-pstack.$pid; done
ps -flyCpostgres | grep conSESS_ID | grep -v ^D | awk '{print $3}' | while read pid; do /usr/sbin/lsof -p $pid > `hostname`-lsof.$pid; done
ps -flyCpostgres | grep conSESS_ID | grep -v ^D | awk '{print $3}' | while read pid; do gcore -o `hostname`-gcore.$pid $pid; done
ps -flyCpostgres | grep conSESS_ID | grep ^R | awk '{print $3}' | xargs -n1 -I '{}' echo '(strace -p {} -o strace -ff ) & sleep 15 ; kill $!' > run_strace.sh
bash ./run_stace.sh