Analyze Session Tool Automates Debug Tracing for Hung Greenplum Sessions
search cancel

Analyze Session Tool Automates Debug Tracing for Hung Greenplum Sessions

book

Article ID: 295765

calendar_today

Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

Analyze_session is a sub-tool packaged with gpmt . Given a GPDB session ID, this utility will scan all segments and look for busy processes associated with that session. Analyze session will then proceed to start tracing the identified processes by running pstack, lsof, gcore/packcore, and strace. All the collected data will be put into a single tarfile and should be given to remote support.

Requirements and Caveats

  • lsof, strace, pstack, gcore, gdb must be installed on all hosts in order for this tool to successfully complete.
  • gcore will perform a memory dump of the GPDB process and the size could be anywhere from 300MB to serval GigaBytes. Isolating which hosts to collect using "-hostfile" can be key in saving space.



Environment


Resolution

Usage:

gpmt analyze_session [ -session SESSION_ID ] [ -master-dir DIRECTORY ]
        [ -segment-dir DIRECTORY ]
Description:
Given GPDB session id analyze_session will look for non-idle gpdb processes
    and collect diagnostic information.  The information collected will be used
    remotely for root cause analysis.
    The most common use case for this tool is when a GPDB query process
    appears hung.

  OPTIONS:
    -session           GPDB session id which is referenced in pg_stat_activity
    -master-dir        Working directory for master process
    -segment-dir       Working directory for segment processes
    -free-space        default=10  free space threshold which will abort data collection if reached
    -a                 Answer yes to all prompts

  EXAMPLES:
    Collect Process information for a given GPDB session id
        gpmt analyze_session -session 12345

    Disable prompts
        gpmt analyze_session -session 12345 -a

    Change master directory
        gpmt analyze_session -session 12345 -master-dir /data

    Change segment directory
            gpmt analyze_session -session 12345 -segment-dir /data1 


Simulating a use case

  1. Start by kicking off a query that will run for a very long time:
    create table Hung_analyzer_UnitTest_12345 as select generate_series(1,999999); select count(*) from Hung_analyzer_UnitTest_12345 a,Hung_analyzer_UnitTest_12345 b, Hung_analyzer_UnitTest_12345 c; drop table hung_analyzer_unittest_12345; 
     
  2. Then get the session ID from the database:
    gpadmin=# select sess_id, current_query from pg_stat_activity ;
     sess_id |                                                    current_query
    ---------+---------------------------------------------------------------------------------------------------------------------
           7 | select count(*) from Hung_analyzer_UnitTest_12345 a,Hung_analyzer_UnitTest_12345 b, Hung_analyzer_UnitTest_12345 c;
          21 | select sess_id, current_query from pg_stat_activity ;
    (2 rows) 
     
  3. Run analyze_session to collect information required for remote diagnoses:
    [gpadmin@mdw demo]$ gpmt analyze_session -session 7 
     
  4. The user will be prompted to confirm collection. This prompt offers two benefits:
    1. Let the users see how many processes we are going to collect. If there are thousands of processes found, then you may find the collection will be too large. We should then pass in the "-hostfile" option to gpmt in an effort to limit the collection size.
    2. Secondly, the prompt gives the user a high-level list of only the servers that have the busy processes. This can be useful if you want to quickly check which hosts are busy and which hosts are not. If there are very few busy hosts, then the problem might be processing skew or there is some environmental issue with the small subset of the host.
    The following will be collected:
    
    mdw: 3098 postgres: port  5432, gpadmin gpadmin 192.168.64.100(29054)  con7  192.168.64.100(29054) cmd28 SELECT
    sdw1.gpmt.local: 2881 postgres: port 40000, gpadmin gpadmin 192.168.64.100(50187)  con7  seg0 cmd29 slice3 MPPEXEC SELECT
    sdw1.gpmt.local: 2883 postgres: port 40001, gpadmin gpadmin 192.168.64.100(19324)  con7  seg1 cmd29 slice3 MPPEXEC SELECT
    sdw1.gpmt.local: 6877 postgres: port 40000, gpadmin gpadmin 192.168.64.100(51067)  con7  seg0 cmd29 slice1 MPPEXEC SELECT
    sdw1.gpmt.local: 6879 postgres: port 40001, gpadmin gpadmin 192.168.64.100(20204)  con7  seg1 cmd29 slice1 MPPEXEC SELECT
    sdw2.gpmt.local: 2881 postgres: port 40000, gpadmin gpadmin 192.168.64.100(33582)  con7  seg2 cmd29 slice3 MPPEXEC SELECT
    sdw2.gpmt.local: 2883 postgres: port 40001, gpadmin gpadmin 192.168.64.100(44039)  con7  seg3 cmd29 slice3 MPPEXEC SELECT
    sdw2.gpmt.local: 6793 postgres: port 40000, gpadmin gpadmin 192.168.64.100(34462)  con7  seg2 cmd29 slice1 MPPEXEC SELECT
    sdw2.gpmt.local: 6795 postgres: port 40001, gpadmin gpadmin 192.168.64.100(44919)  con7  seg3 cmd29 slice1 MPPEXEC SELECT
    
    Proceed with collection? Yy|Nn (default=Y): y
    Proceed with collection? Yy|Nn (default=Y): y
    
    Starting to collect segment process information
    Estimated Duration is 180 seconds
    |==================>                                |[%36]
     
     


Additional Information

Options

Isolate Specific segment hosts

gpmt  has a global option "-hostfile" that can be used to isolate which segment servers analyse_session should search. This can help in situations where there is a specific set of servers currently suspected or in cases where we don't need data from all nodes and a collection on a single node is more than adequate.

Process in IO wait or defunct

If analyze_session determines the process is stuck in IO wait or is defunct, then the collection will immediately stop and processes information will be presented to the user. Processes stuck in either of these states immediately suggest the root cause for the hung session.