How to troubleshoot general File system full conditions on Greenplum segment hosts

Article ID: 295264


Products

VMware Tanzu Greenplum

Issue/Introduction

This article outlines general example commands for troubleshooting and identifying the root cause of a file-system-full condition on Greenplum segment hosts.

Additional steps may be necessary for specific problem cases that are not fully covered in this document. This article serves as a living troubleshooting guide and may be updated as new case studies come in.

It is recommended to proceed with the steps in this article only up to the point that is applicable to the problem case at hand.

Symptoms:
You are experiencing a condition where a file system is running at, or close to, 100% usage on a Greenplum segment host.
This can manifest as a number of different problems, including the segment failure described in KB article 296119.

Resolution


All the steps below are executed from the master host.
Commands that require access to segment hosts are run through gpssh for simplicity.
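
The hostfile referenced in step 1 below is a plain-text file listing one cluster host per line; a minimal illustrative example (the host names are taken from the sample output that follows):

mdw
smdw
sdw1
...
sdw12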

1. Identify the file system with the highest consumption percentage on each host in the GPDB cluster

 
$ gpssh -f ~/hostfile "df -H | tail -n +2 | sort -k 5 -n -r | head -1"

The following example output shows that sdw7 is running 100% full on the /data2 file system:
 
[ sdw9] /dev/sdc1            9.9T  35G 9.8T  1% /data1
[sdw10] /dev/sdc1            9.9T 426G 9.5T  5% /data1
[sdw12] /dev/sdb1            9.9T 835G 9.1T  9% /data2
[ sdw1] /dev/sdb1            9.9T 840G 9.0T  9% /data2
[sdw11] /dev/sdc1            9.9T 2.6T 7.3T 27% /data1
[ sdw3] /dev/sdb1            9.9T 440G 9.4T  5% /data2
[ sdw7] /dev/sdb1            9.9T 9.9T  56M 100% /data2		<====
[ sdw4] /dev/sdb1            9.9T  36G 9.8T  1% /data2
[ sdw2] /dev/sdb1            9.9T 837G 9.1T  9% /data2
[ sdw6] /dev/sdb1            9.9T  36G 9.8T  1% /data2
[ sdw5] /dev/sdc1            9.9T  36G 9.8T  1% /data1
[  mdw] /dev/sdb1            3.3T 414G 2.9T 13% /data
[ smdw] /dev/sdb1            3.3T 5.5G 3.3T  1% /data
[ sdw8] /dev/sdc1            9.9T 435G 9.4T  5% /data1
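
To flag only file systems that have crossed a given usage threshold (80% in this illustrative sketch), the same df output can be filtered with awk instead of sorting:

$ gpssh -f ~/hostfile "df -H | tail -n +2 | awk '\$5+0 >= 80'"
## Prints only file systems at 80% usage or higher; adjust the threshold as needed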

2. Focusing on sdw7 from the example above, display the top 30 directories on the affected file system, sorted by size in descending order
 
$ HST=sdw7
## Modify HST to the host that is being investigated

$ gpssh -h $HST 'df -h'
## Confirm the file system saturation condition exists

$ FSYS=$(gpssh -h $HST "df | tail -n +2 | sort -k 5 -n -r | head -1" | awk '{print $NF}')
$ echo $FSYS
## Confirm the result points to the affected file system (e.g., /data2)

$ gpssh -h $HST "du -d 10 -h $FSYS 2> /dev/null | sort -h -r | head -30"
## See the example output below

The following example output shows that the database directory corresponding to database OID 16832376 accounts for the bulk of the space consumption across multiple segments on the affected host:
 
[ sdw7] 9.9T  /data2
[ sdw7] 5.1T  /data2/primary
[ sdw7] 4.8T  /data2/mirror
[ sdw7] 1.4T  /data2/primary/gpseg77/base
[ sdw7] 1.4T  /data2/primary/gpseg77
[ sdw7] 1.3T  /data2/primary/gpseg79/base
[ sdw7] 1.3T  /data2/primary/gpseg79
[ sdw7] 1.3T  /data2/primary/gpseg78/base
[ sdw7] 1.3T  /data2/primary/gpseg78
[ sdw7] 1.3T  /data2/mirror/gpseg36/base
[ sdw7] 1.3T  /data2/mirror/gpseg36
[ sdw7] 1.2T  /data2/primary/gpseg76/base
[ sdw7] 1.2T  /data2/primary/gpseg76
[ sdw7] 1.2T  /data2/mirror/gpseg29/base
[ sdw7] 1.2T  /data2/mirror/gpseg29
[ sdw7] 1.2T  /data2/mirror/gpseg22/base
[ sdw7] 1.2T  /data2/mirror/gpseg22
[ sdw7] 1.2T  /data2/mirror/gpseg15/base
[ sdw7] 1.2T  /data2/mirror/gpseg15
[ sdw7] 1008G  /data2/primary/gpseg77/base/16832376			<===
[ sdw7] 981G  /data2/primary/gpseg78/base/16832376			<===
[ sdw7] 978G  /data2/primary/gpseg79/base/16832376			<===
[ sdw7] 967G  /data2/mirror/gpseg36/base/16832376			<===
[ sdw7] 886G  /data2/mirror/gpseg15/base/16832376			<===
[ sdw7] 878G  /data2/mirror/gpseg29/base/16832376			...
[ sdw7] 845G  /data2/primary/gpseg76/base/16832376
[ sdw7] 821G  /data2/mirror/gpseg22/base/16832376
[ sdw7] 281G  /data2/mirror/gpseg36/base/18441260
[ sdw7] 280G  /data2/primary/gpseg78/base/18441260
[ sdw7] 280G  /data2/primary/gpseg77/base/18441260


Since the outlier directories in the example output above reside under gpseg*/base, the next step is to map 16832376 to an existing database OID.
It is not yet known, however, whether the space accumulation within each of these segment directories is caused by a small set of large tables or by a large number of small files.
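
A quick initial read can be obtained by counting the files inside one of the outlier directories (shown here with the example host and path from the output above); a very large count suggests many small files, while a small count points to a few large relations:

$ gpssh -h sdw7 "ls -1 /data2/primary/gpseg77/base/16832376 | wc -l"
## Number of relation files in the database directory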

 

3. Identify the database details with the following query

$ psql -c 'select oid, * from pg_database where oid = 16832376;'
## oid is a hidden system column in the catalog, so it is selected explicitly here
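
It can also help to list all databases with their OIDs, to cross-check the other large base/ subdirectories seen in the du output (such as 18441260 in the example):

$ psql -c 'select oid, datname from pg_database order by oid;'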


4. Use the following commands to break down and identify the largest files within gpseg77/base/16832376/, which, according to the output above, is consuming close to 1 TB.

The following commands are relevant only if a valid database was identified by the previous query, and they assume that each command completes successfully.

 

$ HST=sdw7
## Modify HST to the host that is being investigated

$ DDIR=/data2/primary/gpseg77/base/16832376
## Modify DDIR to the data directory being investigated (include only the path up to the database OID subdirectory under base, with no trailing slash)
 
$ gpssh -h $HST "du -sh $DDIR"

$ gpssh -h $HST "cd $DDIR && du -h * | sort -h -r | head -20"

Evaluate the output of the previous command to determine whether the space accumulation within 16832376 is caused by a single large file or a small number of skewed files.

Example output (only the top 20 largest files are shown):

[ sdw7] 672K	12601
[ sdw7] 640K	12556
[ sdw7] 512K	12554
[ sdw7] 480K	12632
[ sdw7] 352K	10069
[ sdw7] 320K	10068
[ sdw7] 288K	12617
[ sdw7] 288K	10013
[ sdw7] 256K	10010
[ sdw7] 224K	12606
[ sdw7] 224K	12563
[ sdw7] 224K	12543
[ sdw7] 192K	10056
[ sdw7] 192K	10011
[ sdw7] 160K	10016
[ sdw7] 160K	10012
[ sdw7] 128K	12583
[ sdw7] 96K	12801_fsm
[ sdw7] 96K	12794_fsm
[ sdw7] 96K	12787_fsm
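
Note that a single relation can span several on-disk files: for heap tables, data beyond 1 GB is split into numbered extents such as 12601.1, and auxiliary forks such as 12601_fsm may also exist. To total all files belonging to one relfilenode, a sketch reusing the HST and DDIR variables and the example value 12601:

$ gpssh -h $HST "cd $DDIR && du -ch 12601* | tail -1"
## Combined on-disk size of the relation's base file, extents, and forks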


5. Capture the relfilenode of the presumably largest file within gpseg77/base/16832376/.

According to the previous example output, BIGRFN should be set to 12601.
 
$ HST=sdw7
$ DDIR=/data2/primary/gpseg77/base/16832376

$ BIGRFN=$(gpssh -h $HST "cd $DDIR && du -h * | sort -h -r | head -1" | awk '{print $NF}')
$ echo $BIGRFN		       ## Biggest RelFileNode
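
If the largest entry happens to be an extent or fork file (e.g., 12601.1 or 12801_fsm), strip the suffix before querying the catalog, since pg_class stores only the base relfilenode; a small cleanup using shell parameter expansion:

$ BIGRFN=${BIGRFN%%.*}		## Drop a trailing .N extent suffix, if present
$ BIGRFN=${BIGRFN%%_*}		## Drop a trailing _fsm/_vm fork suffix, if present
$ echo $BIGRFN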

6. Now, find the schema and table details for $BIGRFN by querying pg_catalog:

Run the following commands, verifying that each variable (DBNAME, BIGRFN, and NSPOID) is set properly before moving on to the next command.
 
$ DBNAME=$(psql -Atc 'select datname from pg_database where oid = 16832376;')
 
$ echo $DBNAME

$ echo $BIGRFN
 
$ psql $DBNAME -c 'select * from pg_class where relfilenode = '$BIGRFN
 
 
$ NSPOID=$(psql $DBNAME -Atc 'select relnamespace from pg_class where relfilenode = '$BIGRFN)

## If the last two queries return 0 rows, proceed to step 7
 
$ echo $NSPOID		## Namespace (schema) OID
 
$ psql $DBNAME -c 'select * from pg_namespace where oid = '$NSPOID
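
The two catalog lookups above can also be combined into a single query that returns the schema and table names together; an equivalent sketch:

$ psql $DBNAME -c 'select n.nspname, c.relname from pg_class c join pg_namespace n on n.oid = c.relnamespace where c.relfilenode = '$BIGRFN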

7. Only if no record is found in pg_class in step 6 using the provided relfilenode:

The relfilenode mapping can differ between the master and the segment instances, in which case the relation name can only be obtained from the local catalog of the affected segment, which we identify from the data directory path:
 
$ psql $DBNAME -c "select * from gp_dist_random('pg_class') where gp_segment_id in (select content from gp_segment_configuration where datadir like '%"$(echo $DDIR | rev | cut -d '/' -f3-4 | rev)"%' ) and relfilenode = "$BIGRFN
 

## Verify that the last query returned a row, confirming that the relation has been identified


$ NSPOID=$(psql $DBNAME -Atc "select relnamespace from gp_dist_random('pg_class') where gp_segment_id in (select content from gp_segment_configuration where datadir like '%"$(echo $DDIR | rev | cut -d '/' -f3-4 | rev)"%' ) and relfilenode = "$BIGRFN)

$ echo $NSPOID		## Namespace (schema) OID
 
$ psql $DBNAME -c 'select * from pg_namespace where oid = '$NSPOID


If the schema and table names can be identified in step 6 or step 7, proceed to investigate the identified table's size and distribution to determine the cause of the space fill-up condition.
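
For example, once the schema and table names are known (shown here with the hypothetical names myschema.mytable), the table's overall size and its per-segment row distribution can be checked as follows:

$ psql $DBNAME -c "select pg_size_pretty(pg_total_relation_size('myschema.mytable'));"
## Total table size, including indexes and toast

$ psql $DBNAME -c 'select gp_segment_id, count(*) from myschema.mytable group by 1 order by 2 desc limit 10;'
## Segments holding the most rows; heavy imbalance indicates data skew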