This article outlines general example commands for troubleshooting and identifying the root cause of a file-system-full condition on Greenplum segment hosts.
Additional steps may be necessary for specific problem cases that are not fully covered in this document.
This article may continue to be updated as new case studies emerge, so that it can serve as a living troubleshooting guide.
It is recommended to follow the steps in this article only up to the point that applies to the problem case at hand.
Symptoms:
You are experiencing a condition where a file system is at, or close to, 100% usage on a Greenplum segment host.
This can manifest as various problems, including the segment failure described in KB article 296119.
1. Run the following command from the master to report the most-utilized file system on each host in the cluster:

$ gpssh -f ~/hostfile "df -H | tail -n +2 | sort -k 5 -n -r | head -1"
[ sdw9] /dev/sdc1  9.9T   35G  9.8T    1% /data1
[sdw10] /dev/sdc1  9.9T  426G  9.5T    5% /data1
[sdw12] /dev/sdb1  9.9T  835G  9.1T    9% /data2
[ sdw1] /dev/sdb1  9.9T  840G  9.0T    9% /data2
[sdw11] /dev/sdc1  9.9T  2.6T  7.3T   27% /data1
[ sdw3] /dev/sdb1  9.9T  440G  9.4T    5% /data2
[ sdw7] /dev/sdb1  9.9T  9.9T   56M  100% /data2 <====
[ sdw4] /dev/sdb1  9.9T   36G  9.8T    1% /data2
[ sdw2] /dev/sdb1  9.9T  837G  9.1T    9% /data2
[ sdw6] /dev/sdb1  9.9T   36G  9.8T    1% /data2
[ sdw5] /dev/sdc1  9.9T   36G  9.8T    1% /data1
[  mdw] /dev/sdb1  3.3T  414G  2.9T   13% /data
[ smdw] /dev/sdb1  3.3T  5.5G  3.3T    1% /data
[ sdw8] /dev/sdc1  9.9T  435G  9.4T    5% /data1
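The `tail -n +2 | sort -k 5 -n -r | head -1` pipeline keeps only the fullest mount per host by dropping the `df` header and sorting numerically (descending) on the Use% column. A minimal local sketch on sample `df` output (hypothetical sizes) illustrates the mechanics without needing a cluster:

```shell
# Sample 'df' output (hypothetical values) saved to a temp file so the
# pipeline can be demonstrated locally.
cat > /tmp/df_sample.txt <<'EOF'
Filesystem     Size Used Avail Use% Mounted on
/dev/sdc1      9.9T  35G  9.8T   1% /data1
/dev/sdb1      9.9T 9.9T   56M 100% /data2
EOF
# Drop the header, sort descending on the Use% column (field 5),
# keep the fullest entry, and print its mount point.
tail -n +2 /tmp/df_sample.txt | sort -k 5 -n -r | head -1 | awk '{print $NF}'
```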
2. Drill down on the affected host to find the largest directories on the affected file system. Note that the file system detection must run on the remote host (via gpssh), not on the master, so it is captured into a variable first:

$ HST=sdw7   ## Modify HST to the host that is being investigated
$ gpssh -h $HST 'df -h'   ## Confirm the file system saturation condition exists
$ FSYS=$(gpssh -h $HST "df | tail -n +2 | sort -k 5 -n -r | head -1" | awk '{print $NF}')
$ echo $FSYS   ## Confirm this points to the affected file system (e.g., /data2)
$ gpssh -h $HST "du -d 10 -h $FSYS 2> /dev/null | sort -h -r | head -30"   ## See example result below
[ sdw7] 9.9T  /data2
[ sdw7] 5.1T  /data2/primary
[ sdw7] 4.8T  /data2/mirror
[ sdw7] 1.4T  /data2/primary/gpseg77/base
[ sdw7] 1.4T  /data2/primary/gpseg77
[ sdw7] 1.3T  /data2/primary/gpseg79/base
[ sdw7] 1.3T  /data2/primary/gpseg79
[ sdw7] 1.3T  /data2/primary/gpseg78/base
[ sdw7] 1.3T  /data2/primary/gpseg78
[ sdw7] 1.3T  /data2/mirror/gpseg36/base
[ sdw7] 1.3T  /data2/mirror/gpseg36
[ sdw7] 1.2T  /data2/primary/gpseg76/base
[ sdw7] 1.2T  /data2/primary/gpseg76
[ sdw7] 1.2T  /data2/mirror/gpseg29/base
[ sdw7] 1.2T  /data2/mirror/gpseg29
[ sdw7] 1.2T  /data2/mirror/gpseg22/base
[ sdw7] 1.2T  /data2/mirror/gpseg22
[ sdw7] 1.2T  /data2/mirror/gpseg15/base
[ sdw7] 1.2T  /data2/mirror/gpseg15
[ sdw7] 1008G /data2/primary/gpseg77/base/16832376 <===
[ sdw7] 981G  /data2/primary/gpseg78/base/16832376 <===
[ sdw7] 978G  /data2/primary/gpseg79/base/16832376 <===
[ sdw7] 967G  /data2/mirror/gpseg36/base/16832376 <===
[ sdw7] 886G  /data2/mirror/gpseg15/base/16832376 <===
[ sdw7] 878G  /data2/mirror/gpseg29/base/16832376 ...
[ sdw7] 845G  /data2/primary/gpseg76/base/16832376
[ sdw7] 821G  /data2/mirror/gpseg22/base/16832376
[ sdw7] 281G  /data2/mirror/gpseg36/base/18441260
[ sdw7] 280G  /data2/primary/gpseg78/base/18441260
[ sdw7] 280G  /data2/primary/gpseg77/base/18441260
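The deepest entries in the listing end in a database OID directory (under `base/`). To see at a glance which database OID dominates across segment directories, the paths can be reduced to their last component and tallied; a small local sketch using sample paths taken from the listing:

```shell
# Reduce each path to its trailing database OID and count occurrences.
# The sample paths mirror the du listing; the OID that appears most
# often (and biggest) is the database to investigate.
printf '%s\n' \
  /data2/primary/gpseg77/base/16832376 \
  /data2/primary/gpseg78/base/16832376 \
  /data2/mirror/gpseg36/base/18441260 |
awk -F/ '{print $NF}' | sort | uniq -c | sort -rn
```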
3. Identify the database details using the following query (the OID is the directory name found above):
$ psql -c 'select * from pg_database where oid=16832376;'
4. Use the following commands to break down and identify the largest files within gpseg77/base/16832376/, which, according to the output above, is consuming close to 1 TB.
The following commands are only relevant if a valid database was identified by the previous SQL query, and provided that all of the following commands run successfully.
$ HST=sdw7   ## Modify HST to the host that is being investigated
$ DDIR=/data2/primary/gpseg77/base/16832376   ## Modify DDIR to the data directory that is being investigated (** ONLY INCLUDE up to the database OID subdirectory under base, without trailing slashes)
$ gpssh -h $HST "du -sh $DDIR"
$ gpssh -h $HST "cd $DDIR && du -h * | sort -h -r | head -20"
Evaluate the output of the previous command to determine whether the space accumulation within 16832376 is caused by a single skewed file or a few skewed files.
Example output (showing only the top 20 largest files):
[ sdw7] 672K 12601
[ sdw7] 640K 12556
[ sdw7] 512K 12554
[ sdw7] 480K 12632
[ sdw7] 352K 10069
[ sdw7] 320K 10068
[ sdw7] 288K 12617
[ sdw7] 288K 10013
[ sdw7] 256K 10010
[ sdw7] 224K 12606
[ sdw7] 224K 12563
[ sdw7] 224K 12543
[ sdw7] 192K 10056
[ sdw7] 192K 10011
[ sdw7] 160K 10016
[ sdw7] 160K 10012
[ sdw7] 128K 12583
[ sdw7] 96K  12801_fsm
[ sdw7] 96K  12794_fsm
[ sdw7] 96K  12787_fsm
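Note that `sort -h` (human-numeric sort) is what makes this size ordering correct: it compares human-readable suffixes (K < M < G < T), whereas plain `sort -n` would rank `981G` above `1.4T`. A quick local illustration with made-up entries (the relfilenode numbers here are hypothetical):

```shell
# 'sort -h' orders human-readable sizes correctly; plain 'sort -n'
# would compare only the leading number. Entries are made up.
printf '%s\n' '96K 12801_fsm' '981G 16832401' '1.4T 16832399' '672K 12601' |
sort -h -r | head -1
```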
5. Capture the largest relfilenode under the database directory:

$ HST=sdw7
$ DDIR=/data2/primary/gpseg77/base/16832376
$ BIGRFN=$(gpssh -h $HST "cd $DDIR && du -h * | sort -h -r | head -1" | awk '{print $NF}')
$ echo $BIGRFN   ## Biggest RelFileNode
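Note that a file name picked up from the data directory may carry a 1 GB extent suffix (`.1`, `.2`, ...) or a free-space-map suffix (`_fsm`, as seen in the example output above), while pg_class stores only the bare relfilenode. A small sketch of stripping either suffix (the sample names are illustrative):

```shell
# Strip everything from the first '.' or '_' onward to recover the
# bare relfilenode from a data-directory file name (sample values).
for f in 16832376.3 12801_fsm 12601; do
  echo "${f%%[._]*}"
done
```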
6. Resolve the relation and its schema from the relfilenode:

$ DBNAME=$(psql -Atc 'select datname from pg_database where oid = 16832376;')
$ echo $DBNAME
$ echo $BIGRFN
$ psql $DBNAME -c 'select * from pg_class where relfilenode = '$BIGRFN
$ NSPOID=$(psql $DBNAME -Atc 'select relnamespace from pg_class where relfilenode = '$BIGRFN)   ## If the last two queries return 0 rows, proceed to step 7
$ echo $NSPOID   ## Namespace (schema) OID
$ psql $DBNAME -c 'select * from pg_namespace where oid = '$NSPOID
7. If the previous queries returned 0 rows, look up the relfilenode in the segment's own catalog using gp_dist_random:

$ psql $DBNAME -c "select * from gp_dist_random('pg_class') where gp_segment_id in (select content from gp_segment_configuration where datadir like '%"$(echo $DDIR | rev | cut -d '/' -f3-4 | rev)"%' ) and relfilenode = "$BIGRFN   ## Validate that the last query returned a result, i.e., the relation has been identified
$ NSPOID=$(psql $DBNAME -Atc "select relnamespace from gp_dist_random('pg_class') where gp_segment_id in (select content from gp_segment_configuration where datadir like '%"$(echo $DDIR | rev | cut -d '/' -f3-4 | rev)"%' ) and relfilenode = "$BIGRFN)
$ echo $NSPOID   ## Namespace (schema) OID
$ psql $DBNAME -c 'select * from pg_namespace where oid = '$NSPOID
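The `rev | cut -d '/' -f3-4 | rev` fragment in the query above extracts the `<role>/<segment-directory>` portion of the data directory path (e.g. `primary/gpseg77`) so it can be matched against the segment's datadir in gp_segment_configuration. Shown standalone on the sample path:

```shell
# Reverse the path, cut fields 3-4 (counting from the end of the
# original string), and reverse back to get the two components
# sitting above base/<oid>.
DDIR=/data2/primary/gpseg77/base/16832376
echo "$DDIR" | rev | cut -d '/' -f3-4 | rev
```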