How to use python to sort the table by size using the `ls -l` output (Data Skew issue related)

Article ID: 295163

Updated On: 10-17-2023

Products

VMware Tanzu Greenplum

Issue/Introduction

This article describes a Python tool that adds up the real on-disk size of each table's relfilenode from `ls -l` output and lists the results in descending order of size.


Resolution

When investigating data skew on a customer cluster, we often need to find out which table uses the most disk space. In GPDB, the data files under the ./base directory have names such as 6877360, 6877360.1, 6877360.2 or 6877360_fsm, where 6877360 is the relfilenode in pg_class. Because a single table is spread over the base file plus the extra .1, .2 files, it is very hard to tell the total size of table 6877360 with Linux commands like `ls -lSr`.

In addition, running `ls -lSr` is often quite slow on customer clusters. This tool takes plain `ls -l` output instead, sums the size of all files that belong to the same relfilenode, and sorts the relfilenodes by total size in descending order.
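The grouping idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool itself: the helper names `relfilenode_of` and `total_sizes` and the sample sizes are made up for this example, and unlike the tool the sketch does not require the base file to be present.

```python
import re

# Hypothetical helper pattern: an optional segment suffix (.1, .2, ...)
# or fork suffix (_fsm) after the numeric relfilenode.
_NAME = re.compile(r'^([0-9]+)(?:\.[0-9]+|_fsm)?$')

def relfilenode_of(file_name):
    """Return the relfilenode a data file belongs to, or None for other files."""
    m = _NAME.match(file_name)
    return m.group(1) if m else None

def total_sizes(sizes):
    """Sum per-file sizes (name -> bytes) into per-relfilenode totals."""
    totals = {}
    for name, size in sizes.items():
        node = relfilenode_of(name)
        if node is not None:
            totals[node] = totals.get(node, 0) + size
    return totals

# Made-up sizes, for illustration only
sample = {"6877360": 1000, "6877360.1": 500, "6877360_fsm": 24, "PG_VERSION": 4}
print(total_sizes(sample))
```

All three files starting with 6877360 are counted under one key, while PG_VERSION is ignored because its name is not numeric.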

The source code of the tool:

#!/usr/bin/python

import argparse, re

parser = argparse.ArgumentParser(add_help=False)

parser.add_argument('-v', '--version', action='version',
                    version='%(prog)s 1.0', help="Show program's version number and exit.")

parser.add_argument('-h', '--help', action='help', default=argparse.SUPPRESS,
                    help='segment_file_size.py --input <input_filename_name> --output <output_file_name>')

parser.add_argument('-o', '--output', required=True,help="output file")

parser.add_argument('-i','--input', required=True,help='Input file')

args = parser.parse_args()
files_dic = {}       # every file name -> size in bytes
base_files_dic = {}  # relfilenode -> total size in bytes

# 6877360 (base file), 6877360.1 (extra segment file), 6877360_fsm (fork file)
pattern1 = re.compile(r'^[0-9]+$')
pattern2 = re.compile(r'^([0-9]+)\.([0-9]+)$')
pattern3 = re.compile(r'^([0-9]+)_(fsm|vm)$')

print("Processing the data")
with open(args.input) as ls_file:
    for line in ls_file:
        split_line = line.split()
        # An `ls -l` entry has 9 fields: the size is field 5, the name is field 9
        if len(split_line) > 8:
            key, value = split_line[8], int(split_line[4])
            files_dic[key] = value
            if pattern1.match(key):
                base_files_dic[key] = value

# Add the size of every segment file and fork file to its base file
for key in files_dic:
    m = pattern2.match(key) or pattern3.match(key)
    if m:
        key_base = m.group(1)
        try:
            base_files_dic[key_base] += files_dic[key]
        except KeyError as e:
            print("The base file does not exist: %s" % e)

# Write the totals, largest first
with open(args.output, 'w') as output:
    for key in sorted(base_files_dic, key=base_files_dic.get, reverse=True):
        output.write("Base_File: " + key + " File_size: " + str(base_files_dic[key]) + "\n")

print("Done!")

 

Options

The tool takes two options:

-i : The input file, which contains the `ls -l` output of the segment data directory. Remove the Linux command-line prompt lines from the file so that only the `ls -l` output remains.

-o : The output file that the sorted results are written to.
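To show what the input file is expected to look like, here is a small sketch of how one `ls -l` line breaks down into fields. It assumes the default long format with nine whitespace-separated fields; the function name `parse_ls_line` and the sample lines (owner, date, size) are invented for illustration and are not taken from a real cluster.

```python
def parse_ls_line(line):
    """Return (file_name, size_in_bytes) for an `ls -l` entry, or None for other lines."""
    fields = line.split()
    # Assumed layout: mode, links, owner, group, size, month, day, time, name
    if len(fields) < 9 or not fields[4].isdigit():
        return None
    return fields[8], int(fields[4])

# Made-up lines, for illustration only
print(parse_ls_line("-rw------- 1 gpadmin gpadmin 1073741824 Oct 17 10:00 6877360"))
print(parse_ls_line("total 123456"))
```

Lines that do not have this shape, such as the leading `total` line, yield None in this sketch. A different `ls` time style changes the number of fields, so the field positions here are an assumption about the default format.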


Expected results:

jiangal-a01:Hang_Zhou alexjiang$ ./segment_file_size.py -i segment_ls/gpseg64_ls.txt -o seg64_sorted_python.out
Processing the data
Done!
jiangal-a01:Hang_Zhou alexjiang$ head seg64_sorted_python.out
Base_File: 6876553 File_size: 100978294784
Base_File: 6873958 File_size: 28821225472
Base_File: 6875729 File_size: 23318757376
Base_File: 6878940 File_size: 11270455296
Base_File: 6877572 File_size: 10212245504
Base_File: 6479517 File_size: 10053605072
Base_File: 6874880 File_size: 8259502080
Base_File: 6877360 File_size: 7991230464
Base_File: 6878843 File_size: 7864254464
Base_File: 6879858 File_size: 7576256512
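The sizes in the output file are in bytes. If a more readable unit is preferred, a result line could be post-processed along these lines. This is only a sketch: `human_size` and `parse_result_line` are hypothetical helpers that are not part of the tool, and binary units (1 GiB = 1024**3 bytes) are an assumption.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (1 GiB = 1024**3 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit that fits, or at the largest unit available
        if size < 1024 or unit == "TiB":
            return "%.2f %s" % (size, unit)
        size /= 1024

def parse_result_line(line):
    """Split 'Base_File: <relfilenode> File_size: <bytes>' into (relfilenode, bytes)."""
    parts = line.split()
    return parts[1], int(parts[3])

node, size = parse_result_line("Base_File: 6876553 File_size: 100978294784")
print("%s %s" % (node, human_size(size)))
```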