This article describes how to use a Python tool to find the real table sizes, sorted in descending order.
Sometimes when we investigate skew issues on the customer side, we need to figure out which table has used the most disk space. In GPDB, the data files under the ./base directory are named like 6877360, 6877360.1, 6877360.2, or 6877360_fsm (6877360 is the relfilenode in pg_class). Because a large table is split across those extra .1, .2, ... segment files, Linux commands like `ls -lSr` cannot tell you the total size of table 6877360.
Besides that, running `ls -lSr` is often quite slow on customer clusters. So we came up with this tool, which takes `ls -l` output, sums the file sizes per relfilenode, and sorts the totals in descending order.
The source code of the tool:
```python
#!/usr/bin/python
import argparse
import re

parser = argparse.ArgumentParser(add_help=False)
parser.add_argument('-v', '--version', action='version', version='%(prog)s 1.0',
                    help="Show program's version number and exit.")
parser.add_argument('-h', '--help', action='help', default=argparse.SUPPRESS,
                    help='segment_file_size.py --input <input_file_name> --output <output_file_name>')
parser.add_argument('-o', '--output', required=True, help='Output file')
parser.add_argument('-i', '--input', required=True, help='Input file')
args = parser.parse_args()

files_dic = {}
base_files_dic = {}

# A bare relfilenode, e.g. 6877360
pattern1 = re.compile(r'^[0-9]+$')
# An extension segment, e.g. 6877360.1
pattern2 = re.compile(r'(^[0-9]+)(\.)([0-9]+$)')
# A free space map file, e.g. 6877360_fsm
pattern3 = re.compile(r'(^[0-9]+)(_)(fsm$)', re.IGNORECASE)

print("Processing the data")
with open(args.input) as ls_file:
    for line in ls_file:
        split_line = line.split()
        # An `ls -l` entry has at least 9 fields: the size is
        # field 5 and the file name is field 9 (0-indexed: 4 and 8).
        if len(split_line) > 8:
            key, value = split_line[8], int(split_line[4])
            files_dic[key] = value
            if pattern1.match(key):
                base_files_dic[key] = value

# Add each .N / _fsm segment file's size to its base relfilenode's total.
for key in files_dic:
    m = pattern2.match(key) or pattern3.match(key)
    if m:
        key_base = m.group(1)
        try:
            base_files_dic[key_base] += files_dic[key]
        except KeyError as e:
            print("The base file does not exist", e)

with open(args.output, 'w') as output:
    for key in sorted(base_files_dic, key=base_files_dic.get, reverse=True):
        output.write("Base_File: " + key + " File_size: " +
                     str(base_files_dic[key]) + "\n")
print("Done!")
```
The tool takes two options:
-i: The input file containing the `ls -l` output of the segment data directory. Remove any shell prompt lines from the output before saving it.
-o: The file the sorted result will be written to.
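To illustrate the core aggregation logic on its own, here is a minimal sketch. The `ls -l` lines and sizes below are made up purely for demonstration, and the single combined regex is a simplification of the three patterns used in the full script:

```python
import re

# Hypothetical `ls -l` lines for demonstration; field 5 is the size,
# field 9 is the file name (0-indexed: 4 and 8).
sample_ls_output = [
    "-rw------- 1 gpadmin gpadmin 1073741824 Jan  1 00:00 6877360",
    "-rw------- 1 gpadmin gpadmin 1073741824 Jan  1 00:00 6877360.1",
    "-rw------- 1 gpadmin gpadmin  524288000 Jan  1 00:00 6877360.2",
    "-rw------- 1 gpadmin gpadmin      16384 Jan  1 00:00 6877360_fsm",
]

# Matches a base relfilenode, a .N extension segment, or an _fsm file,
# capturing the base relfilenode in group 1.
segment_pattern = re.compile(r'^([0-9]+)(\.[0-9]+|_fsm)?$', re.IGNORECASE)

def total_sizes(lines):
    """Sum per-relfilenode sizes across the base file and its .N/_fsm parts."""
    totals = {}
    for line in lines:
        fields = line.split()
        if len(fields) > 8:
            m = segment_pattern.match(fields[8])
            if m:
                base = m.group(1)
                totals[base] = totals.get(base, 0) + int(fields[4])
    return totals

print(total_sizes(sample_ls_output))
# All four files roll up into the single relfilenode 6877360
# with a total of 2671788032 bytes.
```

Sorting `totals` by value in descending order then gives the per-table ranking that `ls -lSr` cannot provide.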
An example run:

```
jiangal-a01:Hang_Zhou alexjiang$ ./segment_file_size.py -i segment_ls/gpseg64_ls.txt -o seg64_sorted_python.out
Processing the data
Done!
jiangal-a01:Hang_Zhou alexjiang$ head seg64_sorted_python.out
Base_File: 6876553 File_size: 100978294784
Base_File: 6873958 File_size: 28821225472
Base_File: 6875729 File_size: 23318757376
Base_File: 6878940 File_size: 11270455296
Base_File: 6877572 File_size: 10212245504
Base_File: 6479517 File_size: 10053605072
Base_File: 6874880 File_size: 8259502080
Base_File: 6877360 File_size: 7991230464
Base_File: 6878843 File_size: 7864254464
Base_File: 6879858 File_size: 7576256512
```