Different methods to check for table/relation data skew in Greenplum



Article ID: 295161


Updated On: 04-17-2025

Products

VMware Tanzu Greenplum, VMware Tanzu Data Suite

Issue/Introduction

This article describes the different options to check data skew on tables.

Resolution

There are several methods to check for data skew on a table/relation.

Method 1

This is the most common and straightforward way to check for skew.

SELECT gp_segment_id, COUNT(*)
FROM <schema_name>.<table_name>
GROUP BY gp_segment_id
ORDER BY 1;

This counts the rows per segment and gives a good indication of the skew. It is useful when you want to know the data distribution of a single table or a short list of tables in the database.

The downside is that it physically counts the rows of one table at a time, so multiple commands or a script are needed to check the skew of all tables in the database.
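
The per-segment counts above can also be summarized in a single query. The sketch below (using the same placeholder table name) compares the largest segment's row count to the average; a result near 0% indicates an even distribution:

SELECT MAX(cnt) AS max_rows,
       AVG(cnt) AS avg_rows,
       100.0 * (MAX(cnt) - AVG(cnt)) / NULLIF(AVG(cnt), 0) AS pct_above_avg
FROM (
    SELECT gp_segment_id, COUNT(*) AS cnt
    FROM <schema_name>.<table_name>
    GROUP BY gp_segment_id
) AS per_seg;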

Method 2

To check the skew for all tables in the database, the views "gp_skew_coefficients" and "gp_skew_idle_fractions" under the gp_toolkit schema are a better option.

This method is based on the same principle as Method 1. Refer to the "Distribution and Skew" section of the Greenplum documentation to learn more about these views, and see the article How is the Value "skccoeff" Under "gp_toolkit.gp_skew_coefficients" Calculated for additional information.

Querying these views can be time-consuming if there is a large number of relations, or if those relations are large, because every table must be scanned.
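
As a sketch, the views can be queried directly. For example, to list the relations with the highest skew coefficient (column names as documented for gp_toolkit):

SELECT skcnamespace, skcrelname, skccoeff
FROM gp_toolkit.gp_skew_coefficients
ORDER BY skccoeff DESC
LIMIT 20;

A similar query against gp_toolkit.gp_skew_idle_fractions returns siffraction, the fraction of the system that is idle during a scan of the table; as a rough rule of thumb, values above about 0.1 (10%) are worth investigating.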

Method 3

A quicker alternative is to check the table/relation OS file size on all segments. This method is described in detail in the article Checking Data Skew (Alternative Method) and is much faster when checking the data skew of all tables in the entire database.

Once you identify a variation in data size (OS file size) across segments, you can use Method 1 to further understand how the data (records) is distributed across those segments.

The downside of this method is that, if the table is heavily bloated, the space occupied by invalid (dead) tuples is also counted: DML operations do not release those pages back to the OS, so the file size remains unchanged even though the amount of live data is smaller.
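
As a sketch of the same idea from within the database, Greenplum's gp_dist_random() function dispatches a query to every primary segment, so per-segment file sizes can be compared without inspecting the OS directly (placeholder table name as above; verify this idiom on your Greenplum version before relying on it, and note that it has the same bloat caveat described above):

SELECT gp_segment_id,
       pg_relation_size('<schema_name>.<table_name>') AS size_bytes
FROM gp_dist_random('gp_id')
ORDER BY 1;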