What are the Different Methods to Check for the Table/Relation Data Skew on Tanzu Greenplum?
search cancel

What are the Different Methods to Check for the Table/Relation Data Skew on Tanzu Greenplum?

book

Article ID: 295161

calendar_today

Updated On:

Products

VMware Tanzu Greenplum Greenplum Pivotal Data Suite Non Production Edition VMware Tanzu Data Suite VMware Tanzu Data Suite

Issue/Introduction

This article describes the different options to check data skew on tables.

 


Resolution

The different method to check for data skew on a table/relation.

Method 1

This is the most common and straightforward way to check for skew.

SELECT gp_segment_id, COUNT(*)
FROM <schema_name>.<table_name>
GROUP BY gp_segment_id
ORDER BY 1;

This will count the rows per segment and give a good indication of the skew. This is very useful if you want to know the data distribution of a single table or a list of tables in the database.

The downside to this is, it physically counts the rows for one table at a time. So multiple commands or a script is needed to check for skew in all tables in the database.
 

Method 2

To check the skew for all tables in the database, the views "gp_skew_coefficients" & "gp_skew_idle_fractions" under the gp_toolkit schema is a better option.

The method takes in the same principle as method 1, refer to the "Distribution and Skew" to understand more on the views and refer to the article for additional information.

This view can be really time-consuming if there is a large number of relations and the data size of those relations is large (as it has to scan a large number of blocks ).
 

Method 3

The alternative method of quickly checking the data skew is by checking the table/relation OS file size on all the segment, this method is described in details in the article Checking Data Skew (Alternative Method)  and is much quicker when checking the data skew of all tables in the entire database.

Once you identify the variation of data size (OS size) you can use the method 1 to further understand how the data (records) is distributed across various segments.

The downside of this method, is if the table is heavily bloated then the invalid tuple size is also taken into consideration since the DML doesn't release the invalid OS pages hence the OS file size remain unaltered.

Additional Information

+ Environment:

Pivotal Greenplum Database (GPDB)