Different Options to Remove Bloat from a Table on Tanzu Greenplum

Products

VMware Tanzu Greenplum

Issue/Introduction

The document describes various options to remove bloat in database tables.

NOTE: analyze the table after any of the following activities to ensure the table statistics are up-to-date to allow the planner/optimizer to generate better plans.

To understand more about the table bloat please refer to the articles:

How to deal with Bloat in Heap tables in VMware Tanzu Greenplum

How to Determine the Size of the Table without Bloat to identify

Resolution

Below are the various ways to remove the table bloat.

1. VACUUM FULL

This can be slow, especially in version 5.x and below. In 5.x and below the VACUUM FULL command moves the data within the file. In 6.x and above the data is written without bloat to a new file and is much faster, but can still take a long time to run on large tables.

The command to execute a VACUUM FULL on the bloated table is:

VACUUM FULL <schema-name>.<table-name>;

NOTE: Any bloated catalog table in the database (that is all tables under pg_catalog schema) must use this method, thus removing highly bloated catalog tables can be very time-consuming.

NOTE: Running "VACUUM" rather than "VACUUM FULL" regularly enough will avoid the need to run "VACUUM FULL". A "VACUUM" is much faster than a "VACUUM FULL". How often to run "VACUUM" depends on the rate of change to the data in the table. See KB article for more details

2. Redistribute

When redistributing the data in a table the database writes the data to a new file and once done removes the old file.

Refer to the article How to Remove Fragmentation (Bloat) on a Relation Using Redistribute on Tanzu Greenplum for details on how to redistribute a table to remove bloat.

NOTE: It is recommended to use the VACUUM command in Greenplum 6.x and above rather than the redistribute method.

3. CTAS

CTAS is another quick method to remove bloat and using this method helps to avoid the table lock (EXCLUSIVE LOCK) which the above two methods acquire to do the operation, thus allowing the table to be read by end users while maintenance is being performed on the main table. But to ensure data is not lost during the procedure, users should not be updating or inserting into the table.
The disadvantage is that it involves many steps to perform the activity. The steps involved are:

Obtain the DDL of the table using the steps mentioned in the KB article, using this helps you obtain all the sub-objects like INDEX that are involved with the relation and also it provides that list of grants provided to the users on the table.
Once the DDL is obtained, replace the <bloated-table-name> to <new-table-name> using the find/replace option with any editor of your choice and then execute the file on psql to create the object on the database.
Then follow the series of steps mentioned below:

INSERT INTO <schema-name>.<new-table-name> SELECT * FROM <schema-name>.<bloated-table-name>;

ALTER TABLE <schema-name>.<bloated-table-name> RENAME TO <old-table-name>;

ALTER TABLE <schema-name>.<new-table-name> RENAME TO <bloated-table-name>;

-- Once users confirm everything is good.

DROP TABLE <schema-name>.<old-table-name>;

4. Backup & Restore

This involves backing up the table and then restoring the back. The different tools that can be used to achieve this are:

gpbackup/gprestore
pg_dump / pg_restore
COPY .. TO .. / COPY .. FROM ..
use a combination of backup utility and psql as shown in this KB article if the amount of free disk space is low

Additional Information

Refer to Managing Bloat in a Database.