VMware Tanzu Greenplum "heap" tables are susceptible to bloat. Bloat can affect table scanning performance and therefore user query performance.
The following topics are covered in this article:
Table bloat is an accumulation of disk space used by deleted or changed data rows within the table data files. This space has been previously used by data rows which are deleted or changed and not accessible anymore. These rows that are no longer accessible are known as bloat. Failure to do table maintenance in order to allow reuse of this space causes table data file to grow bigger and therefore scans of the table take longer.
The VMware Tanzu Greenplum storage implementation (known as MVCC - MultiVersion Concurrency Control) is based on Postgres.
According to this implementation:
There are no in-place updates (updates are implemented by delete + insert) The row stays in the data file until space is marked as "free" space via the VACUUM command after no more transactions can potentially access it.
Once VACUUM registers the deleted row space as "free space", space can be reused by future inserts and updates. After rows are deleted (and no transactions can access them) and before VACUUM is executed, this space is not marked as free for reuse and is effectively "dead space".
Use the gp_toolkit.gp_toolkit.gp_bloat_diag view - this view shows tables with the moderate and significant amount of bloat
Columns:
In this example, the table "t1" is severely bloated (the calculated data size for data currently in the table is 1 page, but table consists of 97 pages):
gpadmin=# select * from gp_toolkit.gp_bloat_diag; bdirelid | bdinspname | bdirelname | bdirelpages | bdiexppages | bdidiag ----------+------------+------------+-------------+-------------+--------------------------------------- 21488 | public | t1 | 97 | 1 | significant amount of bloat suspected (1 row)
Use the gp_toolkit.gp_bloat_expected_pages - this view shows the page data (current pages/expected pages) for all tables in the database.
Columns:
In this example the tables are shown all have calculated data size of 1 page and actual data file size 1 page. No bloat is detected.
gpadmin=# select * from gp_toolkit.gp_bloat_expected_pages limit 5; btdrelid | btdrelpages | btdexppages ----------+-------------+------------- 10789 | 1 | 1 10794 | 1 | 1 10799 | 1 | 1 5004 | 1 | 1 7175 | 1 | 1 (5 rows)
The VACUUM command is used to mark the deleted rows in table data files as "free space" available for reuse and therefore eliminate bloat. It is normal and healthy for a table with a high rate of UPDATE/DELETE workload to have some amount of deleted rows/free space, which will be reused as new data is written to the table. The regular use of VACUUM ensures that deleted rows are marked as free space as soon as possible and reused.
In certain cases, when VACUUM has not been run for a long time and a huge amount of deleted rows (dead rows) have accumulated in the data files, it comes to a point where the real table rows are just a small amount of the table data files (in other words - table is significantly bloated). In order to eliminate the extra reserved space (if it will never be reused anyway), the VACUUM FULL command can be executed. This command compacts the table data by moving it to the head of the data file(s) and truncates the unused space on the tail.
See KB article for methods to remove bloat from a table/database.
The following steps are recommended to mitigate bloat:
See "Routine System Maintenance Tasks" for further recommended maintenance tasks.
Environment:
VMware Tanzu Greenplum Database (GPDB) all versions