Resolving pg_toast errors in Postgres

Products

VMware Aria Suite

Issue/Introduction

This article provides a generic guide on how to solve pg_toast corruption on postgres DB. As a generic guide it does not take all possible cases, but it does cover the most frequent cases and gives you a better understanding of these kinds of issues.

Symptoms:
In the analytics and/or postgres logs you see similar errors:

ERROR: unexpected chunk number 1 (expected 0) for toast value 25693935 in pg_toast_16394; nested exception is org.postgresql.util.PSQLException: ERROR: unexpected chunk number 1 (expected 0) for toast value 25693935 in pg_toast_16394

Environment

Aria Operations 8.x

Cause

This means that there is a corrupted data on the Postgres database. Common causes of this are hardware failure, abrupt power down, and disk full.

Resolution

In any issues where there may be corruption in the database the recommended action is to restore from the last known good backup. The steps below are only used as a last option and may possibly result in a loss of data.

Prior to following any of the actions below, please ensure there is a backup or snapshot per How to take a Snapshot of VMware Aria Operations. You can skip step 11 of the KB in bringing up the cluster or if you are relying on a backup other than a snapshot you can bring the cluster down manually per Aria Operations cluster offline. We need the cluster offline to run the queries in the database. If you aren't comfortable with commands in the database please contact technical support for assistance.

First we need to find the source of the corrupted data (table and/or index).

From the analytics log you may see a select statement prior to the pg_toast error:

ERROR [Analytics Main Thread ] com.vmware.statsplatform.persistence.impl.ResourceDataServiceSqlImpl.fetchAll - error:
org.springframework.jdbc.UncategorizedSQLException: StatementCallback; uncategorized SQLException for SQL [select
resource_id, adapter_kind, resource_kind, resource_name, resource_key
, resource_flag, relationships, adapters, metric_ids
, credential_id, ciq_last_compute_time, ciq_bg_last_compute_time
, ciq_pre_last_compute_time, certificate_ids, creation_time, dt_enabled
, latitude, longitude, monitoring_interval, resource_kind_type
, resource_kind_sub_type, early_warning_alert_enabled, business_tier_info
, group_settings, information, last_maintenance_date, maintenance_expire_date
, migration_source, maintenance_schedule_uuid, internal_id
from resource
]; SQL state [XX000]; error code [0]; ERROR: unexpected chunk number 3 (expected 0) for toast value 1817849 in pg_toast_16394; nested exception is org.postgresql.util.PSQLException: ERROR: unexpected chunk number 3 (expected 0) for toast value 1817849 in pg_toast_16394

In this example we can see the corruption is related to the resource table in the repl database.

Now that we know which table or index is corrupted we will need to reindex to see if it resolves the issue.
1. Log into the primary node as root and bring up Postgres:
```
service vpostgres start
```
2. Switch to postgres user:
3. Log into the database:
4. Verify the table:
5. Reindex the pg_toast and resource table:
If the above query succeeds the indexes are good to go and there is no need to continue with this KB.

If it immediately returns an error for pg_toast the corruption is in the table. We will need to find the corrupted records and remove them by using a process of elimination.
1. Check the maximum id of the internal_id;
2. Dividing by half rounding to the nearest hundred we will attempt to select each half:
3. Depending which query fails we will need to divide by 2 or add it to run the query. For example if the first query fail we will run:
4. Continue this until you narrow it down to the corrupted record. Once you find the corrupt record verify its corrupt and then remove it with:
If there are multiple corrupt records you will need to re-run the queries and remove the records. Once the database has been cleared of all corruption, stop Postgres and start up all Aria Operations services.

Additional Information

Impact/Risks:
There will be data loss when removing corrupted records from the database.