VMware Data Services Manager (DSM) PostgreSQL Database transitions to NotOperational state with BackupLocationConnectivityIssue

Products

VMware Data Services Manager for VCF

Issue/Introduction

A managed PostgreSQL database cluster transitions to a CRITICAL alert level and a NotOperational state.

Client applications are unable to connect to the database, resulting in an outage.

Reviewing the PostgreSQL Cluster manifest (YAML) or running kubectl describe PostgresCluster <db-name> reveals the following condition statuses:

DatabaseEngineReady: False (Reason: postgres is not accepting connections normally)

WalArchiving: False (Reason: UnknownFailure)

AutomatedBackup: False (Reason: BackupLocationConnectivityIssue)

The error logs for the AutomatedBackup condition indicate a network timeout when connecting to the configured backup location (e.g., HostConnectError: timeout connecting to <S3-endpoint>:443).

Running df -h on the database VM shows the /data/vpgsql/pg_wal partition is at 100% capacity.
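
The condition pattern above can also be checked programmatically. The following is a minimal Python sketch, assuming the conditions have been exported as a list of objects with type, status, and reason fields (the common Kubernetes condition layout). The function name and the sample data are hypothetical and are not part of DSM.

```python
# Hypothetical helper, not part of DSM. Assumes each condition is a dict
# shaped like {"type": ..., "status": ..., "reason": ...}.

# Condition types that must all be "False". A reason of None means any
# reason is accepted for that condition.
SIGNATURE = {
    "DatabaseEngineReady": None,
    "WalArchiving": None,
    "AutomatedBackup": "BackupLocationConnectivityIssue",
}


def matches_backup_connectivity_outage(conditions):
    """Return True if the conditions match the pattern described in this article."""
    by_type = {c.get("type"): c for c in conditions}
    for cond_type, expected_reason in SIGNATURE.items():
        cond = by_type.get(cond_type)
        if cond is None or cond.get("status") != "False":
            return False
        if expected_reason is not None and cond.get("reason") != expected_reason:
            return False
    return True


# Sample data built from the condition statuses listed above.
sample = [
    {"type": "DatabaseEngineReady", "status": "False",
     "reason": "postgres is not accepting connections normally"},
    {"type": "WalArchiving", "status": "False", "reason": "UnknownFailure"},
    {"type": "AutomatedBackup", "status": "False",
     "reason": "BackupLocationConnectivityIssue"},
]

print(matches_backup_connectivity_outage(sample))
```

A match only indicates that the symptoms are consistent with this article. Confirm the full pg_wal partition with df -h before acting on it.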

Environment

VMware Data Services Manager for VCF 9.x

Cause

This issue occurs when the database loses network connectivity to its configured backup storage endpoint (e.g., an S3 bucket).

VMware Data Services Manager uses pgBackRest for database backups and Write-Ahead Log (WAL) archiving. When the backup storage endpoint becomes unreachable, pgBackRest cannot offload the WAL files. To prevent data loss, the WAL files that have not yet been archived remain queued on the local disk.

If network connectivity is not restored, the queued WAL files eventually consume 100% of the available disk space in the /data/vpgsql/pg_wal partition. Once the disk is completely full, PostgreSQL can no longer write new WAL, so the engine halts to prevent database corruption, dropping all connections and transitioning to the NotOperational state.
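
To judge how much time remains before the partition fills, the free space can be divided by the rate at which WAL is being generated. The following is a minimal Python sketch of that arithmetic. The function names and the example numbers are illustrative assumptions, not values measured on a DSM database.

```python
import shutil

GIB = 1024 ** 3


def free_bytes(path):
    """Return the free bytes on the filesystem that contains path."""
    return shutil.disk_usage(path).free


def hours_until_full(free, wal_bytes_per_hour):
    """Estimate the hours until the disk is full at a constant WAL rate."""
    if wal_bytes_per_hour <= 0:
        return float("inf")  # no WAL growth, so the disk never fills
    return free / wal_bytes_per_hour


# Illustrative numbers only: 40 GiB free, 2 GiB of WAL generated per hour.
print(hours_until_full(40 * GIB, 2 * GIB))  # 20.0

# On the database VM, the free space could be read from the WAL partition:
#   hours_until_full(free_bytes("/data/vpgsql/pg_wal"), 2 * GIB)
```

The estimate assumes a constant write rate. A burst of write activity shortens the window.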

Resolution

To permanently resolve this issue, you must restore network connectivity between the DSM nodes and the configured backup storage endpoint. Once network routing is re-established, pgBackRest will automatically flush the queued WAL files to the backup target, freeing up local disk space and allowing the PostgreSQL engine to recover.
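
After a routing or firewall change, basic reachability of the backup endpoint can be verified with a TCP connection test. The following is a minimal Python sketch, assuming the endpoint listens on port 443 as in the example error message. The function name is hypothetical, and <S3-endpoint> must be replaced with the actual host name.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False


# Example call (placeholder host name, left elided on purpose):
#   can_reach("<S3-endpoint>", 443)
```

This test covers TCP reachability only. It does not validate TLS certificates, credentials, or bucket permissions, and it should be run from a machine on the same network path as the database VM.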

Workaround

If network connectivity cannot be restored immediately, you can use the following methods to temporarily restore database operations:

Option 1: Temporarily Expand Database Storage (Recommended Immediate Action)

Expanding the storage provides immediate breathing room for the database engine to transition out of the NotOperational state.

Log in to the vSphere Client or use the DSM API/manifest to increase the storage capacity of the affected database cluster's data disk.

Once the disk expands, PostgreSQL will have sufficient space to boot up and begin accepting connections while the root network issue is investigated.

Option 2: Modify WAL Archiving Parameters

If expanding the disk is not feasible, you can manually force the system to clear the WAL backlog.

Note: This action may break the continuous backup chain, because the dropped WAL files are never archived. A new full backup is required once network connectivity is restored.

Log in directly or use an SSH client to access the DSM Provider Appliance.

Edit the /data/pgbackrest.conf file.

Search for the property archive-push-queue-max.

Lower the value from its default (typically 60GiB). This forces pgBackRest to drop older WAL files, which immediately frees space on the local data disk.
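
Before editing the file, it can help to confirm where archive-push-queue-max is currently set. The following is a minimal Python sketch that reads the option without modifying anything, assuming the file uses the standard pgBackRest INI layout. The function name and the sample content are illustrative and were not copied from a DSM appliance.

```python
import configparser

OPTION = "archive-push-queue-max"


def find_queue_max(conf_text):
    """Return {section: value} for every section that sets archive-push-queue-max."""
    # strict=False tolerates repeated keys; interpolation=None keeps any
    # "%" characters in values literal.
    parser = configparser.ConfigParser(
        strict=False, interpolation=None, delimiters=("=",)
    )
    parser.read_string(conf_text)
    found = {}
    for section in parser.sections():
        if OPTION in parser[section]:
            found[section] = parser[section][OPTION]
    return found


# Illustrative content only.
sample = """
[global]
repo1-type=s3
archive-push-queue-max=60GiB

[global:archive-push]
compress-level=3
"""

print(find_queue_max(sample))  # {'global': '60GiB'}

# On the appliance, the real file could be read instead:
#   with open("/data/pgbackrest.conf") as f:
#       print(find_queue_max(f.read()))
```

The sketch only reports the current setting. Make the change itself by hand, as described in the steps above.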

Additional Information

For more information, see the following Techdoc:
Configure the Storage on a Data Disk in VMware Data Services Manager.