How to remediate an unhealthy postgres cluster.

Products

VCF Automation

Issue/Introduction

Sometimes, the PostgreSQL cluster may enter an unhealthy or inconsistent state where the primary node remains active while one or more replica nodes become stuck, or the leader node itself may remain in a stopping, starting, or restarting state.

As a result of this condition, the VCF Identity Broker service on the appliance may become unavailable or fail to function properly. Operations against the service may return "HTTP 500 Internal Server Errors", and the VCF Automation UI may display the message: "no healthy upstream"

This issue typically indicates instability or failure within the underlying PostgreSQL cluster services impacting dependent VCF components.

Environment

VCF 9.0
VCF 9.1

Cause

This issue may occur due to repeated PostgreSQL node failovers or underlying storage-related problems, which can leave the cluster in an inconsistent state.

In such scenarios, one or more replica nodes may fail to recover or rejoin the cluster automatically, resulting in instability of the PostgreSQL service and dependent VCF components.

Resolution

Step 1: Prerequisites

Ensure you have the following before running the script.

The IP address of a cluster control plane node (NODE_IP).
- For VCF management services, this can be identified via VCF Operations UI: Navigate to Build → Lifecycle → VCF Services Runtime to find Control Plane Node IPs.
- For VCF Automation, this can be IP of any nodes of that component.
The SSH password for the `vmware-system-user` account
The component name to diagnose (e.g., `vidb`, `vcfa`)
sshpass` installed on your local machine:
- macOS: `brew install hudochenkov/sshpass/sshpass`
- Ubuntu/Debian: `sudo apt-get install sshpass`
- RHEL/CentOS: `sudo yum install sshpass`

Note: The script SSH's into the node VM and uses the on-node kubeconfig automatically. No local kubeconfig setup is required.

Step 2: Check cluster health

The pg_cluster_health_and_remediation.sh script is designed to automate cluster health checks and resolve common Postgres issues.

pg_cluster_health_and_remediation.sh Expand source

Key Capabilities:

Health Monitoring: Identifies disk space usage and WAL bucket usage.
Automated Repairs: Recovers replicas stuck in problematic states (e.g., crashed, stopped, in archive recovery, start failed, restart failed, starting). It can also repair when leader is in starting, restarting, stopping state given that all other dependent services are up and running.

How to Execute: To run the script, provide the node VM IP, SSH password, and the component name.

./pg_cluster_health_and_remediation.sh <node_vm_ip> <ssh_password> <component_name>

Example:

./pg_cluster_health_and_remediation.sh #.#.#.# MyPassword123 vidb // for VIDB

./pg_cluster_health_and_remediation.sh #.#.#.# password vcfa // for VCFA

Step 3: Follow the prompt

The script will ask for confirmation before making any changes:

- Removing readiness probes

- Reinitializing failed replicas

- Restarting PostgreSQL services

Example 1: Here is an example output of a happy path when cluster was healthy

Happy path

pdixit@MHXPWP4FTH hack % ./pg_cluster_health_and_remediation.sh #.#.#.# $PASSWORD vidb
[INFO] Starting PostgreSQL Diagnosis - Remote Mode

[INFO] Configuration:
Node VM IP : #.#.#.#
Component Name : vidb
SSH User : vmware-system-user

[INFO] Testing SSH connectivity to #.#.#.#...
[INFO] SSH connectivity verified

[INFO] Executing diagnosis on remote node...
[INFO] This will run diagnosis on #.#.#.#

[INFO] Uploading diagnosis script to remote node...
[INFO] Executing diagnosis script on remote node...

[INFO] Retrieving namespace from component: vidb
[INFO] Found namespace: vidb-external

[INFO] Validating environment on remote node...
[INFO] Environment validation passed
[INFO] Namespace: vidb-external
[INFO] Kubeconfig: /etc/kubernetes/admin.conf

[INFO] Discovering PostgreSQL resources...
[INFO] Found PostgreSQL instance: vidb-postgres-instance
[INFO] Found StatefulSet: vidb-postgres-instance
[INFO] Replicas: 3/3 available

[INFO] Finding master pod...
[INFO] Using pod for diagnosis: vidb-postgres-instance-1

[INFO] === PostgreSQL Cluster Status ===

+ Cluster: vidb-postgres-instance (7606501767881945181) +-----------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------------------------+-------------+--------------+-----------+----+-----------+
| vidb-postgres-instance-0 | #.#.#.# | Sync Standby | streaming | 3 | 0 |
| vidb-postgres-instance-1 | #.#.#.# | Leader | running | 3 | |
| vidb-postgres-instance-2 | #.#.#.# | Sync Standby | streaming | 3 | 0 |
+--------------------------+-------------+--------------+-----------+----+-----------+

[INFO] === Cluster Health Analysis ===

[INFO] === Checking Disk Space on All Pods ===
[INFO] Checking disk space on pod: vidb-postgres-instance-0
[INFO] Disk Usage: 287M used / 25 total (2% used, 25 available)
[INFO] ✓ Disk space is healthy: 2% used

[INFO] Checking disk space on pod: vidb-postgres-instance-1
[INFO] Disk Usage: 2.3 used / 25 total (10% used, 23 available)
[INFO] ✓ Disk space is healthy: 10% used

[INFO] Checking disk space on pod: vidb-postgres-instance-2
[INFO] Disk Usage: 271M used / 25 total (2% used, 25 available)
[INFO] ✓ Disk space is healthy: 2% used

[INFO] ✓ All pods have sufficient disk space
[INFO] === SeaweedFS Bucket Check ===

[INFO] Checking SeaweedFS in namespace: vmsp-platform
[INFO] Found SeaweedFS master pod: seaweedfs-master-0
[INFO] StatefulSet seaweedfs-master: 1 ready / 1 current / 1 desired
[INFO] StatefulSet seaweedfs-filer: 1 ready / 1 current / 1 desired
[INFO] StatefulSet seaweedfs-volume: 3 ready / 3 current / 3 desired

[INFO] Checking bucket: vidb-external.db.vidb-postgres-instance.wal-backup
[INFO] Bucket usage: 0.16%
[INFO] ✓ Bucket usage is healthy

[INFO] Leader pods: 1
[INFO] Running pods: 1
[INFO] Streaming replicas: 2

[INFO] Leader pod identified: vidb-postgres-instance-1
[INFO] Leader state: running
[INFO] Replicas streaming: 2/2

[INFO] ✓ Database is healthy: Leader is running and all replicas are streaming
[INFO] ✓ Cluster appears healthy

[INFO] ✓ Cluster appears to be in good health

[INFO] Diagnosis completed successfully

========================================================================
[INFO] === DIAGNOSIS COMPLETED SUCCESSFULLY ===
========================================================================

Example 2: Here is an example output of a unhappy path when cluster was unhealthy and one of the replica was in start failed state

Unhappy path

pdixit@MHXPWP4FTH hack % kubectl exec -n vidb-external vidb-postgres-instance-0 -- bash -c "
# Kill postgres and patroni
pkill -9 postgres || true
pkill -9 patroni || true
sleep 2
# Corrupt critical files
rm -rf /home/postgres/pgdata/pgroot/data/pg_control
rm -rf /home/postgres/pgdata/pgroot/data/global/pg_control
rm -rf /home/postgres/pgdata/pgroot/data/pg_wal/*
"
Defaulted container "postgres" out of: postgres, metrics-exporter, walg
pdixit@MHXPWP4FTH hack % ./pg_cluster_health_and_remediation.sh #.#.#.# $PASSWORD vidb
[INFO] Starting PostgreSQL Diagnosis - Remote Mode