How to remediate an unhealthy postgres cluster.
search cancel

How to remediate an unhealthy postgres cluster.

book

Article ID: 440877

calendar_today

Updated On:

Products

VCF Automation

Issue/Introduction

Sometimes, the PostgreSQL cluster may enter an unhealthy or inconsistent state where the primary node remains active while one or more replica nodes become stuck, or the leader node itself may remain in a stopping, starting, or restarting state.

As a result of this condition, the VCF Identity Broker service on the appliance may become unavailable or fail to function properly. Operations against the service may return "HTTP 500 Internal Server Errors", and the VCF Automation UI may display the message: "no healthy upstream"

This issue typically indicates instability or failure within the underlying PostgreSQL cluster services impacting dependent VCF components.

Environment

VCF 9.0
VCF 9.1 

Cause

This issue may occur due to repeated PostgreSQL node failovers or underlying storage-related problems, which can leave the cluster in an inconsistent state.

In such scenarios, one or more replica nodes may fail to recover or rejoin the cluster automatically, resulting in instability of the PostgreSQL service and dependent VCF components.

Resolution

Step 1: Prerequisites

Ensure you have the following before running the script. 

  • The IP address of a cluster control plane node (NODE_IP).
    - For VCF management services, this can be identified via VCF Operations UI: Navigate to Build → Lifecycle → VCF Services Runtime to find Control Plane Node IPs.
    - For VCF Automation, this can be IP of any nodes of that component.
  • The SSH password for the `vmware-system-user` account
    The component name to diagnose (e.g., `vidb`, `vcfa`)
  • sshpass` installed on your local machine:
    - macOS: `brew install hudochenkov/sshpass/sshpass`
    - Ubuntu/Debian: `sudo apt-get install sshpass`
    - RHEL/CentOS: `sudo yum install sshpass`

Note: The script SSH's into the node VM and uses the on-node kubeconfig automatically. No local kubeconfig setup is required.

Step 2: Check cluster health 

The pg_cluster_health_and_remediation.sh script is designed to automate cluster health checks and resolve common Postgres issues.

pg_cluster_health_and_remediation.sh Expand source

Key Capabilities:

  • Health Monitoring: Identifies disk space usage and  WAL bucket usage.
  • Automated Repairs: Recovers replicas stuck in problematic states (e.g., crashed, stopped, in archive recovery, start failed,  restart failed, starting). It can also repair when leader is in starting, restarting, stopping state given that all other dependent services are up and running. 

How to Execute: To run the script, provide the node VM IP, SSH password, and the component name.

./pg_cluster_health_and_remediation.sh <node_vm_ip> <ssh_password> <component_name>

Example:

./pg_cluster_health_and_remediation.sh #.#.#.# MyPassword123 vidb // for VIDB
./pg_cluster_health_and_remediation.sh #.#.#.# password vcfa // for VCFA

Step 3: Follow the prompt 

The script will ask for confirmation before making any changes:

- Removing readiness probes

- Reinitializing failed replicas

- Restarting PostgreSQL services

Example 1: Here is an example output of a happy path when cluster was healthy 

Happy path

pdixit@MHXPWP4FTH hack % ./pg_cluster_health_and_remediation.sh  #.#.#.# $PASSWORD vidb
[INFO] Starting PostgreSQL Diagnosis - Remote Mode

[INFO] Configuration:
  Node VM IP     :  #.#.#.#
  Component Name : vidb
  SSH User       : vmware-system-user

[INFO] Testing SSH connectivity to  #.#.#.#...
[INFO] SSH connectivity verified

[INFO] Executing diagnosis on remote node...
[INFO] This will run diagnosis on  #.#.#.#

[INFO] Uploading diagnosis script to remote node...
[INFO] Executing diagnosis script on remote node...

[INFO] Retrieving namespace from component: vidb
[INFO] Found namespace: vidb-external

[INFO] Validating environment on remote node...
[INFO] Environment validation passed
[INFO] Namespace: vidb-external
[INFO] Kubeconfig: /etc/kubernetes/admin.conf

[INFO] Discovering PostgreSQL resources...
[INFO] Found PostgreSQL instance: vidb-postgres-instance
[INFO] Found StatefulSet: vidb-postgres-instance
[INFO] Replicas: 3/3 available

[INFO] Finding master pod...
[INFO] Using pod for diagnosis: vidb-postgres-instance-1

[INFO] === PostgreSQL Cluster Status ===

+ Cluster: vidb-postgres-instance (7606501767881945181) +-----------+----+-----------+
| Member                   | Host        | Role         | State     | TL | Lag in MB |
+--------------------------+-------------+--------------+-----------+----+-----------+
| vidb-postgres-instance-0 | #.#.#.# | Sync Standby | streaming |  3 |         0 |
| vidb-postgres-instance-1 | #.#.#.#  | Leader       | running   |  3 |           |
| vidb-postgres-instance-2 | #.#.#.# | Sync Standby | streaming |  3 |         0 |
+--------------------------+-------------+--------------+-----------+----+-----------+

[INFO] === Cluster Health Analysis ===


[INFO] === Checking Disk Space on All Pods ===
[INFO] Checking disk space on pod: vidb-postgres-instance-0
[INFO] Disk Usage: 287M used / 25 total (2% used, 25 available)
[INFO] ✓ Disk space is healthy: 2% used

[INFO] Checking disk space on pod: vidb-postgres-instance-1
[INFO] Disk Usage: 2.3 used / 25 total (10% used, 23 available)
[INFO] ✓ Disk space is healthy: 10% used

[INFO] Checking disk space on pod: vidb-postgres-instance-2
[INFO] Disk Usage: 271M used / 25 total (2% used, 25 available)
[INFO] ✓ Disk space is healthy: 2% used

[INFO] ✓ All pods have sufficient disk space
[INFO] === SeaweedFS Bucket Check ===

[INFO] Checking SeaweedFS in namespace: vmsp-platform
[INFO] Found SeaweedFS master pod: seaweedfs-master-0
[INFO] StatefulSet seaweedfs-master: 1 ready / 1 current / 1 desired
[INFO] StatefulSet seaweedfs-filer: 1 ready / 1 current / 1 desired
[INFO] StatefulSet seaweedfs-volume: 3 ready / 3 current / 3 desired

[INFO] Checking bucket: vidb-external.db.vidb-postgres-instance.wal-backup
[INFO] Bucket usage: 0.16%
[INFO] ✓ Bucket usage is healthy

[INFO] Leader pods: 1
[INFO] Running pods: 1
[INFO] Streaming replicas: 2

[INFO] Leader pod identified: vidb-postgres-instance-1
[INFO] Leader state: running
[INFO] Replicas streaming: 2/2

[INFO] ✓ Database is healthy: Leader is running and all replicas are streaming
[INFO] ✓ Cluster appears healthy

[INFO] ✓ Cluster appears to be in good health

[INFO] Diagnosis completed successfully

========================================================================
[INFO] === DIAGNOSIS COMPLETED SUCCESSFULLY ===
========================================================================

Example 2: Here is an example output of a unhappy path when cluster was unhealthy and one of the replica was in start failed state

Unhappy path

pdixit@MHXPWP4FTH hack % kubectl exec -n vidb-external vidb-postgres-instance-0 -- bash -c "
# Kill postgres and patroni
pkill -9 postgres || true
pkill -9 patroni || true
sleep 2
# Corrupt critical files
rm -rf /home/postgres/pgdata/pgroot/data/pg_control
rm -rf /home/postgres/pgdata/pgroot/data/global/pg_control
rm -rf /home/postgres/pgdata/pgroot/data/pg_wal/*
"
Defaulted container "postgres" out of: postgres, metrics-exporter, walg
pdixit@MHXPWP4FTH hack % ./pg_cluster_health_and_remediation.sh  #.#.#.# $PASSWORD vidb
[INFO] Starting PostgreSQL Diagnosis - Remote Mode

[INFO] Configuration:
  Node VM IP     :  #.#.#.#
  Component Name : vidb
  SSH User       : vmware-system-user

[INFO] Testing SSH connectivity to  #.#.#.#...
[INFO] SSH connectivity verified

[INFO] Executing diagnosis on remote node...
[INFO] This will run diagnosis on  #.#.#.#

[INFO] Uploading diagnosis script to remote node...
[INFO] Executing diagnosis script on remote node...

[INFO] Retrieving namespace from component: vidb
[INFO] Found namespace: vidb-external

[INFO] Validating environment on remote node...
[INFO] Environment validation passed
[INFO] Namespace: vidb-external
[INFO] Kubeconfig: /etc/kubernetes/admin.conf

[INFO] Discovering PostgreSQL resources...
[INFO] Found PostgreSQL instance: vidb-postgres-instance
[INFO] Found StatefulSet: vidb-postgres-instance
[INFO] Replicas: 3/3 available

[INFO] Finding master pod...
[INFO] Using pod for diagnosis: vidb-postgres-instance-1

[INFO] === PostgreSQL Cluster Status ===

+ Cluster: vidb-postgres-instance (7606501767881945181) +--------------+----+-----------+
| Member                   | Host        | Role         | State        | TL | Lag in MB |
+--------------------------+-------------+--------------+--------------+----+-----------+
| vidb-postgres-instance-0 |  #.#.#.# | Replica      | start failed |    |   unknown |
| vidb-postgres-instance-1 |  #.#.#.#  | Leader       | running      |  3 |           |
| vidb-postgres-instance-2 |  #.#.#.#  | Sync Standby | streaming    |  3 |         0 |
+--------------------------+-------------+--------------+--------------+----+-----------+

[INFO] === Cluster Health Analysis ===

[INFO] === Checking Disk Space on All Pods ===
[INFO] Checking disk space on pod: vidb-postgres-instance-0
[INFO] Disk Usage: 271M used / 25 total (2% used, 25 available)
[INFO] ✓ Disk space is healthy: 2% used

[INFO] Checking disk space on pod: vidb-postgres-instance-1
[INFO] Disk Usage: 2.3 used / 25 total (10% used, 23 available)
[INFO] ✓ Disk space is healthy: 10% used

[INFO] Checking disk space on pod: vidb-postgres-instance-2
[INFO] Disk Usage: 271M used / 25 total (2% used, 25 available)
[INFO] ✓ Disk space is healthy: 2% used

[INFO] ✓ All pods have sufficient disk space
[INFO] === SeaweedFS Bucket Check ===

[INFO] Checking SeaweedFS in namespace: vmsp-platform
[INFO] Found SeaweedFS master pod: seaweedfs-master-0
[INFO] StatefulSet seaweedfs-master: 1 ready / 1 current / 1 desired
[INFO] StatefulSet seaweedfs-filer: 1 ready / 1 current / 1 desired
[INFO] StatefulSet seaweedfs-volume: 3 ready / 3 current / 3 desired

[INFO] Checking bucket: vidb-external.db.vidb-postgres-instance.wal-backup
[INFO] Bucket usage: 0.16%
[INFO] ✓ Bucket usage is healthy

[INFO] Leader pods: 1
[INFO] Running pods: 1
[INFO] Streaming replicas: 1

[INFO] Leader pod identified: vidb-postgres-instance-1
[INFO] Leader state: running
[INFO] Replicas streaming: 1/2

[WARNING]   - Replica vidb-postgres-instance-0 is in remediable state: start failed
[WARNING]   - Replica vidb-postgres-instance-0 state: start failed
[WARNING] Leader is running but replicas are not all streaming (1/2)

[WARNING] Found 1 replica(s) in remediable state(s).

Do you want to reinitialize these replicas? (yes/no): yes <--- ASK FOR PERMISSION
[INFO] Cluster name: vidb-postgres-instance
[INFO] Reinitializing replica: vidb-postgres-instance-0 (current state: start failed)
[INFO] Using pod: vidb-postgres-instance-1
[INFO] Reinit command sent for vidb-postgres-instance-0. Waiting for it to reach streaming state (timeout: 20 minutes)...
[INFO]   vidb-postgres-instance-0 state: streaming (elapsed: 0 min)
[INFO] ✓ vidb-postgres-instance-0 is now streaming!

[INFO] === Final Cluster Status After Remediation ===
+ Cluster: vidb-postgres-instance (7606501767881945181) +-----------+----+-----------+
| Member                   | Host        | Role         | State     | TL | Lag in MB |
+--------------------------+-------------+--------------+-----------+----+-----------+
| vidb-postgres-instance-0 |  #.#.#.# | Replica      | streaming |  3 |         0 |
| vidb-postgres-instance-1 |  #.#.#.#  | Leader       | running   |  3 |           |
| vidb-postgres-instance-2 |  #.#.#.#  | Sync Standby | streaming |  3 |         0 |
+--------------------------+-------------+--------------+-----------+----+-----------+

[INFO] ✓ All replicas are now streaming. Database is healthy!
[INFO] ✓ Cluster appears healthy

[INFO] ✓ Cluster appears to be in good health

[INFO] Diagnosis completed successfully

========================================================================
[INFO] === DIAGNOSIS COMPLETED SUCCESSFULLY ===
========================================================================

If you have more than one replica in a remediable state, run the script again to remediate another replica.

Step 4: Verify cluster status

From the last few lines of the above output, you can see the database is healthy with 1 leader running and 2 replicas streaming.

Alternatively to Manually execute reinit commands on replica nodes: (applicable for versions such as vCF 9.0)

1. One you have the kubeconfig set, find the postgres leader pod

$ kubectl get pod -n prelude -lapplication=spilo,spilo-role=master

NAME             READY   STATUS    RESTARTS   AGE
vcfapostgres-1   3/3     Running   0          53m

The patronictl commands will be executed against the leader pod.

2. Verify that the postgres leader is Running. If it is not, it is not safe to perform the remediation.

$ kubectl exec vcfapostgres-1 -n prelude -- patronictl list

Defaulted container "postgres" out of: postgres, metrics-exporter, walg
+ Cluster: vcfapostgres (7545852715054780503) -+-----------+----+-----------+
| Member         | Host         | Role         | State     | TL | Lag in MB |
+----------------+--------------+--------------+-----------+----+-----------+
| vcfapostgres-0 | 198.#.#.#    | Replica      | starting  |    |   unknown | 
| vcfapostgres-1 | 198.#.#.#   | Leader       | running   |  2 |           |
| +--------------+--------------+--------------+-----------+----+-----------+

Here vcfapostgres-1 is the Leader and the State=running so we may continue.

NOTE: If both Replicas are listed, only Replicas NOT in the streaming state need to be re-initialized. If only one Replica is listed,
as in this example, re-initializing it will allow the second Replica to start once the first Replica starts.

3. Re-initialize the stuck Replica.

The form of the patronictl command is:  patronictl reinit vcfapostgres <replica name> --force

DO NOT SPECIFY THE LEADER NAME (vcfapostgres-1 in this example) OR THE POSTGRES CLUSTER WILL LIKELY BE BROKEN

$ kubectl exec vcfapostgres-1 -n prelude -- patronictl reinit vcfapostgres vcfapostgres-0 --force

4. Periodically execute Step 2 until the Replica is in the running state. At this point you will either reinit the other Replica or simply wait for it to start if it was not listed.

 

Attachments

pg_cluster_heath_and_remediation.sh get_app