Postgres pod fails to start in VMware Aria Automation Orchestrator cluster due to configuration file corruption



Article ID: 423655


Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

In a VMware Aria Automation Orchestrator (formerly vRealize Orchestrator) cluster, one or more nodes may become unavailable. Upon investigation, the postgres pod fails to enter a Running state. When checking the pod status via the CLI, you observe output similar to the following:

kubectl get pods -n prelude
NAME          READY   STATUS    RESTARTS   AGE
postgres-0    1/1     Running   0          108s
postgres-1    1/1     Running   0          108s
postgres-2    0/1     Error     4          108s
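Before making any changes, the failing pod can be inspected to confirm the symptom. A minimal sketch using standard kubectl commands (the pod name postgres-2 matches the example output above; adjust it to whichever pod shows the Error status in your environment):

```shell
# Show events for the failing pod (crash loops, probe failures, restarts)
kubectl describe pod postgres-2 -n prelude

# Review logs from the previous (crashed) container instance, where PostgreSQL
# typically reports the postgresql.conf parameter it could not parse
kubectl logs postgres-2 -n prelude --previous
```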

Environment

VMware Aria Automation Orchestrator 8.x

Cause

The postgresql.conf configuration file located at /data/live/postgresql.conf has become corrupted or contains invalid parameters. This corruption is often linked to storage latency or file system issues caused by maintaining multiple or aged virtual machine snapshots in the vCenter environment.
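To confirm this cause, the configuration file can be copied out of the pod and checked for obvious structural damage. The following is a sketch, not an official VMware tool: the `conf_corrupt` helper name is illustrative, and it only flags gross corruption (embedded NUL bytes, or lines that are neither comments, blanks, includes, nor `key = value` settings), which is typical of storage-level damage:

```shell
# Heuristic corruption check for a local copy of postgresql.conf.
# Returns 0 (success) when the file looks corrupted, 1 when it looks sane.
conf_corrupt() {
  local f="$1"
  # NUL bytes never appear in a valid text config; compare against a
  # NUL-stripped copy to detect them
  if ! tr -d '\0' < "$f" | cmp -s - "$f"; then
    return 0   # corrupted: NUL bytes present
  fi
  # Flag any line that is not blank, a comment, an include directive,
  # or a "key = value" setting
  if grep -vqE '^[[:space:]]*(#|$|include|[A-Za-z0-9_.]+[[:space:]]*=)' "$f"; then
    return 0   # corrupted: unparseable line found
  fi
  return 1     # looks structurally sane
}

# Usage: copy the file out of the affected pod first, e.g.
#   kubectl cp prelude/postgres-2:/data/live/postgresql.conf /tmp/postgresql.conf
#   conf_corrupt /tmp/postgresql.conf && echo "postgresql.conf looks corrupted"
```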

Resolution

To resolve this issue, you must restore the affected node to a functional state and re-sync it with the cluster:

  1. Revert Snapshot: Revert the affected Orchestrator node to the most recent known-good snapshot (e.g., a snapshot taken immediately following a successful upgrade).
  2. Remove Node: Remove the affected node from the Orchestrator cluster configuration.
  3. Re-join Node: Re-add the node to the cluster using the standard join process. This allows the node to synchronize its database state and configuration with the healthy peers.
  4. Snapshot Cleanup: Once the cluster is healthy and all pods are in a Running state, delete any old snapshots to prevent future storage latency and potential corruption.
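Step 4 depends on confirming that every pod is healthy before snapshots are deleted. The check from step 4 can be scripted as a small helper; this is a sketch, and the `all_running` function name is illustrative rather than part of any VMware tooling:

```shell
# Reads "kubectl get pods" output on stdin and succeeds only when every pod
# is fully ready (e.g. 1/1) and in the Running state.
all_running() {
  awk 'NR > 1 {
    split($2, r, "/")                              # READY column, e.g. "1/1"
    if (r[1] != r[2] || $3 != "Running") bad = 1   # not ready, or bad STATUS
  } END { exit bad }'
}

# Usage against the live cluster:
#   kubectl get pods -n prelude | all_running && echo "cluster healthy"
```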

Additional Information

If a valid snapshot is not available, the alternative resolution is to deploy a new node with the same name and IP address and join it to the existing cluster.