Recovering vIDM PostgreSQL Cluster from Orchestration Script Mismatch and WAL Gaps Due to Out-of-Order Patching

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Aria Suite Lifecycle (LCM) patch deployment fails during the "Remediate Postgres Cluster" or CSP patching stage.
PostgreSQL nodes report fatal WAL gaps: requested WAL segment has already been removed.
The auto-recovery.sh loop rapidly and repeatedly wipes the /db/data directory on replica nodes.
Logs (serverlog or postgresql.log) display path execution errors referencing /opt/vmware/vpostgres/14/, despite the cluster actively running the vPostgres 9.6 engine.
pcp_recovery_node operations fail with timeouts (default 90s) before large database transfers can achieve consistency.

Environment

VMware Identity Manager 3.3.7

VMware Aria Suite Lifecycle 8.18.0

Cause

This issue is triggered by an out-of-order patching sequence. During a failed or misaligned upgrade cycle, an LCM Day 2 operation prematurely updates the vIDM cluster's primary node with newer orchestration scripts designed for PostgreSQL 14.

Because the underlying database engine has not yet been upgraded from 9.6, these new scripts introduce PostgreSQL 14 path references into the 9.6 installation. This mismatch causes the cluster's high-availability and auto-recovery mechanisms to behave erratically, ultimately resulting in failed remote SSH executions and broken replication streams.

The specific orchestration files impacted are:

/db/data/recovery_1st_stage
/usr/local/etc/follow_master.sh
/usr/local/etc/failover.sh
/usr/local/etc/auto-recovery.sh
/usr/local/etc/aliases
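
Before restoring anything, it can help to confirm which of these files actually carry PostgreSQL 14 references. The following is a minimal sketch, assuming the mismatch is visible as the literal string /opt/vmware/vpostgres/14 inside the files (as the log errors suggest). The function name find_pg14_refs is hypothetical and not part of the appliance.

```shell
#!/bin/sh
# Sketch: list which orchestration files mention the PostgreSQL 14 path.
# Assumption: the mismatch shows up as the literal path string below.
PG14_PATH="/opt/vmware/vpostgres/14"

# find_pg14_refs FILE... : print each existing file that contains PG14_PATH.
find_pg14_refs() {
    for f in "$@"; do
        # Skip files that are absent on this node.
        [ -f "$f" ] || continue
        # -F treats the path as a fixed string, -q suppresses grep output.
        if grep -Fq "$PG14_PATH" "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, `find_pg14_refs /db/data/recovery_1st_stage /usr/local/etc/follow_master.sh /usr/local/etc/failover.sh /usr/local/etc/auto-recovery.sh /usr/local/etc/aliases` prints only the files that contain the 14 path. An empty result on a node that still runs 9.6 would suggest the scripts there were not overwritten.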

Resolution

Note: This scenario assumes the vIDM 3.3.7 cluster is still firmly on a 3.3.7 GA build with no CSP patches run against any nodes.

Prerequisites

Follow the KB article Emergency Cluster Bypass for VMware Identity Manager due to Out-of-Order Patching to temporarily restore the delegateIP to the working primary node in the broken cluster.
- This brings services up while the replica nodes are restored using the instructions below, so users can still access the environment before patching resumes in Phase 4.

Phase 1: Service Halt and Baseline Script Restoration

To resolve the version mismatch, replace the corrupted scripts on the primary node with unmodified baseline files from a General Availability (GA) environment.

Halt Cluster Services: Execute the following on all nodes in the cluster to prevent further data directory corruption.
```
service NetworkService stop
service pgService stop
```

Remove LCM_DISABLE_AUTO_RECOVERY:
```
rm -f /usr/local/etc/LCM_DISABLE_AUTO_RECOVERY
```

Source Baseline Files: Locate a known-good, baseline vIDM 3.3.7 GA cluster.
Replace Corrupted Files: Copy the following files from the GA primary node and overwrite the corrupted files on your impacted primary node:
- /db/data/recovery_1st_stage
- /usr/local/etc/follow_master.sh
- /usr/local/etc/failover.sh
- /usr/local/etc/auto-recovery.sh
- /usr/local/etc/aliases
Verify Permissions: Ensure the postgres user retains execute permission on the replaced files.
Inject Environment Credentials: Because the newly replaced recovery_1st_stage script was copied from a foreign cluster, its hardcoded authentication variables are invalid. Open /db/data/recovery_1st_stage with vi on the primary node and update the password string to match the current environment's pgpool database password.
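
Before overwriting the files in this phase, it may be worth keeping a copy of the current versions so they can be inspected later. The following is a minimal sketch under that assumption. The function names and the backup directory are hypothetical, and the permission check reflects the user running it, so it would need to run as postgres to answer the question for that user.

```shell
#!/bin/sh
# Sketch: back up files before overwriting them, then check execute bits.
# The backup location is an assumption; use any path with free space.

# backup_files DEST FILE... : copy each existing file into DEST, keeping
# mode and timestamps, and flattening the path into the file name.
backup_files() {
    dest=$1
    shift
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        # Skip files that do not exist on this node.
        [ -f "$f" ] || continue
        # Example: /usr/local/etc/failover.sh -> usr_local_etc_failover.sh
        name=$(printf '%s' "$f" | sed 's|^/||; s|/|_|g')
        cp -p "$f" "$dest/$name" || return 1
    done
}

# list_non_executable FILE... : print files that the current user
# cannot execute.
list_non_executable() {
    for f in "$@"; do
        [ -x "$f" ] || printf '%s\n' "$f"
    done
}
```

A possible invocation is `backup_files /root/vidm-script-backup /db/data/recovery_1st_stage /usr/local/etc/failover.sh`, followed by `list_non_executable /usr/local/etc/failover.sh /usr/local/etc/follow_master.sh` after the replacement. Files that are only sourced rather than executed may legitimately appear in the second list.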

Phase 2: Configuration of Replication Timeouts

The default 90-second Pgpool recovery timeout is often insufficient for databases larger than 10–20 GB. You may need to extend this timeout so that the pg_basebackup block transfer and subsequent WAL replay can finish before the pcp_recovery_node operation times out.

Edit Pgpool Configuration: On the primary node, open /usr/local/etc/pgpool.conf.
Adjust Timeout: Locate recovery_timeout and increase the value significantly (e.g., 600 or 1200 seconds).
```
recovery_timeout = 1200
```
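
The same change can be made non-interactively. The following is a minimal sketch, assuming recovery_timeout appears as a plain key = value line in the file. The function name set_recovery_timeout is hypothetical. It keeps a .bak copy next to the file and drops any trailing comment on the rewritten line.

```shell
#!/bin/sh
# Sketch: set recovery_timeout in a pgpool.conf-style file, keeping a backup.

# set_recovery_timeout FILE SECONDS
set_recovery_timeout() {
    conf=$1
    secs=$2
    # Refuse anything that is not a string of digits.
    case $secs in
        ''|*[!0-9]*) echo "invalid timeout: $secs" >&2; return 1 ;;
    esac
    [ -f "$conf" ] || { echo "not found: $conf" >&2; return 1; }
    # Keep a copy of the file as it was before the change.
    cp -p "$conf" "$conf.bak" || return 1
    if grep -Eq '^[[:space:]]*recovery_timeout[[:space:]]*=' "$conf.bak"; then
        # Rewrite the existing uncommented setting.
        sed -E "s|^[[:space:]]*recovery_timeout[[:space:]]*=.*|recovery_timeout = $secs|" \
            "$conf.bak" > "$conf"
    else
        # No active setting found: append one.
        printf 'recovery_timeout = %s\n' "$secs" >> "$conf"
    fi
}
```

For example, `set_recovery_timeout /usr/local/etc/pgpool.conf 1200` would apply the value shown above.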

Phase 3: Forced Re-Synchronization and Recovery

With the scripts aligned to 9.6 and the timeouts expanded, you must completely rebuild the replica nodes to clear any WAL gaps.

Sanitize Replica Targets (Replica Nodes): Clear the corrupted data directories.
```
rm -rf /db/data/*
```
Trigger Full Recovery (Primary Node): Execute the recovery command for each down node.
```
pcp_recovery_node -w -h localhost -U pgpool -p 9898 -n <node_id>
```
Attach Recovered Nodes (Primary Node): Once the recovery completes, attach the nodes back to the Pgpool cluster.
```
pcp_attach_node -w -h localhost -U pgpool -p 9898 -n <node_id>
```
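
When several replicas are down, the recovery and attach commands above can be wrapped in a loop. The following is a minimal sketch that reuses the exact options shown in this phase. The function names and the DRY_RUN variable are hypothetical. Treating an attach failure as a warning rather than a fatal error is a design choice of this sketch, since node state is verified in Phase 4 anyway.

```shell
#!/bin/sh
# Sketch: recover and then attach each listed node, one at a time.
# DRY_RUN=1 only prints the commands; nothing is executed.

PCP_OPTS="-w -h localhost -U pgpool -p 9898"

# run_cmd CMD ARG... : print the command in dry-run mode, else run it.
run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# recover_nodes NODE_ID...
recover_nodes() {
    for id in "$@"; do
        # $PCP_OPTS is intentionally unquoted so it splits into options.
        if ! run_cmd pcp_recovery_node $PCP_OPTS -n "$id"; then
            echo "recovery failed for node $id, stopping" >&2
            return 1
        fi
        # An attach failure is reported but does not stop the loop.
        run_cmd pcp_attach_node $PCP_OPTS -n "$id" ||
            echo "attach reported an error for node $id" >&2
    done
}
```

Running `DRY_RUN=1; recover_nodes 1 2` first shows the four commands that would be issued for nodes 1 and 2 without touching the cluster.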

Phase 4: Validation and Patch Resumption

Validate Node Status: Run the poolnodes command on each node and confirm that all nodes are in sync and each shows as UP.
Validate Streaming: Confirm the replica nodes are actively receiving data.
```
su - postgres -c "psql -x -c 'SELECT * FROM pg_stat_replication';"
```
Ensure the state column returns streaming for each replica.
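
To avoid reading the expanded output by eye, the streaming replicas can be counted. The following is a minimal sketch, assuming psql -x prints one "state | streaming" line per streaming replica. The function name count_streaming is hypothetical.

```shell
#!/bin/sh
# Sketch: count replicas in the "streaming" state from psql -x output.
# Assumption: expanded output has one "column | value" pair per line.

# count_streaming : reads psql -x output on stdin and prints the count.
count_streaming() {
    awk -F'|' '
        {
            key = $1
            val = $2
            # Trim surrounding whitespace from the column name and value.
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            # Match the state column only, not sync_state.
            if (key == "state" && val == "streaming") n++
        }
        END { print n + 0 }
    '
}
```

For example, piping the validation command above into `count_streaming` on the primary node should print the number of healthy replicas, which would be 2 in a three-node cluster.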

Validate Quorum: Verify all nodes are reporting Status 2 (ONLINE).
```
pcp_node_info -w -h localhost -U pgpool -p 9898 -n 0
pcp_node_info -w -h localhost -U pgpool -p 9898 -n 1
pcp_node_info -w -h localhost -U pgpool -p 9898 -n 2
```
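
The status check can also be scripted. The following is a minimal sketch, assuming pcp_node_info prints whitespace-separated fields with the numeric status in the third position (hostname, port, status, weight). Confirm that layout against the output on your appliance before relying on it. The function name node_is_up is hypothetical.

```shell
#!/bin/sh
# Sketch: decide whether a pcp_node_info output line reports status 2.
# Assumption: the third whitespace-separated field is the numeric status.

# node_is_up LINE : succeeds only if the third field is exactly "2".
node_is_up() {
    status=$(printf '%s\n' "$1" | awk '{ print $3 }')
    [ "$status" = "2" ]
}
```

For example, `node_is_up "$(pcp_node_info -w -h localhost -U pgpool -p 9898 -n 0)" || echo "node 0 is not at status 2"` flags a node that has not reached the expected state.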

Resume Patching: With the environment stabilized, WAL gaps cleared, and scripts aligned to spec, return to patching the vIDM appliances with the CSP-102547 Patch.