Aria Suite Lifecycle (LCM) patch deployment fails during the "Remediate Postgres Cluster" or CSP patching stage.
PostgreSQL nodes report fatal WAL gaps: requested WAL segment has already been removed.
The auto-recovery.sh loop rapidly and repeatedly wipes the /db/data directory on replica nodes.
Logs (serverlog or postgresql.log) display path execution errors referencing /opt/vmware/vpostgres/14/, despite the cluster actively running the vPostgres 9.6 engine.
pcp_recovery_node operations fail with timeouts (default 90s) before large database transfers can achieve consistency.
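The WAL-gap symptom can be confirmed by searching the PostgreSQL log for the error text quoted above. The following is a minimal sketch, not part of the product tooling. The function name count_wal_gap_errors is hypothetical, and the log file path is passed as an argument because this article does not state where serverlog or postgresql.log is stored.

```shell
#!/bin/sh
# Hypothetical helper: count WAL-gap errors in a PostgreSQL log file.
# The log path is an argument; this sketch makes no assumption about
# where serverlog or postgresql.log lives on a given node.
count_wal_gap_errors() {
    logfile="$1"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 2
    fi
    # grep -c prints the number of matching lines (0 when there are none).
    # "|| true" hides grep's non-zero exit status for the zero-match case.
    grep -c 'requested WAL segment .* has already been removed' "$logfile" || true
}

# Example call (the path is a placeholder, not a documented location):
# count_wal_gap_errors /path/to/serverlog
```

A non-zero count on a replica is consistent with the symptom described here, but it does not by itself identify the cause.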
VMware Identity Manager 3.3.7
VMware Aria Suite Lifecycle 8.18.0
This issue is triggered by an out-of-order patching sequence. During a failed or misaligned upgrade cycle, an LCM Day 2 operation prematurely updates the vIDM cluster's primary node with newer orchestration scripts designed for PostgreSQL 14.
Because the underlying database engine has not yet been upgraded from 9.6, these new scripts impose PostgreSQL 14 path references on the 9.6 installation. This mismatch causes the cluster's high-availability and auto-recovery mechanisms to behave erratically, ultimately resulting in failed SSH remote executions and collapsed replication streams.
The specific orchestration files impacted are:
/db/data/recovery_1st_stage
/usr/local/etc/follow_master.sh
/usr/local/etc/failover.sh
/usr/local/etc/auto-recovery.sh
/usr/local/etc/aliases
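To see which of these files actually carry PostgreSQL 14 path references, they can be searched for the path that appears in the log errors. This is a minimal sketch under stated assumptions: the function name list_pg14_refs is hypothetical, and the search string is the /opt/vmware/vpostgres/14 prefix taken from the symptom description.

```shell
#!/bin/sh
# Hypothetical helper: print each given file that contains a reference to
# the vPostgres 14 path named in the log errors.
# Assumption: the affected scripts contain this literal path prefix.
PG14_PATH='/opt/vmware/vpostgres/14'

list_pg14_refs() {
    for f in "$@"; do
        # -q: quiet, -F: fixed string (no regex), --: end of options
        if [ -f "$f" ] && grep -qF -- "$PG14_PATH" "$f"; then
            echo "$f"
        fi
    done
}

# Example call on the primary node, using the file list above:
# list_pg14_refs /db/data/recovery_1st_stage \
#     /usr/local/etc/follow_master.sh /usr/local/etc/failover.sh \
#     /usr/local/etc/auto-recovery.sh /usr/local/etc/aliases
```

Files that are missing are skipped silently, so an empty result means only that no listed file contains the string.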
To resolve the version mismatch, replace the corrupted scripts on the primary node with unmodified baseline files from a General Availability (GA) environment.
service NetworkService stop
service pgService stop
rm -f /usr/local/etc/LCM_DISABLE_AUTO_RECOVERY
Replace the following files on the primary node with the baseline copies:
/db/data/recovery_1st_stage
/usr/local/etc/follow_master.sh
/usr/local/etc/failover.sh
/usr/local/etc/auto-recovery.sh
/usr/local/etc/aliases
Ensure the postgres user retains execution permissions on the replaced files.
If the recovery_1st_stage script was copied from a foreign cluster, its hardcoded authentication variables are invalid. Open /db/data/recovery_1st_stage with vi on the primary node and update the password string to match the current environment's pgpool database password.
The default 90-second timeout in Pgpool is often insufficient for databases exceeding 10GB–20GB. You may need to extend this timeout to allow the pg_basebackup block transfer and subsequent WAL replay to finish before the pcp_recovery_node operation times out.
Edit /usr/local/etc/pgpool.conf. Locate recovery_timeout and increase the value significantly (e.g., 600 or 1200 seconds).
recovery_timeout = 1200
With the scripts aligned to 9.6 and the timeouts expanded, you must completely rebuild the replica nodes to clear any WAL gaps.
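The recovery_timeout change can also be scripted instead of edited by hand. The following is a minimal sketch, not a supported procedure. The function name set_recovery_timeout is hypothetical, and it assumes the setting appears as an uncommented "recovery_timeout = <value>" line; if no such line exists, it appends one.

```shell
#!/bin/sh
# Hypothetical helper: set recovery_timeout in a pgpool.conf-style file.
# Usage: set_recovery_timeout <conf_file> <seconds>
set_recovery_timeout() {
    conf="$1"
    seconds="$2"
    # Reject anything that is not a non-negative integer.
    case "$seconds" in
        ''|*[!0-9]*) echo "seconds must be an integer" >&2; return 1 ;;
    esac
    [ -f "$conf" ] || { echo "no such file: $conf" >&2; return 1; }

    # Keep a copy of the original before editing.
    cp -p "$conf" "$conf.bak" || return 1

    if grep -Eq '^[[:space:]]*recovery_timeout[[:space:]]*=' "$conf"; then
        # Rewrite the existing (uncommented) setting in place.
        sed -E "s/^[[:space:]]*recovery_timeout[[:space:]]*=.*/recovery_timeout = $seconds/" \
            "$conf.bak" > "$conf"
    else
        # No active setting found: append one.
        echo "recovery_timeout = $seconds" >> "$conf"
    fi
}

# Example call, using the path and value from the step above:
# set_recovery_timeout /usr/local/etc/pgpool.conf 1200
```

This sketch only edits the file and leaves the original as pgpool.conf.bak. Applying the new value to the running Pgpool service is outside its scope.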
rm -rf /db/data/*
pcp_recovery_node -w -h localhost -U pgpool -p 9898 -n <node_id>
pcp_attach_node -w -h localhost -U pgpool -p 9898 -n <node_id>
Check poolnodes on each node and ensure they are all in sync. Each node should show as UP.
su - postgres -c "psql -x -c 'SELECT * FROM pg_stat_replication';"
Ensure the state returns as streaming.
pcp_node_info -w -h localhost -U pgpool -p 9898 -n 0
pcp_node_info -w -h localhost -U pgpool -p 9898 -n 1
pcp_node_info -w -h localhost -U pgpool -p 9898 -n 2
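The streaming check on pg_stat_replication can be automated rather than read by eye. The following is a minimal sketch, assuming the expanded output format that psql -x produces (one "column | value" row per line). The function name all_streaming is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read expanded psql output (psql -x) on stdin and
# succeed only if at least one "state" row exists and every such row
# reads "streaming".
all_streaming() {
    awk '
        # Expanded rows look like: state | streaming
        # Matching $1 exactly keeps the sync_state column out of the check.
        $1 == "state" { rows++; if ($3 != "streaming") bad++ }
        END { if (rows > 0 && bad == 0) exit 0; exit 1 }
    '
}

# Example call, using the query from the steps above:
# su - postgres -c "psql -x -c 'SELECT * FROM pg_stat_replication';" \
#     | all_streaming && echo "all replicas streaming"
```

The function fails when the query returns no rows, so a primary with no connected replicas is reported as not streaming rather than passing silently.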