VCF Operations cluster fails to come online and analytics service stops after failed data node expansion (pg_basebackup / pg_hba.conf replication error)

Article ID: 437931

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

In a VMware Cloud Foundation (VCF) Operations (Aria Operations) deployment, the cluster may become stuck in a degraded or non-functional state when a new data node is added. The issue is caused by a network disruption between ESXi hosts during node expansion, which leaves the failover between the primary and replica nodes incomplete and breaks database replication.

You may observe:

Cluster expansion (add node) remains stuck or hung
Primary and replica nodes remain in “waiting for analytics” state
Data node deployment completes but fails to join cluster
Replica services such as vpostgres-repl fail to start
Cluster does not recover even after vMotioning the new node
Logs show replication failures such as:
- pg_basebackup: error: connection to server failed:
  FATAL: no pg_hba.conf entry for replication connection
Analytics service terminates with exit code -1
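
To confirm the replication symptom quickly, the error text quoted above can be searched for in a log file. The following is a minimal sketch, not an official tool: the function name has_replication_error is hypothetical, and the log file path must be supplied by the caller because this article does not state which log contains the message.

```shell
#!/bin/sh
# Sketch only: has_replication_error is a hypothetical helper name.
# It succeeds (exit status 0) when the given log file contains the
# replication failure text quoted in this article.
has_replication_error() {
    log_file="$1"
    # -q: print nothing, only set the exit status
    # -F: treat the pattern as a fixed string, not a regular expression
    grep -qF 'no pg_hba.conf entry for replication connection' "$log_file"
}
```

Example use, with a placeholder path: has_replication_error /path/to/logfile && echo "replication connection rejected"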

This can occur after:

A failed or partial node expansion
Underlying network connectivity issues between ESXi hosts
An incomplete failover event between primary and replica nodes

Environment

VMware Cloud Foundation (VCF) 9.x

VCF Operations / Aria Operations 9.x

Cause

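The cluster enters an inconsistent state because the failover between the primary and primary replica nodes only partially completed, and the PostgreSQL replication configuration is broken: the pg_hba.conf file on the primary does not permit the replication connection, so pg_basebackup is rejected.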

Resolution

Remove failed data node attempt
1. Power off the failed data node VM
2. Rename or remove the VM (do not reuse it)
Take cluster offline
1. Run the following command on a node:
  - $VMWARE_PYTHON_BIN /usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsClusterManager.py offline-cluster "SUPPORT"
2. In the Admin UI:
  - Select Force Take Offline
Correct replication configuration
1. Edit the following file on the current primary node:
  - /storage/db/vcops/vpostgres/repl/pg_hba.conf
2. Ensure the file contains a replication entry that permits connections from the replica node's IP address
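
The check in the previous step can be sketched as a small shell function. This is an illustration under assumptions, not the product's own tooling: the function name has_replication_entry is hypothetical, and it relies only on the standard PostgreSQL pg_hba.conf column layout (TYPE DATABASE USER ADDRESS METHOD), where a replication rule has the keyword replication in the DATABASE column. The user and authentication method to use are not given in this article, so they appear only as placeholders. The sketch matches an entry whose ADDRESS is exactly the replica IP (with or without a /prefix); it does not evaluate wider CIDR ranges, host names, or comma-separated database lists.

```shell
#!/bin/sh
# Sketch only: has_replication_entry is a hypothetical helper name.
# General pg_hba.conf form of a replication rule (placeholders, not values
# taken from this article):
#   host  replication  <user>  <replica-ip>/32  <auth-method>
# Succeeds (exit status 0) when hba_file has a non-comment host* line with
# DATABASE "replication" whose ADDRESS equals replica_ip.
has_replication_entry() {
    hba_file="$1"
    replica_ip="$2"
    awk -v ip="$replica_ip" '
        /^[ \t]*#/ { next }                  # skip comment lines
        $1 ~ /^host/ && $2 == "replication" {
            split($4, addr, "/")             # drop any /prefix from ADDRESS
            if (addr[1] == ip) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$hba_file"
}
```

Example use, with a placeholder IP: has_replication_entry /storage/db/vcops/vpostgres/repl/pg_hba.conf 192.0.2.10 || echo "no replication entry for replica"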
Bring cluster online
1. Bring the cluster back online
2. Verify services initialize successfully
Stabilize cluster state
1. Take the cluster offline again
2. Take powered-off snapshots of:
  - Primary node
  - Replica node
3. Bring the cluster back online
Clean up failed node references
1. Navigate to:
  - Fleet Manager → Lifecycle → Operations
2. Trigger Inventory Sync
3. Confirm stale node references are removed
Add new data node
1. Go to:
  - Fleet Manager → Lifecycle → Operations → Manage → Add Nodes
2. Deploy a new data node on a healthy ESXi host
3. Confirm successful cluster expansion
Validate and clean up
1. Verify:
  - Data collection is working
  - Cluster health is green
2. Remove all snapshots
