VCF Operations cluster fails to come online and analytics service stops after failed data node expansion (pg_basebackup / pg_hba.conf replication error)

Article ID: 437931

Updated On:

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

When a new data node is added to a VMware Cloud Foundation (VCF) Operations (formerly Aria Operations) cluster, the cluster may become stuck in a degraded or non-functional state. This happens when a network disruption between ESXi hosts during node expansion leaves the failover between the primary and replica nodes incomplete and breaks database replication.

You may observe:

  • Cluster expansion (add node) hangs and never completes
  • Primary and replica nodes remain in the “waiting for analytics” state
  • Data node deployment completes, but the node fails to join the cluster
  • Replica services such as vpostgres-repl fail to start
  • The cluster does not recover even after the new node is migrated with vMotion
  • Logs show replication failures such as:
    • pg_basebackup: error: connection to server failed:
      FATAL: no pg_hba.conf entry for replication connection
  • Analytics service terminates with exit code -1
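
The replication failure listed above can be confirmed by searching a log file for the error text. The following is a minimal sketch: the function name has_replication_hba_error is hypothetical, and the log file to scan is an assumption that you must supply, because this article does not name the file that records the message.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the given log file
# contains the replication rejection message quoted in this article,
# and fails (non-zero exit status) otherwise.
has_replication_hba_error() {
    log_file="$1"   # assumption: path to a log file chosen by the operator
    # -q: print nothing, report only through the exit status
    # -F: treat the pattern as a fixed string, not a regular expression
    grep -qF 'no pg_hba.conf entry for replication connection' "$log_file"
}
```

For example, run `has_replication_hba_error /path/to/logfile && echo "replication rejected by pg_hba.conf"` after loading the function into the current shell.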

This can occur after:

  • A failed or partial node expansion
  • Underlying network connectivity issues between ESXi hosts
  • An incomplete failover event between primary and replica nodes

Environment

VMware Cloud Foundation (VCF) 9.x

VCF Operations / Aria Operations 9.x

Cause

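The cluster enters an inconsistent state because the failover between the primary and replica nodes only partially completed. In addition, the PostgreSQL replication configuration is broken: pg_hba.conf on the primary node has no entry that permits a replication connection from the replica, so pg_basebackup cannot connect.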

Resolution

  1. Remove the failed data node
    1. Power off the failed data node VM
    2. Rename or remove the VM (do not reuse it)
  2. Take the cluster offline
    1. Run the following command on a node:
      • $VMWARE_PYTHON_BIN /usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/bin/vcopsClusterManager.py offline-cluster "SUPPORT"
    2. In the Admin UI:
      • Select Force Take Offline
  3. Correct replication configuration
    1. Edit the following file on the current primary node:
      • /storage/db/vcops/vpostgres/repl/pg_hba.conf
    2. Ensure the file contains a replication entry that allows connections from the replica node's IP address
  4. Bring the cluster online
    1. Bring the cluster back online
    2. Verify services initialize successfully
  5. Stabilize the cluster state
    1. Take the cluster offline again
    2. Take powered-off snapshots of:
      • Primary node
      • Replica node
    3. Bring the cluster back online
  6. Clean up failed node references
    1. Navigate to:
      • Fleet Manager → Lifecycle → Operations
    2. Trigger Inventory Sync
    3. Confirm stale node references are removed
  7. Add a new data node
    1. Go to:
      • Fleet Manager → Lifecycle → Operations → Manage → Add Nodes
    2. Deploy a new data node on a healthy ESXi host
    3. Confirm successful cluster expansion
  8. Validate and clean up
    1. Verify:
      • Data collection is working
      • Cluster health is green
    2. Remove all snapshots
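
For step 3, a standard PostgreSQL replication rule in pg_hba.conf has the form `host replication <user> <address> <auth-method>`. This article does not specify the user or the authentication method, so leave those values as they appear in the existing entries of the file. The sketch below checks whether a file already contains such a rule for a given IP address. The function name has_replication_entry is hypothetical, and the check is deliberately narrow: it matches only a literal IP or IP/32 in the address field and does not detect wider CIDR ranges or host names that would also cover the replica.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) when the pg_hba.conf file
# has an uncommented rule whose database field is "replication" and whose
# address field is exactly the given IP or IP/32.
has_replication_entry() {
    hba_file="$1"     # for example, the pg_hba.conf path from step 3
    replica_ip="$2"   # IP address of the replica node
    awk -v ip="$replica_ip" '
        /^[[:space:]]*#/ { next }                  # skip comment lines
        $1 ~ /^host/ && $2 == "replication" &&
            ($4 == ip || $4 == (ip "/32")) { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "$hba_file"
}
```

For example, `has_replication_entry /storage/db/vcops/vpostgres/repl/pg_hba.conf <replica-ip> || echo "no replication entry for the replica"` reports a missing rule before the cluster is brought back online.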