UI not available after a stuck automated backup

Products

VMware Cloud Director

Issue/Introduction

VMware Cloud Director (VCD) services fail to start, and both the Provider Portal and the Appliance Management Interface (VAMI) are inaccessible. The underlying PostgreSQL High Availability (HA) database cluster reports synchronization failures.

Symptoms:

VCD Portal returns "503 Service Unavailable", "404 Not Found", or "HTTP ERROR 404 JSP file [/error.jsp] not found".
VAMI shows database status as Read_Only_Primary or Unknown.
vCenter Server displays multiple stuck backup tasks for the VCD and NFS VMs, triggered by third-party backup software.

Environment

10.x

Cause

The issue is caused by simultaneous image-level backups of all VCD appliance nodes and the shared NFS storage. When multiple snapshots are triggered at the same time, the resulting I/O "stun" or latency exceeds the PostgreSQL HA (repmgr) heartbeat timeout. This leads to a cluster-wide sync failure where nodes cannot determine the Primary state, effectively shutting down the VCD services to prevent data corruption.

Prevent Recurrence:
- Exclude Storage: Do not snapshot the NFS storage VMs simultaneously with the VCD Cells.
- Use Native Backup: Transition to the VCD native appliance backup (VAMI-based) for database-consistent protection without I/O stun.

Resolution

Clear Stuck vCenter Tasks:
- Manually cancel the stuck tasks in vCenter.
- Identify the ESXi host where the VCD Cell VMs are registered.
- Restart management agents on the ESXi host if the snapshot tasks cannot be cancelled via the vSphere Client:
```
services.sh restart
```
Consolidate Disks:
- Right-click each VCD Cell VM > Snapshots > Consolidate.
- Ensure all orphaned snapshots from previous backup attempts are removed.

Recover PostgreSQL HA Cluster:

SSH to each VCD Cell as root.

Check the cluster status:

```
sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr -f /opt/vmware/vpostgres/current/etc/repmgr.conf cluster show
```
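To pick out the out-of-sync node quickly, the table printed by `cluster show` can be filtered. The following is a minimal sketch, not part of the product: the function name `list_unhealthy_nodes` is hypothetical, and it assumes the usual repmgr table layout in which the columns are ID, Name, Role, and Status, separated by `|`. Verify the layout against the output on your appliance before relying on it.

```shell
#!/bin/bash
# Hypothetical helper: reads `repmgr cluster show` table output on stdin and
# prints the name of every node whose Status column is not plain "running".
# Assumption: columns are "ID | Name | Role | Status | ...".
list_unhealthy_nodes() {
    awk -F'|' '
        # Only data rows start with a numeric node ID; this skips the
        # header row and the dashed separator row.
        NF >= 4 && $1 ~ /^[ ]*[0-9]+[ ]*$/ {
            name = $2
            status = $4
            gsub(/^[ ]+|[ ]+$/, "", name)
            # Strip padding and the "*" that marks the primary. Markers
            # such as "?", "-" or "!" are kept so the node is reported.
            gsub(/^[ *]+|[ ]+$/, "", status)
            if (status != "running") print name
        }'
}

# Example usage on a VCD cell (run as root):
#   sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr \
#       -f /opt/vmware/vpostgres/current/etc/repmgr.conf cluster show \
#       | list_unhealthy_nodes
```

An empty result suggests that every node reports `running`; any name printed is a candidate for the resync steps that follow.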

SSH to the out-of-sync node and stop the vpostgres service:
```
systemctl stop vpostgres.service
```

Delete stale DB data:

```
rm -rf /var/vmware/vpostgres/current/pgdata
```

Clone DB from the primary (use its eth1 IP):

```
sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr -h <primary_database_IP> -U repmgr -d repmgr -f /opt/vmware/vpostgres/current/etc/repmgr.conf standby clone
```

Start the DB service:
```
systemctl start vpostgres.service
```

Add the node to repmgr cluster:

```
sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr -h <primary_database_IP> -U repmgr -d repmgr -f /opt/vmware/vpostgres/current/etc/repmgr.conf standby register --force
```

Confirm that the VAMI page now shows the cluster as healthy (green).
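The resync steps for an out-of-sync node can also be chained in a small wrapper so that they always run in the same order. This is an illustrative sketch only: the function names `run` and `resync_standby` and the `DRY_RUN` variable are hypothetical, while the paths and commands are the ones used in this article. By default it only prints each command; set `DRY_RUN=0` to execute, and only on the out-of-sync standby, never on the primary, because the data directory is deleted.

```shell
#!/bin/bash
# Hypothetical wrapper around the resync steps in this article.
# Paths are taken from the article; adjust them if your appliance differs.
REPMGR=/opt/vmware/vpostgres/current/bin/repmgr
REPMGR_CONF=/opt/vmware/vpostgres/current/etc/repmgr.conf
PGDATA=/var/vmware/vpostgres/current/pgdata

# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "DRY-RUN: $*"
    else
        "$@"
    fi
}

# Usage: resync_standby <primary_database_IP>   (eth1 IP of the primary)
resync_standby() {
    local primary_ip="$1"
    if [ -z "$primary_ip" ]; then
        echo "usage: resync_standby <primary_database_IP>" >&2
        return 1
    fi
    # Stop the database, remove the stale data, clone from the primary,
    # start the database again, then re-register the node with repmgr.
    run systemctl stop vpostgres.service || return 1
    run rm -rf "$PGDATA" || return 1
    run sudo -i -u postgres "$REPMGR" -h "$primary_ip" -U repmgr -d repmgr \
        -f "$REPMGR_CONF" standby clone || return 1
    run systemctl start vpostgres.service || return 1
    run sudo -i -u postgres "$REPMGR" -h "$primary_ip" -U repmgr -d repmgr \
        -f "$REPMGR_CONF" standby register --force
}
```

Running `resync_standby <primary_database_IP>` with the default dry-run setting lets you review the five commands before committing to them. Each step stops the sequence on failure so that a failed clone is not followed by a service start on an empty data directory.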

Additional Information

If the VAMI page shows "INDETERMINATE" after applying the fix, see "INDETERMINATE Cluster failover status".
After recovering the cluster, consider switching the primary role back to the original primary cell.