UI not available after automated stuck backup

Article ID: 424921

Updated On:

Products

VMware Cloud Director

Issue/Introduction

VMware Cloud Director (VCD) services fail to start, and both the Provider Portal and the Appliance Management Interface (VAMI) are inaccessible. The underlying PostgreSQL High Availability (HA) database cluster reports synchronization failures.

Symptoms:

  • The VCD portal returns "503 Service Unavailable", "404 Not Found", or "HTTP ERROR 404 JSP file [/error.jsp] not found".

  • VAMI shows database status as Read_Only_Primary or Unknown.

  • vCenter Server shows multiple stuck backup tasks for the VCD and NFS VMs, triggered by third-party backup software.

Environment

VMware Cloud Director 10.x

Cause

The issue is caused by simultaneous image-level backups of all VCD appliance nodes and the shared NFS storage. When multiple snapshots are triggered at the same time, the resulting I/O stun or latency exceeds the PostgreSQL HA (repmgr) heartbeat timeout. The cluster then fails to synchronize, the nodes cannot determine which of them is the primary, and the VCD services shut down to prevent data corruption.
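
To get a rough sense of how long a stun the cluster might tolerate, the sketch below multiplies two repmgrd settings, reconnect_attempts and reconnect_interval, read from a repmgr.conf file. This is an illustration under assumptions: the function name repmgr_failure_window is hypothetical, the values are assumed to be plain unquoted integers, and whether these two settings are what governs the timeout on a given VCD appliance build is not confirmed by this article. When a setting is absent, the function falls back to repmgr's documented defaults (6 attempts, 10 seconds).

```shell
#!/bin/sh
# Hypothetical helper: approximate time (in seconds) repmgrd keeps retrying an
# unreachable upstream node, computed as reconnect_attempts * reconnect_interval.
# Assumption: values in the conf file are unquoted integers.
repmgr_failure_window() {
    conf="$1"
    # Take the last uncommented assignment of each setting, if any.
    attempts=$(sed -n 's/^[[:space:]]*reconnect_attempts[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf" | tail -n 1)
    interval=$(sed -n 's/^[[:space:]]*reconnect_interval[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf" | tail -n 1)
    # Fall back to repmgr defaults when a setting is not present.
    echo $(( ${attempts:-6} * ${interval:-10} ))
}
```

On a cell, it could be called as repmgr_failure_window /opt/vmware/vpostgres/current/etc/repmgr.conf. A snapshot stun that approaches the printed number of seconds is a plausible candidate for triggering the behavior described above.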

  • Prevent Recurrence:

    • Exclude Storage: Do not snapshot the NFS storage VMs simultaneously with the VCD Cells.

    • Use Native Backup: Transition to the VCD native appliance backup (VAMI-based) for database-consistent protection without I/O stun.

Resolution

 

  • Clear Stuck vCenter Tasks:

    • Manually cancel the stuck tasks in vCenter.
    • Identify the ESXi host where the VCD Cell VMs are registered.

    • If the snapshot tasks cannot be cancelled from the vSphere Client, restart the management agents on that ESXi host:

      services.sh restart
  • Consolidate Disks:

    • Right-click each VCD Cell VM > Snapshots > Consolidate.

    • Ensure all orphaned snapshots from previous backup attempts are removed.

  • Recover PostgreSQL HA Cluster:

    1. SSH to each VCD Cell as root.

    2. Check the cluster status: 

      sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr -f /opt/vmware/vpostgres/current/etc/repmgr.conf cluster show
    3. SSH to the out-of-sync node.

    4. Stop the vpostgres service:

      systemctl stop vpostgres.service
    5. Delete stale DB data:

      rm -rf /var/vmware/vpostgres/current/pgdata
    6. Clone DB from the primary (use its eth1 IP):
      sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr -h <primary_database_IP> -U repmgr -d repmgr -f /opt/vmware/vpostgres/current/etc/repmgr.conf standby clone
    7. Start the DB service:
      systemctl start vpostgres.service
    8. Add the node to repmgr cluster:
      sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr -h <primary_database_IP> -U repmgr -d repmgr -f /opt/vmware/vpostgres/current/etc/repmgr.conf standby register --force
    9. Confirm that the VAMI page now shows the cluster as green and healthy.
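
When several cells are involved, picking out the out-of-sync node from the cluster status output (steps 2 and 3) can be scripted. The following is a minimal sketch, not part of the documented procedure: the function name unhealthy_nodes is hypothetical, and it assumes the usual pipe-separated table printed by repmgr cluster show, with the node name in the second column and the status in the fourth. It prints the name of every node whose status is anything other than "running" or "* running".

```shell
#!/bin/sh
# Hypothetical helper: read `repmgr cluster show` output on stdin and print
# the names of nodes that are not reported as plainly running.
# Assumption: columns are separated by "|", Name is column 2, Status is column 4.
unhealthy_nodes() {
    awk -F'|' '
        NF < 4 { next }                        # skip separator rows, blank lines, warnings
        {
            name = $2; status = $4
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", status)
            if (name == "Name") next            # skip the header row
            if (status != "running" && status != "* running") print name
        }
    '
}
```

To use it, pipe the cluster show command from step 2 into unhealthy_nodes. An empty result suggests every node is reported as running; any name printed is a candidate for the re-clone in steps 3 to 8.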

 

Additional Information