Error: "Failed to Start" in UI After Placing Primary Cell in Maintenance Mode

Article ID: 402885

Products

VMware Cloud Director

Issue/Introduction

  • Placing the primary VMware Cloud Director cell in Maintenance Mode makes the UI and API unresponsive through the public FQDN, and the UI reports “Failed to Start”.
  • The environment appears unavailable until the primary cell is restarted, despite other cells being healthy.

Environment

VMware Cloud Director 10.6.1

Cause

This issue can occur when the Load Balancer health check does not detect that a VMware Cloud Director cell is in Maintenance Mode. Endpoints such as /api/versions may still return HTTP 200 while the cell is in Maintenance Mode, even though the Cloud Director service on that cell is no longer serving UI/API traffic. As a result, the Load Balancer keeps routing requests to a cell that is not operational, and the UI shows “Failed to Start” errors.
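To see the mismatch described above on a particular cell, you can query both endpoints on that cell directly while it is in Maintenance Mode and compare the status codes. Querying the cell directly bypasses the Load Balancer. The following is a minimal Python sketch using only the standard library. It assumes the test machine can reach the cell and that the cell presents a self-signed certificate, which is why verification is disabled, as with curl -k. The function names and the example cell address are hypothetical.

```python
import http.client
import ssl
from urllib.parse import urlsplit


def probe_status(base_url: str, path: str, timeout: float = 5.0) -> int:
    """Send one GET to base_url + path and return the HTTP status code.

    Redirects are not followed (like curl without -L). For HTTPS, certificate
    verification is disabled (like curl -k) because cells often use
    self-signed certificates; do not reuse this for sensitive traffic.
    """
    parts = urlsplit(base_url)
    if parts.scheme == "https":
        ctx = ssl.create_default_context()
        ctx.check_hostname = False  # must be cleared before CERT_NONE
        ctx.verify_mode = ssl.CERT_NONE
        conn = http.client.HTTPSConnection(
            parts.hostname, parts.port or 443, timeout=timeout, context=ctx
        )
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()


def compare_endpoints(cell_url: str) -> dict:
    """Probe /api/versions and /cloud on one cell, returning {path: status}."""
    return {path: probe_status(cell_url, path) for path in ("/api/versions", "/cloud")}


# Example against a single cell (hypothetical address), not the public FQDN:
#   compare_endpoints("https://vcd-cell1.example.com")
```

If a cell in Maintenance Mode returns 200 for /api/versions but 503 for /cloud, then a health check built on /api/versions would keep that cell in the pool.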

Resolution

  1. Review Load Balancer Backend Configuration
    Identify the backend pool configuration for VMware Cloud Director in your Load Balancer (e.g., HAProxy, NSX-ALB).

  2. Update Health Check to Detect Maintenance Mode
    Modify the health check for each backend VCD node to use the /cloud endpoint. For example, in an HAProxy backend section:

    option httpchk GET /cloud
    http-check expect status 200
    # Health checks only run for "server" lines in this backend that include the "check" keyword.

    The /cloud endpoint is a lightweight check served by VCD itself. It returns 200 OK when the Cloud Director service is running and accepting UI/API requests, and 503 Service Unavailable when the cell is in Maintenance Mode or the service is stopped. This lets the Load Balancer detect that the node is unavailable and remove it from the active pool automatically once the configured number of consecutive checks has failed.

  3. Apply the Configuration
    Reload or restart your Load Balancer to apply the changes.

    For example, on HAProxy:

    systemctl reload haproxy

  4. Validate Health Check Behavior
    While the primary cell is in Maintenance Mode, run the following command from any test system. The -i option makes curl print the response status line along with the headers:

    curl -k -i https://<public-fqdn>/cloud

    Expected behavior:

    If traffic is being routed to a healthy node:
    HTTP/1.1 200 OK

    If traffic is still being routed to the node in Maintenance Mode:
    HTTP/1.1 503 Service Unavailable

  5. Confirm UI Availability
    Open the VCD portal at the public FQDN and confirm that the login page loads without the “Failed to Start” error.
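Step 2 relies on the Load Balancer removing a failing cell from the active pool. In HAProxy this removal is not instantaneous. A server is marked DOWN only after "fall" consecutive failed checks, and it is marked UP again only after "rise" consecutive passing checks. The defaults are fall 3 and rise 2. The Python sketch below is a simplified model of that counter logic for illustration. It is not HAProxy's implementation, and the class name BackendServer and the server name vcd-cell1 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackendServer:
    """Simplified model of one backend's health-check state.

    Assumption: imitates HAProxy's documented rise/fall counters
    (defaults rise=2, fall=3); illustrative only, not HAProxy code.
    """

    name: str
    rise: int = 2  # consecutive passing checks to mark a DOWN server UP
    fall: int = 3  # consecutive failing checks to mark an UP server DOWN
    up: bool = True
    streak: int = 0  # consecutive results that disagree with the current state

    def record(self, status: int, expected: int = 200) -> bool:
        """Apply one health-check result and return whether the server is UP."""
        passed = status == expected  # like "http-check expect status 200"
        if passed == self.up:
            self.streak = 0  # result agrees with the current state
        else:
            self.streak += 1
            limit = self.fall if self.up else self.rise
            if self.streak >= limit:
                self.up = not self.up
                self.streak = 0
        return self.up


# The cell enters Maintenance Mode after the first check and leaves it after
# the fourth; /cloud answers 503 in between, as described in step 2.
cell = BackendServer("vcd-cell1")  # hypothetical server name
states = [cell.record(code) for code in (200, 503, 503, 503, 200, 200)]
print(states)  # [True, True, True, False, False, True]
```

In this model the cell stays in the pool through the first and second failed checks. HAProxy's default check interval is 2 seconds, so requests can still reach the cell for up to about six seconds after it enters Maintenance Mode. Lowering inter or fall on the server line shortens that window.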

Additional Information

This issue occurs only when the Load Balancer continues to route requests to a cell whose VCD service has been intentionally paused, for example by placing the cell in Maintenance Mode.
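Each request through the public FQDN reaches only one backend cell, so the single curl in step 4 may land on a healthy cell by chance. The minimal Python sketch below repeats the step 4 request and counts the status codes returned. It assumes the test machine can reach the public FQDN and that the Load Balancer distributes new connections across cells. With source-IP persistence, every request may reach the same cell. The function name tally_status and the example FQDN are hypothetical.

```python
import http.client
import ssl
from collections import Counter
from urllib.parse import urlsplit


def tally_status(base_url: str, path: str = "/cloud", attempts: int = 10,
                 timeout: float = 5.0) -> Counter:
    """Send `attempts` GET requests to base_url + path and count status codes.

    Each request opens a fresh connection so the Load Balancer can choose a
    backend every time; a reused keep-alive connection may keep reaching the
    same cell. Redirects are not followed (like curl without -L) and TLS
    verification is disabled (like curl -k).
    """
    parts = urlsplit(base_url)
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be cleared before CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE
    codes: Counter = Counter()
    for _ in range(attempts):
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(
                parts.hostname, parts.port or 443, timeout=timeout, context=ctx
            )
        else:
            conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
        try:
            conn.request("GET", path)
            codes[conn.getresponse().status] += 1
        finally:
            conn.close()
    return codes


# Example (hypothetical FQDN):
#   tally_status("https://vcd.example.com", attempts=20)
```

With the updated health check in place and the primary cell in Maintenance Mode, the tally should contain only 200. Any 503 means the Load Balancer is still routing some requests to the paused cell.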