Cloud account creation error leaves orphaned resources and 403 error in VMware Cloud Foundation

Article ID: 434820

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

When attempting to create a cloud account from a workload domain in VMware Cloud Foundation (VCF), the creation fails and leaves orphaned compute, network, and storage resources in VMware Aria Automation.

Subsequent attempts to create the cloud account fail with a 403 error stating that the cloud account already exists. Even if the cloud account is removed, the error persists, and a "ghost" cloud account remains visible from VMware Aria Operations. Running a GET request against the /iaas/api/cloud-accounts resource shows only a single cloud account.

Error: "Failed to validate credentials. Error: Unable to validate endpoint of type vsphere with hostname:<vcenter FQDN>{"message":"Cloud account already exists with this identifier","statusCode":400,"errorCode":0,"serverErrorId":"XXXXX-XXXX-XXXXX","documentKind":"com:vmware:xenon:common:ServiceErrorResponse"}"
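The GET request mentioned above can be sketched in shell as follows. This is a minimal illustration, not an official procedure: the helper names `list_cloud_accounts` and `count_cloud_accounts` are hypothetical, the bearer-token header is an assumption about how the API is authenticated in your environment, and the `totalElements` field is an assumption about the shape of the JSON response.

```shell
# Hypothetical helpers; the host name and token are placeholders supplied by the operator.

# Fetch the cloud account list from /iaas/api/cloud-accounts.
# Assumption: $2 holds a valid bearer token for the VCF Automation API.
list_cloud_accounts() (
  host="$1"
  token="$2"
  curl --silent --show-error --fail \
    --header "Authorization: Bearer ${token}" \
    --header "Accept: application/json" \
    "https://${host}/iaas/api/cloud-accounts"
)

# Read a JSON response on stdin and print the value of "totalElements".
# Assumption: the response carries a numeric "totalElements" counter.
# Prints nothing when the field is absent.
count_cloud_accounts() {
  sed -n 's/.*"totalElements"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

With these definitions, `list_cloud_accounts <host> <token> | count_cloud_accounts` would print the number of cloud accounts the API reports, which can be compared against what the UI shows.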

Environment

VMware Cloud Foundation Automation 9.X

Cause

This issue occurs because the vCenter account registration with the SDDC takes more than two minutes, which exceeds the default 80-second timeout for the client used in VCF. This timeout causes the cloud account creation to fail mid-process, leaving an orphaned endpoint in the database.

Resolution

This issue is currently under review with Broadcom Engineering. Subscribe to this article to receive updates on this issue.

Important Prerequisite:
Before proceeding with the workaround steps, please ensure you have taken a full, native backup of your VCF Automation 9.x cluster. Please be aware that traditional vCenter VM snapshots are not supported for this appliance and should not be used as a rollback method.

Workaround

To resolve the issue, you must increase the client timeout duration and manually clean up the orphaned endpoint ("WLD-vCenterName") from the database.

Part 1: Increase the WebClient Timeout

Increase the timeout property to 300 seconds to prevent future timeouts during creation:

-Dcom.vmware.automation.spring.webflux.platform.client.WebClientUtil.request.timeout.duration=PT300S
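The value PT300S is an ISO-8601 duration meaning 300 seconds. The sketch below shows how such a value compares with the registration time described in the Cause section. The function names are hypothetical, only the seconds-only PT<n>S form is handled, and PT80S is simply the 80-second default written in the same notation for illustration.

```shell
# Convert a seconds-only ISO-8601 duration such as PT300S to a plain number.
# Only the PT<n>S form is handled; anything else is rejected with exit status 1.
duration_seconds() (
  value="${1#PT}"
  [ "$value" != "$1" ] || { echo "unsupported duration: $1" >&2; exit 1; }
  number="${value%S}"
  [ "$number" != "$value" ] || { echo "unsupported duration: $1" >&2; exit 1; }
  case "$number" in
    ''|*[!0-9]*) echo "unsupported duration: $1" >&2; exit 1 ;;
  esac
  echo "$number"
)

# Succeed when the timeout ($1, ISO-8601) is strictly longer than the
# observed operation time ($2, in seconds).
timeout_covers() (
  limit="$(duration_seconds "$1")" || exit 1
  [ "$limit" -gt "$2" ]
)

# A registration that takes two minutes (120 seconds) or more:
timeout_covers PT300S 120 && echo "PT300S covers a 120-second registration"
timeout_covers PT80S 120 || echo "PT80S does not cover a 120-second registration"
```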

Part 2: Clean up the Orphaned Endpoint in VCF Automation

Important: Ensure you have taken a database backup of provisioning_db and catalog_db before performing the steps below.

  1. SSH into one of the nodes (it does not matter which node you use):

    ssh vmware-system-user@<node ip>
    
  2. Switch to a root shell (sudo -s runs the shell specified by the SHELL environment variable as root):

    sudo -s
    
  3. Set KUBECONFIG to access the cluster:

    export KUBECONFIG=/etc/kubernetes/admin.conf
    
  4. Retrieve the database credentials. First, get the username (for example, for the catalog_db owner):

    kubectl get secret catalog-db-owner-user.vcfapostgres.credentials.postgresql.acid.zalan.do -n prelude -o jsonpath='{.data.username}' | base64 -d
    

    Next, get the password:

    kubectl get secret catalog-db-owner-user.vcfapostgres.credentials.postgresql.acid.zalan.do -n prelude -o jsonpath='{.data.password}' | base64 -d
    
  5. Identify the primary pod:

    export PRIMARY_POD=$(kubectl get pods -n prelude -l application=spilo,spilo-role=master -o jsonpath='{.items[0].metadata.name}')
    echo "Using primary pod: $PRIMARY_POD"
    
  6. Using the primary pod and credentials, connect to provisioning_db and run the following SELECT statement to obtain the document_self_link of the orphaned endpoint. Replace WLD-vCenterName with the name of the endpoint; the returned value has the form /resources/endpoints/UUID:

    SELECT document_self_link FROM endpoint_state WHERE endpoint_type='vsphere' AND name = 'WLD-vCenterName';

  7. Once the document_self_link of the endpoint is verified, run the following SQL statements to clean up the linked records in provisioning_db, replacing /resources/endpoints/UUID with the returned value:

    DELETE FROM disk_state WHERE endpoint_link = '/resources/endpoints/UUID';
    DELETE FROM network_interface_state WHERE endpoint_link = '/resources/endpoints/UUID';
    DELETE FROM subnet_state WHERE endpoint_link = '/resources/endpoints/UUID';
    DELETE FROM network_state WHERE endpoint_link = '/resources/endpoints/UUID';
    DELETE FROM compute_description WHERE endpoint_link = '/resources/endpoints/UUID';
    DELETE FROM compute_state WHERE endpoint_link = '/resources/endpoints/UUID';
    DELETE FROM storage_description WHERE endpoint_link = '/resources/endpoints/UUID';
    DELETE FROM endpoint_state WHERE endpoint_type='vsphere' AND name = 'WLD-vCenterName';

  8. Switch your database session from provisioning_db to catalog_db (e.g., using \c catalog_db if in a continuous psql session).

  9. Once connected to catalog_db, run the following commands to clean up resources linked to the endpoint:

    UPDATE dep_deployment SET description = 'WLD-vCenterName' WHERE id in (select deployment_id from dep_resource where account = 'WLD-vCenterName');
    DELETE FROM dep_search WHERE resource_id IN (select id from dep_resource where account = 'WLD-vCenterName');
    DELETE FROM dep_resource_data WHERE resource_id IN (select id from dep_resource where account = 'WLD-vCenterName');
    DELETE FROM dep_resource WHERE account = 'WLD-vCenterName';
    DELETE FROM dep_deployment WHERE description = 'WLD-vCenterName';
    
  10. Quit the database session.

  11. Log in to the VCF Automation UI and verify that the endpoint "WLD-vCenterName" has been deleted and that the 403 error no longer occurs when creating a new cloud account.
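Because the cleanup statements in steps 7 and 9 repeat the same endpoint link and name many times, it can help to generate them from the two values instead of editing each line by hand. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the self link is assumed to consist of /resources/endpoints/ followed only by letters, digits, and hyphens, and the script only prints SQL for review; it does not connect to any database.

```shell
# Print the provisioning_db cleanup statements (step 7) for one endpoint.
# $1: document_self_link returned by the SELECT in step 6
# $2: endpoint name (WLD-vCenterName in this article)
provisioning_cleanup_sql() (
  link="$1"
  name="$2"
  id="${link#/resources/endpoints/}"
  # Refuse anything that does not start with /resources/endpoints/.
  if [ "$id" = "$link" ]; then
    echo "unexpected self link: $link" >&2
    exit 1
  fi
  # Assumption: the identifier contains only letters, digits, and hyphens.
  case "$id" in
    ''|*[!A-Za-z0-9-]*) echo "unexpected self link: $link" >&2; exit 1 ;;
  esac
  # Refuse names containing a single quote so the SQL literals stay intact.
  case "$name" in
    ''|*\'*) echo "unexpected endpoint name: $name" >&2; exit 1 ;;
  esac
  # Same tables, in the same order, as step 7.
  for table in disk_state network_interface_state subnet_state network_state \
               compute_description compute_state storage_description; do
    printf "DELETE FROM %s WHERE endpoint_link = '%s';\n" "$table" "$link"
  done
  printf "DELETE FROM endpoint_state WHERE endpoint_type='vsphere' AND name = '%s';\n" "$name"
)

# Print the catalog_db cleanup statements (step 9) for one endpoint name.
catalog_cleanup_sql() (
  name="$1"
  case "$name" in
    ''|*\'*) echo "unexpected endpoint name: $name" >&2; exit 1 ;;
  esac
  resources="SELECT id FROM dep_resource WHERE account = '$name'"
  printf "UPDATE dep_deployment SET description = '%s' WHERE id IN (SELECT deployment_id FROM dep_resource WHERE account = '%s');\n" "$name" "$name"
  printf "DELETE FROM dep_search WHERE resource_id IN (%s);\n" "$resources"
  printf "DELETE FROM dep_resource_data WHERE resource_id IN (%s);\n" "$resources"
  printf "DELETE FROM dep_resource WHERE account = '%s';\n" "$name"
  printf "DELETE FROM dep_deployment WHERE description = '%s';\n" "$name"
)
```

For example, `provisioning_cleanup_sql '/resources/endpoints/UUID' 'WLD-vCenterName'` prints the eight statements from step 7 with the supplied values substituted, and `catalog_cleanup_sql 'WLD-vCenterName'` prints the five statements from step 9. Review the output before running it in the corresponding database session.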