Deployment of VCF Automation from VCF Installer fails with "connection timed out" and "404 NOT_FOUND" due to 10.244.0.0/16 Subnet Overlap



Article ID: 440083



Products

VCF Operations, VMware SDDC Manager / VCF Installer, VCF Automation

Issue/Introduction

When attempting to deploy VCF Automation via the VCF Installer (SDDC Manager), the deployment workflow stalls and ultimately fails.
  • The VCF Installer UI displays the following failure: 
    Retrieve the status of VCF Automation Deployment request - Failed - MM/DD/YY, HH:MM
    Unable to get request status for request <Request_ID> Reference Token: ######

     

  • In the VCF Installer domainmanager.log, a 404 NOT_FOUND error is recorded when querying Fleet Management for the task status: 
    YYYY-MM-DDTHH:MM:SS.###+0000 DEBUG [vcf_dm,<Task_ID>,####] [c.v.e.s.r.c.LoggingHttpRequestInterceptor,dm-exec-##]  Request URI: https://<VCF_FLEET_MANAGER_FQDN>/lcm/request/api/v2/requests/<Request_ID>
    Request method: GET
    Request body:
    Response code: 404 NOT_FOUND

     

  • In the Fleet Management log (/var/log/vrlcm/vmsp_bootstrap.log), a connection timeout to the vCenter server is recorded: 
    error: failed to create vCenter client: failed to create vsphere client: Post "https://<VC_FQDN>/sdk": dial tcp 10.244.#.###:443: connect: connection timed out
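
To confirm that the timeout matches this article, check whether the address in the "dial tcp" fragment falls inside 10.244.0.0/16. The following is a minimal Python sketch that uses only the standard library. The function name, the sample host name vc.example.test, and the sample address 10.244.3.17 are illustrative assumptions, not values from a real environment.

```python
import ipaddress
import re

# Default KinD pod network named in this article.
KIND_POD_NETWORK = ipaddress.ip_network("10.244.0.0/16")

# Matches the "dial tcp <ip>:<port>" fragment of the error line shown above.
DIAL_RE = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def dialed_addresses_in_pod_network(log_text):
    """Return the de-duplicated dialed IPs that fall inside the pod network."""
    hits = set()
    for match in DIAL_RE.finditer(log_text):
        try:
            addr = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # skip malformed octets
        if addr in KIND_POD_NETWORK:
            hits.add(str(addr))
    return sorted(hits, key=ipaddress.ip_address)


# Hypothetical log line modeled on the excerpt above.
sample = (
    "error: failed to create vCenter client: failed to create vsphere client: "
    'Post "https://vc.example.test/sdk": dial tcp 10.244.3.17:443: '
    "connect: connection timed out"
)
print(dialed_addresses_in_pod_network(sample))
```

To check a real log, pass the contents of a copy of vmsp_bootstrap.log to the function. A non-empty result suggests the overlap described in the Cause section.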

Environment

  • VCF Installer 9.0.x
  • VCF Fleet Management 9.0.x
  • VCF Automation 9.0.x

Cause

  • During the deployment process, the VCF Installer relies on Fleet Management to execute the creation of the VCF Automation components. To facilitate this, Fleet Management spins up a temporary KinD (Kubernetes in Docker) bootstrap cluster.
    By default, this temporary KinD cluster uses the 10.244.0.0/16 subnet for its internal pod network. If physical infrastructure components, such as the target vCenter server, also have addresses in the 10.244.0.0/16 range, a routing conflict occurs. Traffic destined for vCenter is routed internally within the container network rather than traversing the physical network, resulting in a connection timeout.
  • Because the bootstrap engine cannot communicate with the vCenter server, it times out while waiting for the VMSP cluster nodes to be provisioned (e.g., kubernetesclusters/vcf-mgmt-########). Consequently, the VCF Installer receives a 404 NOT_FOUND response when attempting to query the status of the stalled/deleted deployment request, halting the workflow.
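
The overlap condition can also be checked before deployment. The sketch below assumes Python is available on a workstation that uses the same DNS as the appliances. The helper names and the sample addresses are hypothetical, and the code is not part of any VCF tooling.

```python
import ipaddress
import socket

# Default KinD pod network named in this article.
KIND_POD_NETWORK = ipaddress.ip_network("10.244.0.0/16")


def resolve_ipv4(hostname):
    """Resolve a host name to its IPv4 addresses with the system resolver."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def conflicting_addresses(addresses, pod_network=KIND_POD_NETWORK):
    """Return the addresses that fall inside the pod network."""
    return [a for a in addresses if ipaddress.ip_address(a) in pod_network]


# Literal sample addresses, so the example runs without DNS access.
print(conflicting_addresses(["10.244.12.5", "172.16.10.20"]))
```

In practice the input could come from `resolve_ipv4("<VC_FQDN>")`, with the placeholder replaced by the real vCenter FQDN. Any address returned by `conflicting_addresses` would be routed into the container network as described above.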

Resolution

To resolve this issue, you must clean up the failed deployment, modify the temporary bootstrap network to avoid the subnet overlap, deploy VCF Automation out-of-band, and finally bypass the stuck task in the VCF Installer.
Before following the steps below, take snapshots of the SDDC Manager and Fleet Management appliances.
  1. Clean up the failed component
    1. Log in to the VCF Operations (Fleet Management) UI at https://<VCF_Ops_FQDN>/vcf-operations/ui/ as the local admin user.
    2. In the left navigation panel, go to Fleet Management > Lifecycle.
    3. In the middle pane, select VCF Management.
    4. Click on Components.
    5. Locate the failed Automation component, click the vertical ellipsis (three dots), and select Delete to clean up the environment.
  2. Apply the Subnet Workaround
    Follow the steps outlined in KB 397561 to modify the bootstrap.sh script on the Fleet Management appliance. This forces the temporary KinD cluster to use an alternate, non-conflicting subnet (e.g., 10.255.0.0/16).
  3. Redeploy VCF Automation
    With the script updated, initiate the VCF Automation deployment directly from the Fleet Management UI. Wait for the deployment to complete successfully before proceeding to the next step.
  4. Skip the failed task in the VCF Installer
    Because the deployment was completed out-of-band in Fleet Management, the VCF Installer workflow is still stuck waiting on the original failed task. You must manually skip this sub-task in the SDDC Manager database. Follow the exact steps and use the script provided in KB 425177 to mark the sub-task as successful and resume the workflow.
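
When choosing the alternate subnet for step 2, it helps to verify that the candidate does not overlap any network used by the physical infrastructure. The sketch below is illustrative only. The infrastructure CIDRs are hypothetical, and the function does not modify bootstrap.sh. KB 397561 remains the authoritative procedure for the change itself.

```python
import ipaddress


def first_non_overlapping(candidates, infrastructure_cidrs):
    """Return the first candidate subnet that overlaps no infrastructure network."""
    infra = [ipaddress.ip_network(c, strict=False) for c in infrastructure_cidrs]
    for candidate in candidates:
        net = ipaddress.ip_network(candidate)
        if not any(net.overlaps(existing) for existing in infra):
            return str(net)
    return None  # every candidate conflicts


# Hypothetical infrastructure networks; replace with the real management ranges.
infrastructure = ["10.244.8.0/24", "10.0.0.0/16"]
print(first_non_overlapping(["10.244.0.0/16", "10.255.0.0/16"], infrastructure))
```

With these sample inputs the default 10.244.0.0/16 is rejected because it contains 10.244.8.0/24, and 10.255.0.0/16 is returned.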