Failed to deploy NSX-T Edge From SDDC Manager with VM_DEPLOYMENT_FAILED error

Article ID: 327080

Products

VMware Cloud Foundation

Issue/Introduction

  • Deployment of an NSX-T Edge cluster from SDDC Manager fails with a VM_DEPLOYMENT_FAILED error.
  • The start of the Edge node creation task is recorded in /var/log/vmware/vcf/domainmanager/domainmanager.log:
    YYYY-MM-DDTHH:SS:MS INFO [vcf_dm,] [c.v.e.s.c.s.a.t.TaskAggregatorAdapterImpl,http-nio-x.x.x.x-exec-2] Registering the task {"creationTime":yyyyyyyyyyyyy,"taskId":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx12c","taskModel":"FSM","taskRetry":{"errorCodes":[404,500,501],"method":"PATCH","successCode":202,"url":"http://localhost/domainmanager/workflows/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx12c"},"taskType":"NSXT_EDGECLUSTER_CREATION","taskURL":"http://localhost/domainmanager/workflows/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx12c"}
  • Ping tests to one of the Edge nodes (or possibly the VIP) then fail. The following snippets appear in /var/log/vmware/vcf/domainmanager/domainmanager.log:
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.v.n.h.NsxtEdgeClusterValidationUtil,dm-exec-7] Network pool overlaps: None
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.e.s.c.util.HostValidationUtil,dm-exec-7] Trying to ping to x.x.x.x
    DEBUG [vcf_dm,] [c.v.e.s.c.util.HostValidationUtil,dm-exec-7] Verify ping connectivity to x.x.x.x with command ping x.x.x.x -c
    DEBUG [vcf_dm,] [c.v.e.s.c.util.LocalProcessService,dm-exec-7] Executing the Local command: ping x.x.x.x -c
    DEBUG [vcf_dm,0000000000000000,0000] [c.v.v.s.c.s.SecurityConfigurationServiceImpl,pool-1-thread-1] Security config retrieved {"certificateValidationEnabled":false,"fipsMode":false}
    DEBUG [vcf_dm,0000000000000000,0000] [c.v.v.secure.config.LazyTrustManager,pool-1-thread-1] Check if cert validation is enabled false
    DEBUG [vcf_dm,] [c.v.v.n.h.NsxtEdgeClusterValidationUtil,pool-3-thread-41] PING x.x.x.x 56(84) bytes of data.
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.v.n.h.NsxtEdgeClusterValidationUtil,pool-3-thread-41]
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.v.n.h.NsxtEdgeClusterValidationUtil,pool-3-thread-41] --- x.x.x.x ping statistics ---
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.v.n.h.NsxtEdgeClusterValidationUtil,pool-3-thread-41] 5 packets transmitted, 0 received, 100% packet loss, time 1000ms
    DEBUG [vcf_dm,] [c.v.v.n.h.NsxtEdgeClusterValidationUtil,pool-3-thread-41]
    ERROR [vcf_dm,] [c.v.e.s.c.util.LocalProcessService,dm-exec-7] Local Command Failed with exit value 1. Output Logs:
    LocalProcess Output: YYYY-MM-DDTHH:SS:MS - PING x.x.x.x  56(84) bytes of data.
    LocalProcess Output:
    LocalProcess Output:  --- x.x.x.x ping statistics ---
    LocalProcess Output: YYYY-MM-DDTHH:SS:MS - 5 packets transmitted, 0 received, 100% packet loss, time 1000ms.
  • The workflow then fails because the Edge nodes cannot be contacted; the Edge node OVF did not deploy:
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.v.c.n.s.c.c.ApiConnection,dm-exec-18] Closed ApiClient connection.
    YYYY-MM-DDTHH:SS:MS ERROR [vcf_dm,] [c.v.v.c.f.p.n.a.CreateNsxtEdgeNodeVmAction,dm-exec-ab] Edge node creation failed, node state is pending, VM deployment state is VM_DEPLOYMENT_FAILED
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.v.c.n.s.c.c.ApiConnection,dm-exec-18] Closed ApiClient connection.
    YYYY-MM-DDTHH:SS:MS ERROR [vcf_dm,] [c.v.e.s.o.model.error.ErrorFactory,dm-exec-18] DEPLOY_NSXT_EDGE_FAILED Failed to deploy NSX-T Edge xxxx on <nsx manager>: Failed to deploy NSX-T Edge xxxx on <nsx manager>.
    
            at com.vmware.vcf.common.fsm.plugins.nsxt.action.CreateNsxtEdgeNodeVmAction.execute(CreateNsxtEdgeNodeVmAction.java:438)
            at com.vmware.vcf.common.fsm.plugins.nsxt.action.CreateNsxtEdgeNodeVmAction.execute(CreateNsxtEdgeNodeVmAction.java:59)
            at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionState.invoke(FsmActionState.java:62)
            at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionPlugin.invoke(FsmActionPlugin.java:159) 
    
    Caused by: java.lang.IllegalArgumentException: Edge node slopvmanedgeaf302 creation failed, node state is pending, VM deployment state is VM_DEPLOYMENT_FAILED.
  • SDDC Manager then tries to delete the failed Edge cluster object, but the undo operation also fails:
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.v.c.n.s.c.c.NsxtManagerTransportNodeOperations,dm-exec-16] Error occurred while trying to get transport node xyz02.com: Unable to find transport node with name xyz02.com
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.v.c.f.p.n.a.CreateNsxtEdgeNodeVmAction,dm-exec-16] Getting state for edge node xyz02.com
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.v.c.n.s.c.c.ApiConnection,dm-exec-16] Closed ApiClient connection.
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.v.c.f.p.n.h.NsxtCommonOperations,dm-exec-16] Timeout waiting for Edge node xyz02.com to be deleted
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.v.c.f.p.n.a.CreateNsxtEdgeNodeVmAction,dm-exec-16] Edge node xyz02.com still exists
    YYYY-MM-DDTHH:SS:MS DEBUG [vcf_dm,] [c.v.v.c.n.s.c.c.ApiConnection,dm-exec-16] Closed ApiClient connection.
    YYYY-MM-DDTHH:SS:MS ERROR [vcf_dm,] [c.v.e.s.o.model.error.ErrorFactory,dm-exec-16]  DEPLOY_NSXT_EDGE_UNDO_FAILED Failed to undo NSX-T Edge nsx manager
    com.vmware.evo.sddc.orchestrator.exceptions.OrchTaskException: Failed to undo NSX-T Edge deployment on nsxmanager
            at com.vmware.vcf.common.fsm.plugins.nsxt.action.CreateNsxtEdgeNodeVmAction.lambda$undo$2(CreateNsxtEdgeNodeVmAction.java:528)
  • The cause of the OVF deployment failure is 'Host did not have any virtual network defined':
    DEBUG [vcf_dm,] [c.v.v.c.f.p.n.h.NsxtCommonOperations,dm-exec-18] Finished waiting for Edge node xyz to become ready, currentState is {"details":[{"failureMessage":"Waiting for edge node to be ready.","state":"pending","subSystemId":"","subSystemType":"Host","__dynamicStructureFields":{"fields":{},"name":"struct"}}],"state":"pending","maintenanceModeState":"DISABLED","nodeDeploymentState":{"failureCode":16020,"failureMessage":"Ovf deploy for vm xyz02 failed on vc: Host did not have any virtual network defined.","state":"VM_DEPLOYMENT_FAILED","__dynamicStructureFields":{"fields":{},"name":"struct"}},"transportNodeId":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx12c","__dynamicStructureFields":{"fields":{},"name":"struct"}}
  • Running either of the following curl commands from SDDC Manager shows that an earlier NSX-T upgrade was never completed:
    curl -k -s -u 'admin' -X GET https://<nsxmanager_vip_ip>/upgrade-coordinator/api/v1/upgrade/history
    
    or
    
    curl -k -s -u 'admin' -X GET https://<nsxmanager_vip_ip>/api/v1/upgrade/summary

    Sample output:

    {
        "initial_version": "3.1.3.7.0.19380457",
        "target_version": "3.2.1.2.0.20541212",
        "timestamp": 1683276202494,
        "upgrade_status": "STARTED"    <---- the upgrade never completed
    }
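The log signatures quoted above can be counted in one pass when triaging a failed deployment. The following is a minimal sketch, not a supported tool: the function name edge_failure_signatures is hypothetical, and the strings it searches for are simply the ones that appear in the excerpts in this article.

```shell
# Count how many log lines contain each failure signature quoted in this article.
# Usage: edge_failure_signatures [logfile]
# The default path is the domainmanager.log location given above.
edge_failure_signatures() {
    log="${1:-/var/log/vmware/vcf/domainmanager/domainmanager.log}"
    for sig in \
        "VM_DEPLOYMENT_FAILED" \
        "DEPLOY_NSXT_EDGE_FAILED" \
        "DEPLOY_NSXT_EDGE_UNDO_FAILED" \
        "Host did not have any virtual network defined" \
        "100% packet loss"
    do
        # -F matches the signature as a fixed string, -c prints the number of matching lines.
        # grep exits non-zero when nothing matches, so "|| true" keeps the loop going.
        count=$(grep -cF -- "$sig" "$log" || true)
        printf '%s: %s\n' "$sig" "$count"
    done
}
```

Calling edge_failure_signatures with no argument on the SDDC Manager appliance reads the default log; a non-zero count for "Host did not have any virtual network defined" is the signature specific to this article.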

 

Environment

VMware Cloud Foundation 4.x

Cause

This appears to be a niche case that occurs when an NSX-T upgrade from source version 3.1.3.7.0 to target version 3.2.1.2.0 or higher was started but never completed.

The 3.2.1.2 Edge OVF defines one more network than NSX Manager (still reporting "node_version": "3.1.3.7.0.19380482") is aware of.

While NSX-T is in this "upgrading" state, the Edge node deployment involves an OVF with 5 NICs.

Because NSX-T is still essentially at version 3.1.3.7.0, the deployment carries a network configuration for only 4 NICs. The fifth NIC is left without a network, which is consistent with the "Host did not have any virtual network defined" failure message.
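To check whether an environment matches this condition, the output of the curl commands from the Issue section can be piped through a small filter. This is a minimal sketch under stated assumptions: the function names are hypothetical, the parsing uses grep and sed rather than a JSON parser, only the first occurrence of each field is read, and "STARTED" is the only status value this article documents.

```shell
# Print the first string value of a JSON field read from stdin.
# Text-based extraction only; adequate for flat output like the sample in this article.
json_string_field() {
    field="$1"
    grep -o "\"$field\"[[:space:]]*:[[:space:]]*\"[^\"]*\"" \
        | head -n 1 \
        | sed 's/^"[^"]*"[[:space:]]*:[[:space:]]*"\(.*\)"$/\1/'
}

# Read an upgrade summary from stdin, print the versions and status,
# and return success (0) only when the status is STARTED.
summarize_nsx_upgrade() {
    json=$(cat)
    initial=$(printf '%s\n' "$json" | json_string_field initial_version)
    target=$(printf '%s\n' "$json" | json_string_field target_version)
    status=$(printf '%s\n' "$json" | json_string_field upgrade_status)
    printf 'initial=%s target=%s status=%s\n' "$initial" "$target" "$status"
    [ "$status" = "STARTED" ]
}
```

For example, piping the output of curl -k -s -u 'admin' -X GET https://<nsxmanager_vip_ip>/api/v1/upgrade/summary into summarize_nsx_upgrade prints one summary line. An initial version of 3.1.3.7.0, a target of 3.2.1.2.0 or higher, and a status of STARTED together match the case described here.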

 

Resolution

Workaround:

  1. Create a standard switch on ALL hosts in the target cluster, and on it create a port group named 'VM Network' (no uplinks are needed).
  2. Retry the Edge deployment; it should now complete.
  3. Once the Edge(s) are deployed, the temporary switches can be removed. First, edit the settings of each Edge node VM and disconnect NIC 5.
  4. Then delete the standard switches from the hosts.
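The switch creation and cleanup can be done in the vSphere Client, or from each host's shell with esxcli. The sketch below only prints the commands for review and does not execute anything. It assumes shell access to the hosts; the switch name vSwitchEdgeTmp and the function names are hypothetical, while 'VM Network' is the port group name required by the workaround.

```shell
# Temporary switch name (hypothetical; override by setting VSWITCH).
VSWITCH="${VSWITCH:-vSwitchEdgeTmp}"
# Port group name required by the workaround.
PORTGROUP="VM Network"

# Print the esxcli commands that create the temporary switch and port group.
# Arguments: one label per host in the target cluster.
print_create_cmds() {
    for host in "$@"; do
        printf '# run on host: %s\n' "$host"
        printf 'esxcli network vswitch standard add --vswitch-name=%s\n' "$VSWITCH"
        printf 'esxcli network vswitch standard portgroup add --portgroup-name="%s" --vswitch-name=%s\n' "$PORTGROUP" "$VSWITCH"
    done
}

# Print the esxcli commands that remove the port group and then the switch.
# Intended for use only after NIC 5 has been disconnected on every Edge node VM.
print_cleanup_cmds() {
    for host in "$@"; do
        printf '# run on host: %s\n' "$host"
        printf 'esxcli network vswitch standard portgroup remove --portgroup-name="%s" --vswitch-name=%s\n' "$PORTGROUP" "$VSWITCH"
        printf 'esxcli network vswitch standard remove --vswitch-name=%s\n' "$VSWITCH"
    done
}
```

For example, print_create_cmds esx01 esx02 lists the two commands to run on each of the two hosts; no uplink is added, in line with the workaround.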