Shrink EdgeCluster RETRY task fails in SDDC Manager after recovering from NSX-T MP outage.

Products

VMware Cloud Foundation

Issue/Introduction

Symptoms:

The Edge Cluster Shrink workflow fails while deleting the selected edge node from the edge cluster. The task panel shows the following error for the workflow's subtask 'Delete NSX-T Data Center Edge Node VM':

Message: Invalid parameter: {0}
Remediation Message:
Reference Token: 1PBV5U
Cause:
Type: com.vmware.vapi.std.errors.NotFound
Message: NotFound (com.vmware.vapi.std.errors.not_found) => { messages = [], data = struct => {error_message=The requested object : TransportNode/########-####-####-####-########33f3 could not be found. Object identifiers are case sensitive., httpStatus=NOT_FOUND, error_code=600, module_name=common-services}, errorType = NOT_FOUND }

The domain manager log file (/var/log/vmware/vcf/domainmanager/domainmanager.log) on the SDDC Manager virtual machine contains the following error information for the failed task:
2021-11-16T09:10:29.590+0000 INFO [vcf_dm,1bba40f51be1edaf,4708] [c.v.v.c.f.p.n.a.DeleteNsxtEdgeNodeVmAction,dm-exec-14] Found Edge node (ID: ########-####-####-####-########33f3).
2021-11-16T09:10:29.724+0000 DEBUG [vcf_dm,1bba40f51be1edaf,4708] [c.v.v.c.n.s.c.c.ApiConnection,dm-exec-14] Closed ApiClient connection.
2021-11-16T09:10:29.732+0000 ERROR [vcf_dm,1bba40f51be1edaf,4708] [c.v.e.s.o.model.error.ErrorFactory,dm-exec-14] [1PBV5U] VCF_ERRORS_GENERIC_INPUT_PARAM_ERROR Invalid parameter: {0}
com.vmware.evo.sddc.orchestrator.exceptions.OrchTaskException: Invalid parameter: {0}
        at com.vmware.vcf.common.fsm.plugins.nsxt.action.DeleteNsxtEdgeNodeVmAction.preValidate(DeleteNsxtEdgeNodeVmAction.java:95)
        at com.vmware.vcf.common.fsm.plugins.nsxt.action.DeleteNsxtEdgeNodeVmAction.preValidate(DeleteNsxtEdgeNodeVmAction.java:25)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionState.lambda$static$0(FsmActionState.java:18)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionState.invoke(FsmActionState.java:62)

Caused by: com.vmware.vapi.std.errors.NotFound: NotFound (com.vmware.vapi.std.errors.not_found) => {
    messages = [],
    data = struct => {error_message=The requested object : TransportNode/########-####-####-####-########33f3 could not be found. Object identifiers are case sensitive., httpStatus=NOT_FOUND, error_code=600, module_name=common-services},
    errorType = NOT_FOUND
}
        at com.vmware.vapi.std.errors.NotFound._newInstance2(NotFound.java:182)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)

2021-11-16T09:10:29.743+0000 DEBUG [vcf_dm,1bba40f51be1edaf,4708] [c.v.e.s.o.c.ProcessingTaskSubscriber,dm-exec-14] Collected the following errors for task with name DeleteNsxtEdgeNodeVmAction and ID ########-####-####-####-########0380: [ExecutionError [errorCode=null, errorResponse=LocalizableErrorResponse(messageBundle=com.vmware.vcf.common.fsm.plugins.nsxt.messages)], ExecutionError [errorCode=null, errorResponse=LocalizableErrorResponse(messageBundle=com.vmware.vcf.common.fsm.plugins.nsxt.messages)], ExecutionError [errorCode=null, errorResponse=LocalizableErrorResponse(messageBundle=com.vmware.evo.sddc.common.core.error.messages)]]
2021-11-16T09:10:29.755+0000 INFO [vcf_dm,1bba40f51be1edaf,fd0a] [c.v.e.s.o.c.ProcessingOrchestratorImpl,dm-exec-14] Prevalidation comp
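
To check whether a given domainmanager.log shows the same symptom, the log excerpt above can be searched for its three markers: the generic input-parameter error, the preValidate frame of DeleteNsxtEdgeNodeVmAction, and the TransportNode "could not be found" message. The following Python is a minimal sketch only, not a supported VMware tool. The function names, the SAMPLE lines, and the regular expression are illustrative assumptions derived from the excerpt.

```python
import re

# Matches the NotFound message quoted in the log excerpt and captures the
# transport node ID (everything after "TransportNode/" up to the next space).
NOT_FOUND_RE = re.compile(
    r"The requested object : TransportNode/(\S+) could not be found"
)


def find_missing_transport_nodes(log_lines):
    """Return transport node IDs reported as not found, without duplicates."""
    seen = []
    for line in log_lines:
        match = NOT_FOUND_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def matches_symptom(log_lines):
    """True only if all three markers from the excerpt are present."""
    lines = list(log_lines)
    has_param_error = any(
        "VCF_ERRORS_GENERIC_INPUT_PARAM_ERROR" in line for line in lines
    )
    has_prevalidate = any(
        "DeleteNsxtEdgeNodeVmAction.preValidate" in line for line in lines
    )
    return (
        has_param_error
        and has_prevalidate
        and bool(find_missing_transport_nodes(lines))
    )


# Illustrative lines shortened from the excerpt above (IDs are masked there too).
SAMPLE = [
    "2021-11-16T09:10:29.732+0000 ERROR [1PBV5U] "
    "VCF_ERRORS_GENERIC_INPUT_PARAM_ERROR Invalid parameter: {0}",
    "        at com.vmware.vcf.common.fsm.plugins.nsxt.action."
    "DeleteNsxtEdgeNodeVmAction.preValidate(DeleteNsxtEdgeNodeVmAction.java:95)",
    "    data = struct => {error_message=The requested object : "
    "TransportNode/########-####-####-####-########33f3 could not be found. "
    "Object identifiers are case sensitive., httpStatus=NOT_FOUND},",
]

print(matches_symptom(SAMPLE))
print(find_missing_transport_nodes(SAMPLE))
```

To run it against a real log, pass the lines of the file instead of SAMPLE, for example by opening /var/log/vmware/vcf/domainmanager/domainmanager.log and handing the file object to matches_symptom. A match only indicates that the log resembles this article's symptom. Still confirm the Edge node state in NSX before deleting anything.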

Environment

VMware Cloud Foundation 4.4.x
VMware Cloud Foundation 4.3.x

Cause

While the Edge Cluster shrink workflow was running, the VCF-managed virtual machines (vCenter and NSX-T) were rebooted. The outage occurred while the Edge Node virtual machine was being deleted as part of the shrink operation, and it left the selected Edge Node virtual machine in an inconsistent state in NSX. Because of this inconsistent state, the Edge Cluster shrink workflow cannot perform any action (i.e. delete) on the selected Edge Node virtual machine and reports the error shown above.

Once the VCF-managed virtual machines (vCenter and NSX) are back online and the failed Edge Cluster shrink workflow is restarted, the workflow resumes from the previously failed task, 'Delete NSX-T Data Center Edge Node VM'. That task cannot perform the deletion because the selected Edge virtual machine is in an inconsistent state.

Note: To confirm this state, log in to the NSX web console and navigate to System → Fabric → Nodes → Edge Transport Nodes. The selected Edge virtual machine is shown with the state DELETION FAILED/Configuration Error.

Resolution

The outage of the NSX and vCenter virtual machines left the system in an inconsistent state (a partially deleted edge node). In this case, an automated retry of the Edge Cluster shrink workflow is not supported. Manually delete the selected Edge Node virtual machine from NSX, then restart the Edge Cluster shrink workflow from SDDC Manager to complete the shrink operation.

Workaround:
To work around the issue, follow the steps below:

1. Log in to the NSX web console and navigate to System → Fabric → Nodes → Edge Transport Nodes. The selected Edge virtual machine is shown with the state DELETION FAILED/Configuration Error.
2. Select the failed Edge node and click DELETE to forcibly delete the node from NSX.
3. Log in to the SDDC Manager web console and restart the failed Edge Cluster shrink workflow/task.