Troubleshooting Edge Deletion Failures in VMware NSX

Products

VMware NSX
VMware NSX-T Data Center

Issue/Introduction

This article addresses a stale edge stuck in the "Deletion in progress" state in the NSX UI. Symptoms include:

- Deleting powered-off, orphaned, or disconnected edge VMs fails, specifically for edges auto-deployed via NSX Manager.
- An NSX edge node is deleted from the NSX UI, but its stale entry remains in the vCenter inventory.
- The VTEP IP addresses assigned to an edge are released on the NSX side once the edge is deleted. These IPs might then be assigned to new edges, causing duplicate IPs and potential disruptions.
- An edge node is marked as orphaned.

Environment

VMware NSX
VMware NSX-T Data Center

Cause

Initiating a delete operation on an NSX edge node from the NSX UI causes NSX Manager to contact the edge directly for deletion.
If NSX Manager cannot reach the edge node, it asks vCenter to perform the delete operation (this applies only when the edge node was deployed via the NSX UI and not from an OVA).
If both steps fail, the edge gets stuck in the "Delete in progress" state.

Resolution

This issue is resolved in VMware NSX 4.1.1 and 3.2.4 available at Broadcom Downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

1. Attempt Edge Deletion:
- Use the DELETE https://<manager-ip>/api/v1/transport-nodes/<tn-id> API.
  - This can be run with curl or a tool such as Postman; see the NSX API guide.
- For NSX-T 3.2.1 and later, if the transport node is orphaned, the stale-entry cleanup API (clean_stale_entries, see step 5) clears the table entries.
- Note: This API won't work for versions prior to 3.2.1.
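
As an alternative to curl or Postman, the DELETE call can be sketched in Python. This is a minimal sketch, not a definitive client: it assumes HTTP Basic authentication, and the manager address, transport node ID, and credentials shown are hypothetical placeholders. The request is only built, not sent.

```python
import base64
import urllib.request


def build_delete_request(manager_ip: str, tn_id: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the DELETE request for a transport node."""
    url = f"https://{manager_ip}/api/v1/transport-nodes/{tn_id}"
    # Assumption: the manager accepts HTTP Basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Basic {token}"},
    )


# Hypothetical values for illustration only.
req = build_delete_request("nsx-mgr.example.com", "<tn-id>", "admin", "<password>")
print(req.get_method(), req.full_url)
```

To send the request, pass it to urllib.request.urlopen together with an ssl.SSLContext that trusts the NSX Manager certificate.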
2. Monitor Deletion Progress:
- Check the status of the deletion using the GET https://<manager-ip>/api/v1/transport-nodes/<tn-id>/state API.
- Background retries (with exponential backoff) attempt to complete the deletion on the edge.
- Users might observe changes in the /state API output while the retries run.
- The “state” field in the /state API output transitions from “pending” -> “in_progress” -> “failed” -> “orphaned”.
3. Identify Stuck Deletion:
- If the deletion has not completed after 30 minutes, treat it as stuck.
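
The monitoring and 30-minute rule above can be sketched as a small helper. This is an illustrative sketch: the function names are hypothetical, and it assumes the caller passes None for the state document once the transport node no longer exists on the manager.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STUCK_AFTER = timedelta(minutes=30)  # threshold taken from the step above


def state_url(manager_ip: str, tn_id: str) -> str:
    """URL of the /state API for one transport node."""
    return f"https://{manager_ip}/api/v1/transport-nodes/{tn_id}/state"


def is_deletion_stuck(delete_started: datetime, now: datetime, state_doc: Optional[dict]) -> bool:
    """Return True when the node still exists more than 30 minutes after the delete.

    state_doc is the parsed JSON body of the /state response, or None once the
    node is gone (assumption: the caller maps "not found" to None).
    """
    if state_doc is None:
        return False
    return now - delete_started > STUCK_AFTER


started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # hypothetical timestamp
print(state_url("nsx-mgr.example.com", "<tn-id>"))
print(is_deletion_stuck(started, started + timedelta(minutes=45), {"state": "orphaned"}))
```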
4. Sample Output Indicating Stuck Deletion:

{
   "details": [
      {
         "failure_code": 8804,
         "failure_message": "Host configuration: Failed to send the HostConfig message. [TN=TransportNode/<#Edge_UUID#>]. Reason: Failed to send HostConfig RPC to MPA node:<#Edge_UUID#>. Error: Unable to reach client <#Edge_UUID#>, application SwitchingVertical.",
         "state": "orphaned",
         "sub_system_id": "",
         "sub_system_type": "Host"
      }
   ],
   "failure_code": 8804,
   "failure_message": "Host configuration failed. Number of retries: 1298. Next retry attempt will be between [DATE-TIME] and [DATE-TIME] (UTC).",
   "maintenance_mode_state": "DISABLED",
   "node_deployment_state": {
      "state": "DELETE_IN_PROGRESS"
   },
   "state": "orphaned",
   "transport_node_id": "<#Edge_UUID#>"
}
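
The fields that matter in the sample output can be pulled out programmatically. The sketch below is illustrative only: the function name and the "looks_stuck" rule (state "orphaned" combined with DELETE_IN_PROGRESS) are assumptions drawn from the sample, not a documented NSX contract.

```python
import json
import re

# Trimmed copy of the sample output above; only the fields used below are kept.
SAMPLE = """
{
  "failure_code": 8804,
  "failure_message": "Host configuration failed. Number of retries: 1298. Next retry attempt will be between [DATE-TIME] and [DATE-TIME] (UTC).",
  "node_deployment_state": {"state": "DELETE_IN_PROGRESS"},
  "state": "orphaned",
  "transport_node_id": "<#Edge_UUID#>"
}
"""


def summarize_state(doc: dict) -> dict:
    """Extract the state, deployment state, and retry count from a /state response."""
    deployment = doc.get("node_deployment_state", {}).get("state")
    match = re.search(r"Number of retries: (\d+)", doc.get("failure_message", ""))
    return {
        "state": doc.get("state"),
        "deployment_state": deployment,
        "retries": int(match.group(1)) if match else None,
        # Assumption: this combination is what the sample shows for a stuck delete.
        "looks_stuck": doc.get("state") == "orphaned" and deployment == "DELETE_IN_PROGRESS",
    }


print(summarize_state(json.loads(SAMPLE)))
```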

5. If the deletion is stuck because of network disruption between NSX Manager and the edge VM, manual intervention is needed. Follow these steps:

Clean up the edge VM from vCenter.
For bare metal edges and edge VMs deployed using OVA, run the "del nsx" command on the edge CLI.
Use the API POST https://<manager-ip>/api/v1/transport-nodes?action=clean_stale_entries to clean stale edge VMs on NSX Manager.
Wait up to 5 minutes for stale entities to be wiped out. For bare metal edges (BMEs), the cleanup API call can sometimes return the following error; this is a known issue that is fixed in version 4.1.1:

{ "httpStatus": "BAD_REQUEST", "error_code": 16077, "module_name": "FABRIC", "error_message": "[Fabric] Refresh <#Edge_UUID#> placement references failed." }

If the cleanup (clean_stale_entries) API doesn’t remove all stale entries, retry steps 4 & 5.
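
The cleanup call and the known bare metal edge error can be handled along these lines. This is a hedged sketch: authentication is omitted, the function names are hypothetical, and treating an empty body or a body without "error_code" as "no error" is an assumption rather than documented behavior.

```python
import json
import urllib.request

KNOWN_BME_ERROR = 16077  # error code from the sample error above


def build_cleanup_request(manager_ip: str) -> urllib.request.Request:
    """Build (but do not send) the clean_stale_entries POST request."""
    url = f"https://{manager_ip}/api/v1/transport-nodes?action=clean_stale_entries"
    return urllib.request.Request(url, method="POST")


def classify_cleanup_response(body: str) -> str:
    """Classify a response body as 'no_error', 'known_bme_issue', or 'error'."""
    if not body.strip():
        return "no_error"  # assumption: success may return an empty body
    code = json.loads(body).get("error_code")
    if code == KNOWN_BME_ERROR:
        return "known_bme_issue"
    return "error" if code is not None else "no_error"


sample_error = (
    '{ "httpStatus": "BAD_REQUEST", "error_code": 16077, "module_name": "FABRIC", '
    '"error_message": "[Fabric] Refresh <#Edge_UUID#> placement references failed." }'
)
print(build_cleanup_request("nsx-mgr.example.com").full_url)
print(classify_cleanup_response(sample_error))
```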

Note: This workaround applies to NSX-T 3.2.1 and later releases.

If the edge VM still appears in the NSX UI inventory after following the workaround, check and then restart the nsx-proxy service on the host that was hosting the edge VM:
/etc/init.d/nsx-proxy status
/etc/init.d/nsx-proxy restart

Additional Information

Enhancements in Versions 3.2.1.1 and 4.0.0.1

This KB article addresses challenges related to deleting powered-off, orphaned, or disconnected edge VMs, specifically those auto-deployed through NSX Manager.
Previously, as described in the Issue/Introduction section, NSX did not delete an edge that was unreachable, because of the risk of duplicate VTEP addresses. With the latest enhancements, the updated behavior is as follows:

1. Standard Deletion Workflow: If the edge is reachable and the host switch configuration is cleared from the edge, it is safe to delete the edge VM from NSX and vCenter (VC).

2. If the first step fails, typically because of connectivity problems between the edge and the manager, NSX Manager checks the NSX inventory for the edge VM:

a. If the Edge VM is in the NSX Inventory:

i. If the VM is found in vCenter (VC), NSX Manager powers it off, deletes it, and releases the associated VTEP resources.

ii. If the VM isn't found in VC (possibly due to a changed MORef ID unrecognized by NSX), an alarm is raised. In such cases, if the edge VM exists on VC with a different MORef ID, users should apply the provided workaround. Note that changes in MORef ID can happen if the edge VM gets restored post-backup, is removed and re-added to VC inventory, or undergoes a vMotion operation.

NSX Manager can't find Edge VMs deployed on an ESXi host that was not an NSX transport node before NSX-T 3.2, even if NSX-T is upgraded to 3.2 or later after deployment.
NSX Manager can find such Edge VMs if they were deployed in NSX-T 3.2 and deletion is invoked in NSX-T 3.2.4 or later.

iii. If the VM doesn't exist in VC because of manual user deletion, the workaround should still be employed for edge cleanup. The challenge lies in distinguishing between a deleted VM and one in VC with a changed MORef ID.

b. If the Edge VM isn't in the NSX Inventory: This might be due to NSX inventory discrepancies or the VM's deletion from VC. Users should refer to the workaround to delete the edge from NSX.
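
The decision flow in steps 1 and 2 can be summarized as a small function. This is only a reading aid: the names are hypothetical, and the real logic inside NSX Manager is not published in this article.

```python
from enum import Enum


class Action(Enum):
    DELETE_FROM_NSX_AND_VC = "standard deletion (step 1)"
    POWER_OFF_DELETE_RELEASE_VTEP = "power off, delete VM, release VTEP (2.a.i)"
    RAISE_ALARM_APPLY_WORKAROUND = "alarm raised, user applies workaround (2.a.ii / 2.a.iii)"
    APPLY_WORKAROUND = "user applies workaround (2.b)"


def next_action(edge_reachable: bool, host_switch_cleared: bool,
                in_nsx_inventory: bool, found_in_vc: bool) -> Action:
    """Map the conditions described above to the resulting action."""
    if edge_reachable and host_switch_cleared:
        return Action.DELETE_FROM_NSX_AND_VC
    if in_nsx_inventory:
        if found_in_vc:
            return Action.POWER_OFF_DELETE_RELEASE_VTEP
        # A changed MORef ID and a manually deleted VM look the same from NSX.
        return Action.RAISE_ALARM_APPLY_WORKAROUND
    return Action.APPLY_WORKAROUND


print(next_action(edge_reachable=False, host_switch_cleared=False,
                  in_nsx_inventory=True, found_in_vc=False).value)
```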

Important Note

Should a network disconnection occur between the manager and edges, and if the host switch configuration along with the edge gets deleted from the manager, VTEP resources will be freed. The subsequently released IP might be allocated to a new edge from the IP pool. Such actions can produce duplicate edge IP addresses, creating serious datapath disruptions.

To avoid such scenarios, NSX Manager attempts to establish connectivity with the edge/VC prior to the edge VM's deletion. It's imperative to understand that if the manager can't access the Edge or VC, it can't deduce that the edge has been deleted, warranting user intervention.

Further, with these improvements, even if the edge becomes unreachable from NSX Manager, it remains trackable through VC and the NSX inventory because it was auto-deployed on VC. This facilitates edge identification and cleanup from VC when needed.
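
The duplicate-IP risk described in this note can be illustrated with a toy model. Everything here is hypothetical: the pool class, the edge names, the documentation-range addresses, and the assumption that a released address is handed out again first.

```python
class IpPool:
    """Toy VTEP IP pool. Assumption: a released address is reused first."""

    def __init__(self, addresses):
        self.free = list(addresses)
        self.allocated = {}

    def allocate(self, owner):
        ip = self.free.pop(0)
        self.allocated[owner] = ip
        return ip

    def release(self, owner):
        self.free.insert(0, self.allocated.pop(owner))


pool = IpPool(["192.0.2.10", "192.0.2.11"])

# Addresses actually configured on edges, independent of the manager's view.
configured_on_edge = {}
configured_on_edge["edge-old"] = pool.allocate("edge-old")

# The manager loses contact with edge-old and deletes it anyway: the pool
# entry is released, but the unreachable edge keeps its VTEP IP.
pool.release("edge-old")

# A new edge is deployed and receives the address that was just released.
configured_on_edge["edge-new"] = pool.allocate("edge-new")

in_use = list(configured_on_edge.values())
duplicates = {ip for ip in in_use if in_use.count(ip) > 1}
print(duplicates)
```

Both edges end up configured with the same address, which is the condition the connectivity check before deletion is meant to prevent.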

Impact/Risks:
Leaving stale edge VM entries in the VC inventory can disrupt the datapath.