Troubleshooting Edge Deletion Failures in VMware NSX

Article ID: 345813

Updated On:

Products

VMware NSX VMware NSX-T Data Center

Issue/Introduction

A stale edge remains stuck in the "Deletion in progress" state in the NSX UI. Related symptoms:
  1. Deletion fails for powered-off, orphaned, or disconnected edge VMs, specifically those auto-deployed via NSX Manager.
  2. The NSX edge node is deleted from the NSX UI, but a stale entry remains in the vCenter inventory.
  3. The VTEP assignments (IP addresses) associated with an edge are released on the NSX side once it is deleted. These IPs might be assigned to new edges, causing duplicate IPs and potential disruptions.
  4. An edge node is marked as orphaned.


Environment

  • VMware NSX
  • VMware NSX-T Data Center

Cause

  • The edge being deleted from NSX Manager has already been deleted from vCenter (VC), or is powered off, orphaned, or disconnected. This applies to edges auto-deployed via NSX Manager as well as edges deployed via OVF from VC.
  • The host or datastore on which the edge is deployed is corrupt, crashed, or unresponsive.
  • The Edge transport node (TN) was deleted incorrectly (for example, using a corfu command).
  • Initiating a delete operation on the NSX Edge node from the NSX UI causes NSX Manager to contact the edge directly for deletion.
  • If NSX Manager fails to contact the Edge node, it asks vCenter to perform the delete operation (this applies when the edge node was deployed via the NSX UI and not via OVA).
  • If both steps fail, the edge gets stuck in the "Deletion in progress" state.

Resolution

Make sure the edge to be deleted is not a member of any edge cluster. If the edge is in use, first remove it from the edge cluster, following the steps in the NSX documentation. The generic steps for deleting an edge VM or bare-metal (BM) edge are given below.

NSX-T versions 3.2.x - 4.2:

  1. Attempt Edge Deletion:
    1. UI:
      • You can go to System > Fabric > Nodes > Select the target edge > Perform the Delete action
    2. API:
      • Use the DELETE https://<manager-ip>/api/v1/transport-nodes/<tn-id> API. This can be run using curl or a tool such as Postman; see the NSX API guide.
  2. If you have deleted the edge from NSX Manager (API or UI) but the UI shows it stuck in the Deletion in progress state for a long time (more than 15 minutes), the Edge VM is either unreachable or already deleted from VC. To confirm via API, use GET https://<manager-ip>/api/v1/transport-nodes/<tn-id>/state and check that "node_deployment_state" is set to "DELETE_IN_PROGRESS" and "state" is set to "orphaned".
  3. In that case, make sure the edge VM is deleted from vCenter, then call the following API:
    1. POST https://<manager-ip>/api/v1/transport-nodes?action=clean_stale_entries This API cleans up all stale edges from NSX Manager. It can be run using curl or a tool such as Postman; see the NSX API guide.
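
The three calls in this sequence can be sketched in Python as follows. This is a minimal sketch, not official tooling: it only builds the requests and sends nothing, the helper names `build_request` and `is_stuck_orphan` are hypothetical, the manager address and transport node ID are placeholders, and authentication and TLS handling are left out.

```python
import urllib.request


def build_request(manager_ip, method, path):
    """Build (without sending) an HTTPS request to NSX Manager."""
    return urllib.request.Request(f"https://{manager_ip}{path}", method=method)


def is_stuck_orphan(state_body):
    """Check the two fields that step 2 says to look at.

    Assumption: both fields are read as top-level keys, exactly as the step
    describes them. Adjust the lookups if your response nests them.
    """
    return (
        state_body.get("node_deployment_state") == "DELETE_IN_PROGRESS"
        and state_body.get("state") == "orphaned"
    )


MANAGER = "192.0.2.10"  # placeholder manager address
TN_ID = "edge-tn-01"    # placeholder transport node ID

# Step 1: normal delete. Step 2: state check. Step 3: stale-entry cleanup.
delete_req = build_request(MANAGER, "DELETE", f"/api/v1/transport-nodes/{TN_ID}")
state_req = build_request(MANAGER, "GET", f"/api/v1/transport-nodes/{TN_ID}/state")
cleanup_req = build_request(
    MANAGER, "POST", "/api/v1/transport-nodes?action=clean_stale_entries"
)
```

Keeping request construction separate from sending makes it easy to review the exact method and URL before anything is executed against a production manager.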

NSX-T versions > 4.2.0:

  1. Attempt Edge Deletion
    1. UI:
      • You can go to System > Fabric > Nodes > Select the target edge > Perform the Delete action
    2. API:
      • Use the DELETE https://<manager-ip>/api/v1/transport-nodes/<tn-id> API. This can be run using curl or a tool such as Postman; see the NSX API guide.
  2. If you have deleted the edge from NSX Manager (API or UI) but the UI shows it stuck in the Delete Failed state for a long time (more than 15 minutes), the Edge VM/BM is either unreachable or already deleted from VC.
  3. Make sure you have removed the edge VM from vCenter, or run the "del nsx" nsxcli command on a bare-metal edge (BME).
    • Using the UI:
      • Click the Delete Failed state in the UI to open a pop-up.
      • Remove the edge by clicking Done, Remove from NSX.
    • Using the API:
      • To clean up a specific stale edge VM, use the following API:
      • POST https://<manager-ip>/api/v1/transport-nodes/<tn-id>/action/clean-stale-entries
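
As an illustration of the per-node cleanup call, the sketch below builds the POST request with Python's standard library. It assumes HTTP basic authentication; the function name `stale_cleanup_request`, the address, the node ID, and the credentials are all placeholders introduced for this example. The request is built but not sent.

```python
import base64
import urllib.parse
import urllib.request


def stale_cleanup_request(manager_ip, tn_id, username, password):
    """Build the per-node clean-stale-entries POST request (not sent here)."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    safe_id = urllib.parse.quote(tn_id, safe="")
    url = (
        f"https://{manager_ip}/api/v1/transport-nodes/"
        f"{safe_id}/action/clean-stale-entries"
    )
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )


# Placeholder values; sending the request is left to the caller.
req = stale_cleanup_request("192.0.2.10", "edge-tn-01", "admin", "secret")
```

Unlike the `clean_stale_entries` action on the collection, this call targets one transport node, so the ID in the path decides which entry is removed.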

NSX-T versions >= 9.0:

If you have already attempted edge deletion from NSX Manager but the deletion has been stuck for a long time, and the edge VM is deleted from vCenter or the edge is unreachable, you can call the following API to clean up a specific stale edge VM:

  • DELETE https://<manager-ip>/policy/api/v1/infra/sites/<site-id>/enforcement-points/<enforcementpoint-id>/edge-transport-nodes/<edge-transport-node-id>?force=true
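
Because this URL has several variable segments, a small path builder can help avoid typos. The sketch below is an illustration only: the function name `edge_tn_policy_path` is hypothetical, the IDs in the usage line are example values, and the `/policy/api/v1` base path is assumed from the usual NSX Policy API prefix.

```python
from urllib.parse import quote, urlencode


def edge_tn_policy_path(site_id, enforcement_point_id, edge_tn_id, force=False):
    """Return the Policy API path for one edge transport node."""
    segments = [
        "infra",
        "sites", site_id,
        "enforcement-points", enforcement_point_id,
        "edge-transport-nodes", edge_tn_id,
    ]
    # Percent-encode each segment so an ID cannot inject extra path parts.
    path = "/policy/api/v1/" + "/".join(quote(s, safe="") for s in segments)
    if force:
        # force=true is the query parameter used for stale-edge cleanup.
        path += "?" + urlencode({"force": "true"})
    return path


# Example values only; substitute the IDs from your environment.
force_delete_path = edge_tn_policy_path("default", "default", "edge-01", force=True)
```

The same builder with `force=False` yields the path for the normal delete, so both calls stay consistent.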

The following steps give a generic way to troubleshoot edge deletion:

  1. Attempt Edge Deletion
    1. UI:
      • Go to System > Fabric > Nodes (called Edges in newer versions) > Select the target edge > Perform the Delete action
    2. API:
      • Use the DELETE https://<manager-ip>/policy/api/v1/infra/sites/<site-id>/enforcement-points/<enforcementpoint-id>/edge-transport-nodes/<edge-transport-node-id> API.
      • This can be run using curl or a tool such as Postman; see the NSX API guide.
  2. If you have deleted the edge from NSX Manager (API or UI) but the UI shows it stuck in the Delete Failed state for a long time (more than 15 minutes), the Edge VM/BM is either unreachable or already deleted from VC. To confirm via API:
    1. Call GET https://<manager-ip>/policy/api/v1/infra/sites/<site-id>/enforcement-points/<enforcementpoint-id>/edge-transport-nodes/<edge-transport-node-id>/state and check that "state" is set to "orphaned".
  3. Make sure you have removed the edge VM from vCenter, or run the "del nsx" nsxcli command on a bare-metal edge (BME).
    • Using the UI:
      • Click the Delete Failed state in the UI to open a pop-up.
      • Remove the edge by clicking Done, Remove from NSX.
    • Using the API:
      • To clean up a specific stale edge VM, use the following API:
      • DELETE https://<manager-ip>/policy/api/v1/infra/sites/<site-id>/enforcement-points/<enforcementpoint-id>/edge-transport-nodes/<edge-transport-node-id>?force=true
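
The conditions that lead up to the forced delete can be written down as a simple checklist function. This is a sketch of the reasoning in the steps above, not NSX behavior: the function name `unmet_preconditions` and its parameters are hypothetical, and "state" is read as a top-level key of the state response, as the steps describe it.

```python
def unmet_preconditions(minutes_stuck, state_body, removed_from_infra):
    """List the conditions from the steps that are not yet satisfied.

    minutes_stuck      -- how long the edge has shown Delete Failed
    state_body         -- parsed JSON from the .../state call
    removed_from_infra -- True once the VM is gone from vCenter, or
                          'del nsx' has been run on a bare-metal edge
    """
    unmet = []
    if minutes_stuck <= 15:
        unmet.append("edge has not been stuck for more than 15 minutes")
    if state_body.get("state") != "orphaned":
        unmet.append('"state" is not "orphaned"')
    if not removed_from_infra:
        unmet.append("edge VM not removed from vCenter / 'del nsx' not run")
    return unmet


# An empty list means the forced delete (force=true) is the next step.
ready = unmet_preconditions(20, {"state": "orphaned"}, True) == []
```

Returning the list of unmet conditions, rather than a bare boolean, tells the operator what to fix before retrying.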

Notes:

  1. Wait up to 5 minutes for stale entities to be removed. If the cleanup API has not removed the stale entries after that time, retry the API.
  2. Known issue, fixed in version 4.1.1: for BMEs, the cleanup API can sometimes return this error:

{ "httpStatus": "BAD_REQUEST", "error_code": 16077, "module_name": "FABRIC", "error_message": "[Fabric] Refresh <#Edge_UUID#> placement references failed." } 
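
The wait-and-retry advice in note 1 can be sketched as a small loop. The names `cleanup_with_retry` and `is_placement_refresh_error` are hypothetical, the limit of three attempts and the 30-second poll interval are assumptions (the notes only give the 5-minute wait), and the two callables stand in for whatever code calls the cleanup API and checks whether the stale entry is gone.

```python
import time


def cleanup_with_retry(call_cleanup, entry_is_gone, attempts=3,
                       wait_seconds=300, poll_seconds=30, sleep=time.sleep):
    """Call the cleanup API, wait up to wait_seconds, and retry if needed.

    Returns the attempt number that succeeded, or None if every attempt
    timed out.
    """
    for attempt in range(1, attempts + 1):
        call_cleanup()
        # Poll until the 5-minute window (by default) is used up.
        for _ in range(wait_seconds // poll_seconds):
            sleep(poll_seconds)
            if entry_is_gone():
                return attempt
    return None


def is_placement_refresh_error(error_body):
    """Recognize the known BME error from note 2 by its error code."""
    return error_body.get("error_code") == 16077
```

Passing `sleep` as a parameter keeps the loop testable without real delays; in normal use the default `time.sleep` applies.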



If the Edge VM still exists in the NSX UI inventory after following the workaround, check and then restart the nsx-proxy service on the host that was hosting the Edge VM:
/etc/init.d/nsx-proxy status
/etc/init.d/nsx-proxy restart

Additional Information

Enhancements in Versions 3.2.1.1 and 4.0.0.1

This KB article addresses challenges related to deleting powered-off, orphaned, or disconnected edge VMs, specifically those auto-deployed through NSX Manager.
Previously, as described in the "Issue/Introduction" section, the process did not delete the edge if it was unreachable, because of concerns over potential duplicate VTEP issues. With the latest enhancements, the updated behavior is as follows:

1. Standard deletion workflow: If the edge is reachable and the host switch configuration is cleared from the edge, it is safe to delete the edge VM from NSX and VC.

2. If the first step fails, primarily because of connectivity problems between the edge and the manager, the NSX inventory is examined for the edge VM's presence:

   a. If the Edge VM is in the NSX inventory:

      i. If the VM is found in vCenter (VC), the next step is to power it off, delete the VM, and release the associated VTEP resources.

      ii. If the VM is not found in VC (possibly because its MORef ID changed and NSX no longer recognizes it), an alarm is raised. In such cases, if the edge VM exists on VC with a different MORef ID, apply the provided workaround. The MORef ID can change if the edge VM is restored from backup, is removed and re-added to the VC inventory, or undergoes a vMotion operation.
          Note: NSX Manager cannot find Edge VMs deployed on an ESXi host that was not an NSX transport node before NSX-T 3.2, even if NSX-T is upgraded to NSX-T 3.2 or later after deployment. NSX Manager can find such Edge VMs if they were deployed in NSX-T 3.2 and deletion is invoked in NSX-T 3.2.4 or later.

      iii. If the VM does not exist in VC because a user deleted it manually, the workaround should still be used for edge cleanup. The challenge lies in distinguishing between a deleted VM and one in VC with a changed MORef ID.

   b. If the Edge VM is not in the NSX inventory: this might be due to NSX inventory discrepancies or the VM's deletion from VC. Refer to the workaround to delete the edge from NSX.
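
The branching described above can be summarized as a decision function. This is only a reading aid under stated assumptions: the enum `Outcome`, the function `deletion_outcome`, and its three boolean inputs are hypothetical names for the conditions in the list, and the function does not reflect NSX Manager's actual implementation.

```python
from enum import Enum


class Outcome(Enum):
    STANDARD_DELETE = "delete the edge VM from NSX and VC"
    POWER_OFF_AND_RELEASE = "power off, delete the VM, release VTEP resources"
    ALARM_THEN_WORKAROUND = "alarm raised; apply the workaround"
    WORKAROUND = "apply the workaround to delete the edge from NSX"


def deletion_outcome(reachable_and_cleared, in_nsx_inventory, found_in_vc):
    """Map the conditions from the workflow list to its outcomes."""
    if reachable_and_cleared:
        return Outcome.STANDARD_DELETE            # step 1
    if in_nsx_inventory:
        if found_in_vc:
            return Outcome.POWER_OFF_AND_RELEASE  # case 2.a.i
        # Cases 2.a.ii and 2.a.iii look the same from NSX: a changed MORef ID
        # and a manually deleted VM are both "not found in VC".
        return Outcome.ALARM_THEN_WORKAROUND
    return Outcome.WORKAROUND                     # case 2.b
```

Collapsing cases ii and iii into one branch mirrors the point made in the list: NSX cannot tell a deleted VM from one whose MORef ID changed, so both end in the manual workaround.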

Important Note

Should a network disconnection occur between the manager and edges, and if the host switch configuration along with the edge gets deleted from the manager, VTEP resources will be freed. The subsequently released IP might be allocated to a new edge from the IP pool. Such actions can produce duplicate edge IP addresses, creating serious datapath disruptions.

To avoid such scenarios, NSX Manager attempts to establish connectivity with the edge/VC prior to the edge VM's deletion. It's imperative to understand that if the manager can't access the Edge or VC, it can't deduce that the edge has been deleted, warranting user intervention.

Further, with these improvements, even if the edge turns unreachable for NSX Manager, it remains trackable through VC and NSX inventory due to its auto-deployment on VC. This facilitates edge identification and cleanup from VC when needed.


Impact/Risks:
Leaving stale edge VM entries in the VC inventory can disrupt the datapath.