Installing NSX on Host fails with "already exists" error or upgrade fails with "Failed to get Host status"

Products

VMware NSX

Issue/Introduction

Install Failure/Validation error: While preparing an ESXi host as a Transport Node, one of the following errors is seen:

Node <node-name> with same ip <###.###.###.###> already exists

Or

Error: General error has occurred. Discovered node with id:<Discovered Node ID:host-###> is already prepared having fabric node id:<Transport Node UUID>.

Or

Node <node_name> with same ip <ip_address> already exists. Check the state of existing transport node. If it is a failure in deletion then use force option to delete previous transport node and retry the operation.

Note: node_name is the last part of the resource path, which you can copy by clicking the hamburger menu on the host under System, Fabric, Hosts and selecting Copy path to Clipboard. For example, node_name is #######-########-9c7e-436b-8660-############host-## in the path /infra/sites/default/enforcement-points/default/host-transport-nodes/########-########-9c7e-436b-8660-############host-##.

Note: Transport Node UUID is the value copied when you click the hamburger menu on the host under System, Fabric, Hosts and select Copy ID to Clipboard. For the example above, it would be ########-9c7e-436b-8660-############
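
As a small illustration of the note above, the node_name can be taken from a copied policy path by keeping its last segment. This is only a sketch: the function name is hypothetical, and the example path is the masked sample used later in this article, not a real host.

```python
def node_name_from_path(policy_path):
    """Return the last segment of a host transport node policy path."""
    segments = [s for s in policy_path.strip().split("/") if s]
    # Guard against paths that do not point at a host transport node.
    if len(segments) < 2 or segments[-2] != "host-transport-nodes":
        raise ValueError(f"not a host transport node path: {policy_path!r}")
    return segments[-1]


# Masked sample path (placeholder characters kept as in the article).
example_path = (
    "/infra/sites/default/enforcement-points/default/host-transport-nodes/"
    "lvn-abc-00-abc12345-####-####-####-d4efea5e6123host-01"
)
print(node_name_from_path(example_path))
```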

Upgrade Failure: During host upgrade, the following error is seen:

Failed to get Host status for upgrade unit <Upgrade Unit ID> due to error Transport node <Transport Node UUID> not found
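
If you are working from saved error text, the Transport Node UUID can be pulled out of two of the messages above with a regular expression. The sketch below assumes the messages keep the wording shown in this article and that the UUID has the usual 8-4-4-4-12 hexadecimal form; the function name and the sample UUIDs are hypothetical.

```python
import re

# Standard 8-4-4-4-12 hexadecimal UUID layout (assumed format).
_UUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"

# Wording taken from the error messages quoted in this article.
_PATTERNS = [
    re.compile(rf"fabric node id:\s*({_UUID})"),
    re.compile(rf"Transport node\s+({_UUID})\s+not found"),
]


def transport_node_uuid(message):
    """Return the Transport Node UUID found in an error message, or None."""
    for pattern in _PATTERNS:
        match = pattern.search(message)
        if match:
            return match.group(1)
    return None


# Hypothetical sample messages with made-up identifiers.
upgrade_error = (
    "Failed to get Host status for upgrade unit "
    "11111111-2222-3333-4444-555555555555 due to error Transport node "
    "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee not found"
)
install_error = (
    "Error: General error has occurred. Discovered node with id:host-123 is "
    "already prepared having fabric node id:aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee."
)
print(transport_node_uuid(upgrade_error))
print(transport_node_uuid(install_error))
```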

An ESXi host was removed from vCenter without first removing it from NSX.
The Host is not visible under the Host and Cluster section of the UI: System, Fabric, Hosts, Clusters or Standalone.
Reinstalling the ESXi OS does not resolve the issue, as the IP or name already exists in NSX.
An upgrade pre-check may fail and can cause the upgrade process to pause.
Stale entries in NSX can cause SDDC Manager pre-checks to fail with validation errors.
Running either of the below Manager or Policy GET APIs, using the <Transport Node UUID> or <node_name> from the above errors, returns the following results:
- '/api/v1/transport-nodes/<Transport Node UUID>/state'
- '/policy/api/v1/infra/sites/default/enforcement-points/default/host-transport-nodes/<node_name>/state'
- Results:

  "node_deployment_state" : {
    "state" : "failed",
    "details" : [ {
      "sub_system_id" : "########-######-######-######-############",
      "state" : "failed",
      "failure_message" : "Failed to uninstall the software on host. Host OS version not found.\n",
      "failure_code" : 26020
    } ]
  },
  "deployment_progress_state" : {
    "progress" : 40,
    "current_step_title" : "Removing NSX bits"

Note: A transport node is a host prepared with NSX VIBs.
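
To read a state response like the one above more quickly, the relevant fields can be summarized in a few lines. This is a minimal sketch that only uses the field names visible in the sample output; the function name is hypothetical, and a real response may contain additional fields that are ignored here.

```python
def summarize_state(state):
    """Summarize the deployment fields of a transport node state response."""
    deployment = state.get("node_deployment_state", {})
    lines = [f"deployment state: {deployment.get('state', 'unknown')}"]
    for detail in deployment.get("details", []):
        message = (detail.get("failure_message") or "").strip()
        code = detail.get("failure_code")
        if message or code is not None:
            lines.append(f"failure {code}: {message}")
    progress = state.get("deployment_progress_state", {})
    if progress:
        lines.append(
            f"progress {progress.get('progress')}%: "
            f"{progress.get('current_step_title')}"
        )
    return lines


# Sample built from the response fragment shown in this article.
sample = {
    "node_deployment_state": {
        "state": "failed",
        "details": [
            {
                "state": "failed",
                "failure_message": "Failed to uninstall the software on host. "
                "Host OS version not found.\n",
                "failure_code": 26020,
            }
        ],
    },
    "deployment_progress_state": {
        "progress": 40,
        "current_step_title": "Removing NSX bits",
    },
}
for line in summarize_state(sample):
    print(line)
```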

Environment

VMware NSX 9.0.0

VMware NSX 4.x

VMware NSX-T Data Center 3.x

Cause

When an ESXi host is removed directly from vCenter without first removing NSX, entries for that host can remain in the NSX database.

The correct procedure to remove NSX from a single host:

Uninstall NSX Data Center from a Managed Host in a vSphere Cluster

Note: This is also referred to as a "Stale Host" and will prevent adding back a refreshed or new host using the same name.

Resolution

This issue has been resolved in NSX 4.2.3 and 9.0.1, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workarounds:

Environments may have reached this state by following different steps. Therefore, there are a number of possible workaround options, which should be tried in order.

Install Failure Prerequisite

Before proceeding with the following options, move the host that has failed to install/upgrade out of the vSphere cluster and make it a standalone host in vSphere.

Then proceed to go through the below Options in order from 1 to 4.

Upgrade Failure Prerequisite

No action is required. Proceed through the below Options in order from 1 to 4.

To move the host out of the vSphere cluster:

Make sure the host is in maintenance mode, then select the host in vCenter.
Move it to the datacenter level.
In the NSX UI, the host will now show up under System, Fabric, Hosts, Other Nodes.

Option 1:

Click Uninstall failed on the host to open a popup.
Click View errors to see the details in the next popup.
The error message will vary depending on the actual root cause.
Click Cancel on the popup, select the host by clicking the check box beside it, and then click REMOVE NSX.
Check the Force Delete check box on the window which pops up and then click DELETE.
Wait for the Transport Node deletion to start.
If the deletion is successful, the host will show as Not Configured.
Once the force delete is complete, the ESXi host can be re-added to the vSphere cluster. If a Transport Node Profile (TNP) is applied, the host will be prepared for NSX again.

Note:
If the Host is not present in the NSX UI, under System, Fabric, Hosts, in either Clusters/Other Nodes/Standalone, please proceed to Option 2. If you have already completed Option 2, and the issue is still present after retrying Option 1 after reindexing, please proceed to Option 3.

Option 2:

In some cases, the host may not appear in the NSX UI due to search indexing failures.

On all three NSX manager nodes, log in as the admin user and run the following three commands. Ensure each resync/re-indexing is completed before running the next command.
For more information, please refer to KB article NSX Manager UI displays notification "Failed to fetch System details. Please contact the administrator. Error: null (Error code: 513002)".
start search resync policy
start search resync manager
start search resync telemetry
After you run the above commands, please allow some time for the reindexing to complete. The time required depends on the size of the environment; allow at least 10 minutes.
Note: While reindexing is in progress, the NSX UI may display an indexing error that asks you to try again later. This is expected.
If the host is still listed, please refresh indexing using the following command:
start search resync all
Please allow some time for re-indexing to complete; depending on the size of the environment, this may take a while. During reindexing, the NSX UI will show indexing notifications.
Once the reindexing is complete, after at least 10 minutes, go back to Option 1 and follow the steps there again.

Option 3 (using Manager API call):

If you have completed Option 1 and Option 2, and the host still does not appear on the NSX UI, to allow removal, the following API steps can be used to remove the transport node.

Run the following API call:
GET https://<NSX Mgr IP>/api/v1/transport-nodes/<Transport Node UUID>/state
Note: Replace <Transport Node UUID> with the Transport Node UUID, as reported in the error message (see Issue/Introduction section).
If the state value in the API response is not "Object Not found", proceed to Option 3 Step 3 (the DELETE call below).
Note: The state value should be "Object Not found" once the host is successfully removed.
For NSX-T 3.2.x and 4.x, run the following API call:

DELETE https://<NSX Mgr IP>/api/v1/transport-nodes/<Transport Node UUID>?force=true&unprepare_host=false
Note: Replace <Transport Node UUID> with the Transport Node UUID, as reported in the error message (see Issue/Introduction section).

Wait five minutes, then run the GET transport node state command again, as seen in Option 3 Step 1 periodically until "Object Not found" is returned.
Once GET API returns "Object Not found", move the host back into the original cluster to prepare it for NSX. If a Transport Node Profile is applied, host preparation should start automatically. Otherwise, proceed and prepare the transport node as before.
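
The Option 3 sequence (check state, force delete, then re-check every five minutes) can be sketched in Python as follows. This is an illustration under assumptions, not a supported tool: all function names are hypothetical, HTTP basic authentication is assumed, and how "Object Not found" appears in a response (treated here as HTTP 404 or that phrase in the body) should be confirmed in your environment.

```python
import base64
import ssl
import time
import urllib.error
import urllib.request


def manager_state_url(manager, tn_uuid):
    return f"https://{manager}/api/v1/transport-nodes/{tn_uuid}/state"


def manager_delete_url(manager, tn_uuid):
    return (f"https://{manager}/api/v1/transport-nodes/{tn_uuid}"
            "?force=true&unprepare_host=false")


def send(method, url, user, password, verify_tls=True):
    """Send one request and return (status_code, body_text)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, method=method, headers={"Authorization": f"Basic {token}"})
    context = ssl.create_default_context()
    if not verify_tls:  # only for managers with self-signed certificates
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(request, timeout=30, context=context) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")


def is_removed(status_code, body):
    # Assumption: removal shows up as HTTP 404 or as "Object Not found" text.
    return status_code == 404 or "object not found" in body.lower()


def wait_until_removed(get_state, interval_s=300, max_checks=12, sleep=time.sleep):
    """Poll get_state() until the node is gone; return the attempt number or None."""
    for attempt in range(1, max_checks + 1):
        status_code, body = get_state()
        if is_removed(status_code, body):
            return attempt
        if attempt < max_checks:
            sleep(interval_s)
    return None


# Example wiring (not executed here; values are placeholders):
# url = manager_state_url("nsx-mgr.example.com", "<Transport Node UUID>")
# send("DELETE", manager_delete_url("nsx-mgr.example.com", "<Transport Node UUID>"), "admin", "...")
# wait_until_removed(lambda: send("GET", url, "admin", "..."))
```

The polling function takes the GET as a callable so the five-minute interval and the stop condition can be checked without contacting a manager.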

Option 4 (using Policy API call):

If completing Option 3 did not clear the transport node, you can use the following:

Run the following API call:

GET https://<NSX Mgr IP>/policy/api/v1/infra/sites/default/enforcement-points/default/host-transport-nodes/<node_name>/state

Note: Replace <node_name> with the node ID you received in the error (see Issue/Introduction section).
If the state value in the API response is not "Object Not found", proceed to Option 4 Step 3 (the DELETE call below).
Note: The state value should be "Object Not found" once the host is successfully removed.
For NSX-T 3.2.x and 4.x, run the following API call:

DELETE https://<NSX Mgr IP>/policy/api/v1/infra/sites/default/enforcement-points/default/host-transport-nodes/<node_name>?force=true&unprepare_host=false
Note: Replace <node_name> with the node ID you received in the error (see Issue/Introduction section).
Wait five minutes, then run the GET transport node state command again, as seen in Option 4 Step 1 periodically until "Object Not found" is returned.
Once GET API returns "Object Not found", move the host back into the original cluster to prepare it for NSX. If a Transport Node Profile is applied, host preparation should start automatically. Otherwise, proceed and prepare the transport node as before.
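
For the Policy API variant, the calls differ from Option 3 only in the URL, which is keyed by node_name rather than by UUID. The sketch below lists the calls in the order described above; the function names are hypothetical, the state URL is the Policy state path given in the Issue/Introduction section, and the example manager address and node_name are placeholders.

```python
from urllib.parse import quote

POLICY_BASE = ("/policy/api/v1/infra/sites/default/enforcement-points/"
               "default/host-transport-nodes")


def policy_state_url(manager, node_name):
    # quote() keeps an unusual node_name from breaking the URL path.
    return f"https://{manager}{POLICY_BASE}/{quote(node_name, safe='')}/state"


def policy_delete_url(manager, node_name):
    return (f"https://{manager}{POLICY_BASE}/{quote(node_name, safe='')}"
            "?force=true&unprepare_host=false")


def policy_call_sequence(manager, node_name):
    """Return the (method, url) pairs for Option 4 in execution order."""
    state = policy_state_url(manager, node_name)
    return [
        ("GET", state),                                    # check current state
        ("DELETE", policy_delete_url(manager, node_name)),  # force delete
        ("GET", state),                                    # repeat until not found
    ]


for method, url in policy_call_sequence("nsx-mgr.example.com", "example-node-host-01"):
    print(method, url)
```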

Security Only Clusters

If the cluster is Security Only, was prepared using vLCM, or has Service Insertion installed, it is not possible to detach the transport node profile.

In such circumstances, to uninstall NSX, either remove NSX from the whole cluster or move the single host out of the NSX prepared cluster to the datacenter level in vCenter.

Once the host is no longer part of an NSX prepared cluster, NSX can be removed using the NSX GUI.

Note: If a vLCM cluster is used and VIBs remain on the host after moving it out of the cluster, you may need to remove the VIBs manually from the host CLI with the command "nsxcli -c del nsx" and then reboot the host.

Note: If you use curl to run an API call whose URL contains '?', enclose the full URL in quotes.

Note: In all of the above API calls, replace <NSX Mgr IP> with the IP address or FQDN of an NSX manager node.
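
The quoting rule for curl can be illustrated by letting Python's shlex module decide when quotes are needed. This is only a sketch: the function name and the example manager address and UUID are hypothetical, and the -k option (skip certificate verification) is an assumption for managers with self-signed certificates.

```python
import shlex


def curl_command(method, url, user="admin"):
    """Build a shell-safe curl command line for an NSX API call."""
    # With only a user name given to -u, curl prompts for the password.
    return shlex.join(["curl", "-k", "-u", user, "-X", method, url])


base = "https://nsx-mgr.example.com/api/v1/transport-nodes/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"

# No '?' in the URL: no quotes are required.
print(curl_command("GET", base + "/state"))

# '?' and '&' in the URL: the whole URL is wrapped in quotes.
print(curl_command("DELETE", base + "?force=true&unprepare_host=false"))
```

Without the quotes, the shell would treat '&' as a command separator and drop the unprepare_host parameter from the request.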

Option 5 (scripted database cleanup):

If the issue persists after you have followed the steps in Options 1 to 4 above, or if the UI/API based approach is not suitable for your environment, you can use the scripts attached to this KB article to do the cleanup:

Note the name of the Host Transport Node in the problematic state.
Download all scripts attached to this KB article.
Use WinSCP or a similar tool to copy the scripts to the /image directory on one NSX Manager (any one node in the cluster).
SSH as root to the node where the scripts were copied in the previous step.
Change the current working directory to /image:
cd /image
Run the script:
python cleanup_stale_tn_wrapper.py --tn-paths <Comma separated list of full policy path of TN> --password <NSX MP password>
e.g.
python cleanup_stale_tn_wrapper.py --tn-paths /infra/sites/default/enforcement-points/default/host-transport-nodes/lvn-abc-00-abc12345-####-####-####-d4efea5e6123host-01 --password "example_password_01"
Once run, the script will delete the old record. If a host with the same IP address is now added to a cluster that has a Transport Node Profile, the script will also trigger transport node creation.
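
When several transport nodes need cleaning up, the --tn-paths value has to be a single comma-separated string. The sketch below only assembles the argument list for the command shown above and does not reproduce what the attached script does; the helper name and the sample paths are hypothetical.

```python
def cleanup_args(tn_paths, password):
    """Build the argument list for cleanup_stale_tn_wrapper.py."""
    cleaned = [path.strip() for path in tn_paths if path.strip()]
    if not cleaned:
        raise ValueError("at least one transport node policy path is required")
    for path in cleaned:
        # A comma inside a path would be read as a list separator.
        if "," in path:
            raise ValueError(f"path must not contain a comma: {path!r}")
    return [
        "python", "cleanup_stale_tn_wrapper.py",
        "--tn-paths", ",".join(cleaned),
        "--password", password,
    ]


# Hypothetical paths for two stale hosts.
base = "/infra/sites/default/enforcement-points/default/host-transport-nodes/"
args = cleanup_args([base + "example-node-host-01", base + "example-node-host-02"],
                    "example_password_01")
print(args)
```

Passing the result as a list to subprocess.run from the /image directory avoids shell quoting issues with the password.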

If none of the options have resolved the issue, please collect the information outlined in the Additional Information section below and open a technical support case with Broadcom Support for further investigation and refer to this KB article.

For more information, see Creating and managing Broadcom support cases.

Additional Information

If you are contacting Broadcom support about this issue, in order to aid a timely response and resolution, please provide the following:

NSX version.
Was the issue encountered during an upgrade or install?
Were all workaround options completed? If not, which options were not completed, and what reason or issue prevented their completion?
NSX Manager log bundles.
ESXi host log bundles for hosts that are failing to configure as transport nodes.
Text and screenshots of any error messages seen in the NSX GUI or command line that are pertinent to the investigation.

Handling Log Bundles for offline review with Broadcom support

Attachments

prepare_host_transport_nodes_delete_data.py

cleanup_stale_tn_wrapper.py

cleanup_host_transport_nodes.py