Installing or Upgrading NSX on an ESXi host fails reporting the node already exists

Products

VMware NSX

Issue/Introduction

Install Failure/Validation error: while preparing an ESXi host as a Transport Node, one of the following errors is seen:

Node<node-name> with same ip <###.###.###.###> already exists

or

Error: General error has occurred. Discovered node with id:<Discovered Node ID:host-###> is already prepared having fabric node id:<Transport Node UUID>.

or

A similar warning is displayed under Fabric > Hosts in the NSX UI.

Upgrade Failure: during host upgrade the following error is seen:

Failed to get Host status for upgrade unit <Upgrade Unit ID> due to error Transport node <Transport Node UUID> not found
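The Transport Node UUID embedded in these messages is needed later for the API workaround. As a minimal sketch, assuming the messages follow the wording shown above, the UUID could be pulled out with a regular expression. The function name and the sample UUID in the usage note are illustrative, not part of NSX.

```python
import re

# Standard 8-4-4-4-12 hexadecimal UUID layout.
UUID = r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"

# Patterns based on the two message wordings quoted in this article.
# Assumption: real messages match this wording closely.
PATTERNS = [
    re.compile(r"fabric node id:\s*(" + UUID + r")"),
    re.compile(r"Transport node\s+(" + UUID + r")\s+not found"),
]


def extract_transport_node_uuid(message):
    """Return the Transport Node UUID found in an error message, or None."""
    for pattern in PATTERNS:
        match = pattern.search(message)
        if match:
            return match.group(1)
    return None
```

For example, calling extract_transport_node_uuid() on an upgrade failure message that ends in "Transport node 11111111-25d7-4ff4-ba26-222222222222 not found" (a made-up UUID) would return that UUID string.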

An ESXi host was removed from vCenter without first removing it from NSX.
The Host is not visible under the Host and Cluster section of the UI: System > Fabric > Hosts > Clusters or Standalone.
Reinstalling the ESXi OS does not resolve the issue, as the IP or name already exists in NSX.
An upgrade pre-check may fail and can cause the upgrade process to pause.
Running the GET API '/api/v1/transport-nodes/########-25d7-4ff4-ba26-############/state' reveals the following results:
  "node_deployment_state" : {
    "state" : "failed",
    "details" : [ {
      "sub_system_id" : "########-25d7-4ff4-ba26-############",
      "state" : "failed",
      "failure_message" : "Failed to uninstall the software on host. Host OS version not found.\n",
      "failure_code" : 26020
    } ]
  },
  "deployment_progress_state" : {
    "progress" : 40,
    "current_step_title" : "Removing NSX bits"

Note: A transport node is a host prepared with NSX VIBs.
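When checking several hosts, it can help to reduce the state response shown above to the few fields that matter. The following is a minimal sketch, assuming the response has already been parsed into a Python dictionary and uses the field names from the excerpt above. The function name is illustrative.

```python
def summarize_deployment_state(state):
    """Condense a transport node state response into its key fields.

    `state` is the parsed JSON body of
    GET /api/v1/transport-nodes/<UUID>/state.
    Missing fields are reported as None rather than raising an error.
    """
    deployment = state.get("node_deployment_state", {})
    progress = state.get("deployment_progress_state", {})
    # Collect (failure_code, failure_message) for every failed sub-system.
    failures = [
        (detail.get("failure_code"), (detail.get("failure_message") or "").strip())
        for detail in deployment.get("details", [])
        if detail.get("state") == "failed"
    ]
    return {
        "state": deployment.get("state"),
        "failures": failures,
        "progress": progress.get("progress"),
        "current_step": progress.get("current_step_title"),
    }
```

Applied to the excerpt above, the summary would report the state "failed", failure code 26020 with its message, progress 40, and the current step "Removing NSX bits".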

Environment

VMware NSX 4.x
VMware NSX-T Data Center 3.x

Cause

When an ESXi host is removed directly from vCenter without first removing NSX, it can result in entries for that host remaining in the NSX database.

The correct procedure to remove NSX from a single host:

Uninstall NSX Data Center from a Managed Host in a vSphere Cluster

If the cluster is Security Only, was prepared using vLCM, or has Service Insertion installed, it is not possible to detach the transport node profile.

In such circumstances, to uninstall NSX, either remove NSX from the whole cluster or move the single host out of the NSX prepared cluster to the datacenter level in vCenter.

Once the host is no longer part of an NSX prepared cluster, NSX can be removed using the NSX GUI.

Resolution

This is a known issue impacting VMware NSX.

Workaround:

Environments may have reached this state by following different steps. There are therefore several possible workaround options, which should be tried in order.

Install Failure Prerequisite

Before proceeding with the following options, move the host that has failed to install/upgrade out of the vSphere cluster and make it a standalone host in vSphere.

Then proceed to go through the below options in order from 1 to 3.

Upgrade Failure Prerequisite

No action is required. Proceed through the options below in order, 1 to 3.

Option 1

In the NSX UI, check if you can find the impacted Host on the following pages:

System > Fabric > Hosts > Cluster

System > Fabric > Hosts > Other hosts

System > Fabric > Hosts > Standalone
If the ESXi host is present here, select it and click Delete NSX and select Force Delete.
Once the force delete is complete, the ESXi host can now be re-added to vCenter.
If the Host is not present, proceed to Option 2. If you have already completed Option 2, are retrying Option 1 after reindexing, and the issue is still not resolved, proceed to Option 3.

Option 2

In some cases, the host may not appear in the NSX UI due to search indexing failures.

On all three NSX manager nodes, log in as the admin user and run the following two commands:

start search resync policy

start search resync manager
After running the above commands, allow time for the reindexing to complete. The duration depends on the size of the environment; allow at least 10 minutes.
Note: While reindexing is in progress, the NSX UI may display an indexing-related error asking you to try again later. This is expected while the indexing is running.
Once the reindexing is complete, after at least 10 minutes, go back to Option 1 and follow the steps there again.

Option 3

If you have completed Options 1 and 2 and the host still does not appear in the NSX UI, the following API steps can be used to remove the transport node.

1. Run the following API call: "GET https://<NSX Mgr IP>/api/v1/transport-nodes/<UUID>/state"
Note: Replace <UUID> with the Transport Node UUID, as reported in the error message (see Issue/Introduction section).
Replace <NSX Mgr IP> with the IP address or FQDN of an NSX manager node.
2. If the state value in the API response is not "Object Not found", proceed to step 3.
Note: The state value should be "Object Not found" when the host has been successfully removed.
3. For NSX-T 3.2.x and 4.x, run the following API call: "DELETE https://<NSX Mgr IP>/api/v1/transport-nodes/<UUID>?force=true&unprepare_host=false"
Note: Replace <UUID> with the Transport Node UUID, as reported in the error message (see Issue/Introduction section).
Replace <NSX Mgr IP> with the IP address or FQDN of an NSX manager node.
4. Wait five minutes, then run the GET transport node state call from step 1 again. Repeat it periodically until "Object Not found" is returned.
5. Once the GET API returns "Object Not found", move the host back into the original cluster to prepare it for NSX. If a Transport Node Profile is applied, host preparation should start automatically. Otherwise, proceed and prepare the transport node as before.
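The Option 3 steps above can be sketched as a small Python script. This is an illustration under stated assumptions, not an official tool: the function names are hypothetical, authentication and certificate handling are left to the caller, and the check for a removed node assumes the manager answers with HTTP 404 or a body containing "Object Not found".

```python
import time
import urllib.error
import urllib.parse
import urllib.request


def state_url(manager, node_uuid):
    """URL for the GET transport node state call."""
    node = urllib.parse.quote(node_uuid)
    return f"https://{manager}/api/v1/transport-nodes/{node}/state"


def force_delete_url(manager, node_uuid):
    """URL for the forced DELETE that leaves the host itself untouched."""
    node = urllib.parse.quote(node_uuid)
    query = urllib.parse.urlencode({"force": "true", "unprepare_host": "false"})
    return f"https://{manager}/api/v1/transport-nodes/{node}?{query}"


def call(method, url, headers=None):
    """Send one request and return (status, body); HTTP errors are returned too.

    Assumption: the caller supplies any authentication headers.
    """
    request = urllib.request.Request(url, method=method, headers=headers or {})
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8", "replace")


def is_gone(status, body):
    # Assumption: a removed node yields HTTP 404 or "Object Not found" text.
    return status == 404 or "object not found" in body.lower()


def remove_transport_node(manager, node_uuid, send=call, wait=time.sleep,
                          interval=300, max_polls=12):
    """Follow the Option 3 steps; return True once the node is reported gone."""
    status, body = send("GET", state_url(manager, node_uuid))
    if is_gone(status, body):
        return True  # nothing left to delete
    send("DELETE", force_delete_url(manager, node_uuid))
    for _ in range(max_polls):
        wait(interval)  # five minutes by default
        status, body = send("GET", state_url(manager, node_uuid))
        if is_gone(status, body):
            return True
    return False
```

The `send` and `wait` parameters exist so that the flow can be exercised without a live NSX manager, for example by passing a stub that returns canned responses. In a real environment, `send` would need to wrap `call` with valid credentials.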

If none of the options resolve the issue, collect the information outlined in the Additional Information section below and open a technical support case with Broadcom Support for further investigation, referring to this KB article.

For more information, see Creating and managing Broadcom support cases.

Additional Information

If you are contacting Broadcom support about this issue, in order to aid a timely response and resolution, please provide the following:

NSX version.
Whether the issue was encountered during an upgrade or an install.
Whether all workaround options were completed and, if not, which options were not completed and why, or what issue prevented their completion.
NSX Manager log bundles.
ESXi host log bundles for hosts that are failing to configure as transport nodes.
Text and screenshots of any error messages seen in the NSX GUI or on the command line that are pertinent to the investigation.

Handling Log Bundles for offline review with Broadcom support