NSX Host or Edge Transport Nodes have a status of "Unknown" and Tunnels are "Not Available" or "Validation Errors"

Products

VMware NSX

Issue/Introduction

Transport Nodes, either Edge or Host, show a status of Unknown, or the NSX configuration state shows Validation Errors, both in the UI and when queried via the API:

To get a list of all Transport Nodes and their IDs
GET https://<NSX_MANAGER>/api/v1/transport-nodes/

To get the status of a specific Transport Node
GET https://<NSX_MANAGER>/api/v1/transport-nodes/<tn-id>/status
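
The two requests above can be issued with curl. The following is a minimal sketch, not an official tool: the NSX_MANAGER and NSX_USER variables and the function names are assumptions for illustration, and -k (which skips certificate verification) is only appropriate where the Manager uses a self-signed certificate. curl prompts for the password because only the user name is passed to -u.

```shell
#!/usr/bin/env bash
# Sketch only. NSX_MANAGER and NSX_USER are placeholders (assumptions).
NSX_MANAGER="${NSX_MANAGER:-nsx-manager.example.com}"
NSX_USER="${NSX_USER:-admin}"

# Print the status URL for one Transport Node ID.
tn_status_url() {
  printf 'https://%s/api/v1/transport-nodes/%s/status\n' "$NSX_MANAGER" "$1"
}

# List all Transport Nodes; the JSON response contains each node's ID.
list_transport_nodes() {
  curl -fsS -k -u "$NSX_USER" "https://${NSX_MANAGER}/api/v1/transport-nodes/"
}

# Fetch the status of a single Transport Node by ID.
get_tn_status() {
  curl -fsS -k -u "$NSX_USER" "$(tn_status_url "$1")"
}
```

After sourcing the file, run list_transport_nodes to find the ID, then get_tn_status <tn-id> for the affected node.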

The status of Tunnels may show Not Available in the UI.
There is no reported data plane impact.
The NSX Manager log /var/log/proton/nsxapi.log has a log entry similar to this example:

"The requested object : TransportZoneProfile/<TransportZoneProfile-ID> could not be found."

The TZP UUID printed in the log entry above is not returned when querying the list of Transport Zone Profiles:
GET https://<NSX_MANAGER>/api/v1/transportzone-profiles?include_system_owned=true
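
The comparison between the log and the API response can be scripted. The sketch below assumes the TZP ID in the log is a 36-character UUID and that the JSON response of the GET request above has been saved to a file; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes the TZP ID in the log is a 36-character UUID.

# Print the unique TZP IDs reported as "could not be found" in the log ($1).
extract_missing_tzp_ids() {
  grep -oE 'TransportZoneProfile/[0-9a-fA-F-]{36} could not be found' "$1" \
    | cut -d/ -f2 | cut -d' ' -f1 | sort -u
}

# Return 0 if the TZP ID ($1) appears as an "id" value in the saved
# JSON response ($2) of the transportzone-profiles GET request.
tzp_exists() {
  grep -qE "\"id\"[[:space:]]*:[[:space:]]*\"$1\"" "$2"
}

# Print each ID from the log and whether the API listing contains it.
report_missing_tzps() {
  local log="$1" listing="$2" id
  for id in $(extract_missing_tzp_ids "$log"); do
    if tzp_exists "$id" "$listing"; then
      echo "$id present"
    else
      echo "$id missing"
    fi
  done
}
```

For example, report_missing_tzps /var/log/proton/nsxapi.log tzp-list.json, where tzp-list.json is a hypothetical file name for the saved response. An ID reported as missing matches the symptom described here.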

In the NSX UI, the status of the TN in the cluster is Validation Errors.
Clicking the error may display "600: The requested object: ####-##-## could not be found. Object identifiers are case sensitive."
The UUID cannot be found using the NSX or vCenter (VC) global search.
The NSX Manager log /var/log/search/elasticsearch_index_indexing_slowlog.log has an entry similar to this example:

[{"resource_type":"BfdHealthMonitoringProfile","profile_id":"UUID_in_Validation_Error"}]}],"vmk_install_migration":[],"pnics_uninstall_migration":[],"vmk_uninstall_migration":[],"not_ready":false}],"resource_type":"StandardHost]

Environment

VMware NSX-T Data Center

VMware NSX

Cause

The status of the Transport Nodes (TNs) is Unknown because the Transport-Zone Profile (TZP) they reference does not exist. The referenced TZP can be missing from the system for multiple reasons:

In previous versions of NSX-T, it was possible in policy mode to update or delete a Transport-Zone Profile (TZP) even if it was in use by a Transport Node (TN) or Transport-Node Profile (TNP).
In some cases, the vRNI tool has been observed to create a Transport-Zone Profile (TZP). When the customer removes the vRNI tool, the TZP is deleted as well, which leaves the Transport Node (TN) in an Unknown state. Validation has been added to avoid this issue.

Resolution

To resolve this issue for Edge nodes, please open a support case with Broadcom support.

To resolve this issue for ESXi hosts please follow these steps:

1. Take a new SFTP-based backup and ensure the backup passphrase is known before proceeding.

2. Copy the attached logical-migration.jar file to one of the Managers and place it in the directory /opt/vmware/upgrade-coordinator-tomcat/temp/.

3. Stop proton on all three Manager nodes from the root shell:
# service proton stop

or

# /etc/init.d/proton stop

4. On the NSX Manager where the jar file was copied, run the following command as a single line with no line breaks. Replace ENTER_ADMIN_PASSWORD_HERE with the admin password of the NSX Manager:

# java -Dcorfu-property-file-path=/opt/vmware/upgrade-coordinator-tomcat/conf/ufo-factory.properties -Djava.io.tmpdir=/opt/vmware/upgrade-coordinator-tomcat/temp -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Dlog4j.configurationFile=/opt/vmware/upgrade-coordinator-tomcat/conf/log4j2.xml -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/opt/vmware/upgrade-coordinator-tomcat/conf/logging.properties -Dnsx-service-type=nsx-manager -DTransportZoneProfileRectifierInTNAndTNP.userName=admin -DTransportZoneProfileRectifierInTNAndTNP.password='ENTER_ADMIN_PASSWORD_HERE' -DTransportZoneProfileRectifierInTNAndTNP.updateTn=true -DTransportZoneProfileRectifierInTNAndTNP.updateTzp=true -cp /opt/vmware/upgrade-coordinator-tomcat/temp/logical-migration.jar com.vmware.nsx.management.migration.impl.TransportZoneProfileRectifierInTNAndTNP

5. Set file ownership:
# chown uuc:uuc /var/log/upgrade-coordinator/upgrade-coordinator*log*

6. The procedure is complete once the following text is printed in the upgrade-coordinator log file:
# grep "Migration task finished" /var/log/upgrade-coordinator/upgrade-coordinator.log

7. Start proton on all three Manager nodes:
# service proton start

or

# /etc/init.d/proton start

8. Execute the following from NSXCLI on all the NSX manager nodes so that corfudb and Search indexes are in sync:
> start search resync policy
> start search resync manager

9. Log in to the NSX UI and validate that the host status is resolved.
In some cases it may be necessary to detach and reattach the TNP on the impacted cluster to fully resolve the issue.
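
The completion check in step 6 can be turned into a polling loop instead of running grep by hand. This is a minimal sketch: the function name and the timeout and interval arguments are assumptions, and only the log path and the search text come from the steps above. If the log already contains the message from an earlier run, the check succeeds immediately, so confirm the timestamp of the matching line.

```shell
#!/usr/bin/env bash
# Sketch only. Polls the upgrade-coordinator log for the completion message.
# Arguments (all optional): log path, timeout in seconds, poll interval.
wait_for_migration() {
  local log="${1:-/var/log/upgrade-coordinator/upgrade-coordinator.log}"
  local timeout="${2:-600}" interval="${3:-10}" waited=0
  until grep -q "Migration task finished" "$log" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out waiting for completion message in $log" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "Migration task finished"
}
```

Run wait_for_migration from the root shell of the Manager where the jar file was executed, and start proton (step 7) only after it returns successfully.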

If Service VMs are deployed in the cluster that the affected host transport nodes belong to, detaching the TNP fails with the error "Error: Cluster ########-####-####-####-########bed1:domain-c10 has NSX managed service VM deployed or deployment is in progress. Delete these deployment, before deleting TN. (Error code: 26173)". In that scenario, instead of detaching and re-attaching the TNP, follow these steps:

Place the host into maintenance mode.
Run the API "GET https://<NSX_MANAGER>/api/v1/transport-nodes/<tn-id>" to retrieve the payload.
Using the same payload, run the API "PUT https://<NSX_MANAGER>/api/v1/transport-nodes/<tn-id>".
Wait a couple of minutes, then check the configuration state of the host.
Check the status of the tunnels and the overall status of the host as reported by the Manager.
Repeat the steps for all the affected hosts.
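
The GET and PUT steps above can be sketched with curl as follows. This is an illustration under stated assumptions, not a supported script: NSX_MANAGER, NSX_USER, and the function names are placeholders, -k skips certificate verification, and curl prompts for the password on each request. The host must already be in maintenance mode before the function is called.

```shell
#!/usr/bin/env bash
# Sketch only. NSX_MANAGER and NSX_USER are placeholders (assumptions).
NSX_MANAGER="${NSX_MANAGER:-nsx-manager.example.com}"
NSX_USER="${NSX_USER:-admin}"

# Print the URL of one Transport Node.
tn_url() {
  printf 'https://%s/api/v1/transport-nodes/%s\n' "$NSX_MANAGER" "$1"
}

# GET the Transport Node payload, then PUT the same payload back unchanged.
resubmit_transport_node() {
  local tn_id="$1" payload rc=0
  payload="$(mktemp)" || return 1
  # -f makes curl fail on HTTP errors, so an error body is never PUT back.
  curl -fsS -k -u "$NSX_USER" -o "$payload" "$(tn_url "$tn_id")" \
    && curl -fsS -k -u "$NSX_USER" -X PUT \
         -H 'Content-Type: application/json' \
         --data-binary "@${payload}" "$(tn_url "$tn_id")" \
    || rc=$?
  rm -f "$payload"
  return "$rc"
}
```

Call resubmit_transport_node <tn-id> once per affected host, then check the configuration state in the UI as described above.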

Attachments

logical-migration.jar