NSX Host or Edge Transport Nodes have a status of "Unknown" and Tunnels are "Not Available" or "Validation Errors"

Article ID: 324194

Products

VMware NSX
VMware NSX-T Data Center

Issue/Introduction

  • Transport Nodes (Edge or Host) are in an Unknown state, or the NSX configuration state shows Validation Errors, both in the UI and when queried via the API:
    To get a list of all Transport Nodes and their IDs:
    GET https://<NSX_MANAGER>/api/v1/transport-nodes/

    To get the status of a specific Transport Node:
    GET https://<NSX_MANAGER>/api/v1/transport-nodes/<tn-id>/status
  • The status of Tunnels may show "Not Available" in the UI.
  • There is no reported data plane impact.
  • The NSX manager log /var/log/proton/nsxapi.log has a log entry similar to this example:
         "The requested object : TransportZoneProfile/<TransportZoneProfile-ID> could not be found."
  • The Transport-Zone Profile (TZP) UUID printed in the log entry above is not returned when the Transport-Zone Profiles are queried:
    GET https://<NSX_MANAGER>/api/v1/transportzone-profiles?include_system_owned=true
  • In the NSX UI, the Transport Node (TN) status shown for the cluster is Validation Errors.
  • Clicking the error displays a message similar to "600: The requested object: ####-##-## could not be found. Object identifiers are case sensitive."
  • The UUID cannot be found through global search in NSX or vCenter (VC).
  • The NSX manager log /var/log/search/elasticsearch_index_indexing_slowlog.log has an entry similar to this example:

    [{"resource_type":"BfdHealthMonitoringProfile","profile_id":"UUID_in_Validation_Error"}]}],"vmk_install_migration":[],"pnics_uninstall_migration":[],"vmk_uninstall_migration":[],"not_ready":false}],"resource_type":"StandardHost]
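The two status queries listed above can be combined into one check across all Transport Nodes. The following is a minimal sketch, not a supported tool. It assumes that the list response carries its entries under "results" with "id" and "display_name" fields, that the status response has a top-level "status" field, and that the Manager accepts HTTP basic authentication. It does not follow pagination cursors. The helper names make_getter and unknown_transport_nodes are hypothetical.

```python
import json
import ssl
import urllib.request
from base64 import b64encode


def make_getter(manager, user, password, verify_tls=True):
    """Return a get(path) callable that fetches JSON from https://<manager><path>."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Lab use only: skips certificate checks for self-signed certificates.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    # Assumption: basic authentication with an NSX Manager account.
    token = b64encode(f"{user}:{password}".encode()).decode()

    def get(path):
        request = urllib.request.Request(
            f"https://{manager}{path}",
            headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
        )
        with urllib.request.urlopen(request, context=context) as response:
            return json.load(response)

    return get


def unknown_transport_nodes(get, wanted="UNKNOWN"):
    """Return [(id, display_name)] for Transport Nodes whose status equals `wanted`.

    `get` is any callable that takes an API path and returns the parsed JSON body.
    Field names ("results", "id", "display_name", "status") are assumptions.
    """
    listing = get("/api/v1/transport-nodes/")
    matches = []
    for node in listing.get("results", []):
        tn_id = node.get("id")
        status = get(f"/api/v1/transport-nodes/{tn_id}/status")
        if str(status.get("status", "")).upper() == wanted:
            matches.append((tn_id, node.get("display_name")))
    return matches
```

For example, unknown_transport_nodes(make_getter("<NSX_MANAGER>", "admin", password)) would return the IDs and names of the nodes to investigate. Passing the getter in as a parameter keeps the filtering logic usable against saved API responses as well.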

Environment

VMware NSX-T Data Center
VMware NSX

Cause

The status of the Transport Nodes (TNs) is Unknown because the Transport-Zone Profile (TZP) that they reference does not exist. The referenced TZP can be missing from the system for several reasons:

  • In previous versions of NSX, it was possible in policy mode to update or delete a Transport-Zone Profile (TZP) even if it was in use by a Transport Node (TN) or Transport-Node Profile (TNP).
  • In some cases, it has been observed that Aria Operations for Networks creates a Transport-Zone Profile (TZP), and when Aria Operations for Networks is removed, the TZP is deleted with it, leaving the Transport Node (TN) in an Unknown state. Validation has been added in newer versions to avoid this issue.
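The dangling reference described above can be illustrated with a short sketch that compares the profile IDs referenced by a Transport Node payload against the profiles that actually exist. It searches the payload recursively for objects shaped like the reference in the log excerpt in the Issue section ("resource_type" next to "profile_id"), so it does not depend on where in the payload the reference sits. The "results" and "id" field names of the profile listing are assumptions, and both function names are hypothetical.

```python
def referenced_profile_ids(payload, resource_type="BfdHealthMonitoringProfile"):
    """Collect every profile_id that appears next to the given resource_type."""
    found = set()
    stack = [payload]
    while stack:
        item = stack.pop()
        if isinstance(item, dict):
            if item.get("resource_type") == resource_type and "profile_id" in item:
                found.add(item["profile_id"])
            stack.extend(item.values())
        elif isinstance(item, list):
            stack.extend(item)
    return found


def dangling_profile_ids(transport_node, tzp_listing):
    """Return referenced profile IDs that are absent from the TZP listing.

    transport_node: parsed JSON from GET /api/v1/transport-nodes/<tn-id>
    tzp_listing:    parsed JSON from
                    GET /api/v1/transportzone-profiles?include_system_owned=true
                    (assumed to list profiles under "results", each with an "id")
    """
    existing = {profile.get("id") for profile in tzp_listing.get("results", [])}
    return referenced_profile_ids(transport_node) - existing
```

A non-empty result for a Transport Node means it points at a profile that the Manager no longer has, which matches the cause described in this section.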

Resolution

To resolve this issue for Edge nodes, open a support case with Broadcom support.

To resolve this issue for ESXi hosts, follow these steps:

NOTE: In one case, removing the Transport Node (Host) from the cluster, waiting for the NSX VIBs to uninstall, and then re-adding the Transport Node to the cluster resolved the Unknown state.

  1. Take a new SFTP/FTP-based backup and ensure the backup passphrase is known before proceeding.
  2. Copy the attached logical-migration.jar file to one of the NSX Manager nodes and place it in the directory /opt/vmware/upgrade-coordinator-tomcat/temp/.
  3. Stop the proton service on all three Manager nodes from the root shell:

    # service proton stop

    Note: Do not use the "/etc/init.d/proton stop" command, because the proton service would start again after a few seconds without user intervention.
    Note: The service proton stop/start/restart commands must be executed as root.

  4. On the NSX Manager node where the jar file was copied, run the following command. It is a single-line command with no line breaks.
    Note: Replace ENTER_ADMIN_PASSWORD_HERE in the command with the admin password of the NSX Manager.

    # java -Dcorfu-property-file-path=/opt/vmware/upgrade-coordinator-tomcat/conf/ufo-factory.properties -Djava.io.tmpdir=/opt/vmware/upgrade-coordinator-tomcat/temp -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Dlog4j.configurationFile=/opt/vmware/upgrade-coordinator-tomcat/conf/log4j2.xml -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/opt/vmware/upgrade-coordinator-tomcat/conf/logging.properties -Dnsx-service-type=nsx-manager -DTransportZoneProfileRectifierInTNAndTNP.userName=admin -DTransportZoneProfileRectifierInTNAndTNP.password='ENTER_ADMIN_PASSWORD_HERE' -DTransportZoneProfileRectifierInTNAndTNP.updateTn=true -DTransportZoneProfileRectifierInTNAndTNP.updateTzp=true -cp /opt/vmware/upgrade-coordinator-tomcat/temp/logical-migration.jar com.vmware.nsx.management.migration.impl.TransportZoneProfileRectifierInTNAndTNP

      Note: If the file has been downloaded multiple times, make sure the file used is named logical-migration.jar. If it is named logical-migration(1).jar, the command will not run.
  5. Set the file ownership:

    # chown uuc:uuc /var/log/upgrade-coordinator/upgrade-coordinator*log* 

  6. The procedure is complete once the following text is printed in the upgrade-coordinator log file:

    # grep "Migration task finished" /var/log/upgrade-coordinator/upgrade-coordinator.log

  7. Start proton on all three Manager nodes:

    # service proton start
    or 
    # /etc/init.d/proton start

  8. Execute the following from the NSX CLI on all NSX Manager nodes so that the corfudb and Search indexes are in sync:

     > start search resync policy
     > start search resync manager
  9. Log in to the NSX UI and validate that the host status is resolved.
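The completion check in step 6 can also be polled rather than run by hand. The sketch below is one way to do that on any machine with Python 3 and read access to the log file. The timeout and polling interval are arbitrary assumptions, and the function names are hypothetical. Like the grep in step 6, it also matches a marker left by an earlier run.

```python
import time
from pathlib import Path

MARKER = "Migration task finished"
LOG_PATH = "/var/log/upgrade-coordinator/upgrade-coordinator.log"


def marker_logged(log_path=LOG_PATH, marker=MARKER):
    """Return True if the marker text appears anywhere in the log file."""
    try:
        text = Path(log_path).read_text(errors="replace")
    except FileNotFoundError:
        return False
    return marker in text


def wait_for_marker(log_path=LOG_PATH, marker=MARKER, timeout_s=1800, poll_s=10):
    """Poll the log until the marker appears or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if marker_logged(log_path, marker):
            return True
        time.sleep(poll_s)
    # One final check so a marker written during the last sleep is not missed.
    return marker_logged(log_path, marker)
```

Start proton (step 7) only after the marker has been confirmed.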

Note: In some cases it may be necessary to detach and reattach the TNP on the impacted cluster to fully resolve the issue.

When Service VMs are deployed in the cluster that contains the affected host transport nodes, detaching the TNP fails with the error "Error: Cluster ########-####-####-####-########bed1:domain-c10 has NSX managed service VM deployed or deployment is in progress. Delete these deployment, before deleting TN. (Error code: 26173)". In that scenario, follow the steps below instead of detaching and re-attaching the TNP:

  1. Place the host into maintenance mode.
  2. Run the API "GET https://{{nsx-ip}}/api/v1/transport-nodes/<tn-id>" to get the payload.
  3. Using the same payload, run the API "PUT https://{{nsx-ip}}/api/v1/transport-nodes/<tn-id>".
  4. Wait a couple of minutes and check the configuration state of the host.
  5. Check the tunnel status and the overall status of the host as reported by the NSX Manager.
  6. Repeat these steps for each affected host.
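Steps 2 and 3 above (GET the payload, then PUT the same payload back) can be sketched as follows. This is an illustration under stated assumptions, not a supported script: it assumes HTTP basic authentication and a Manager certificate that the client trusts, and the helper names make_sender and resubmit_transport_node are hypothetical. The host should already be in maintenance mode (step 1) before this is run.

```python
import json
import urllib.request
from base64 import b64encode


def make_sender(user, password):
    """Return a send(method, url, body) callable using basic authentication."""
    token = b64encode(f"{user}:{password}".encode()).decode()

    def send(method, url, body=None):
        request = urllib.request.Request(url, data=body, method=method)
        request.add_header("Authorization", f"Basic {token}")
        request.add_header("Accept", "application/json")
        if body is not None:
            request.add_header("Content-Type", "application/json")
        with urllib.request.urlopen(request) as response:
            return response.read()

    return send


def resubmit_transport_node(base_url, tn_id, send):
    """GET the Transport Node payload and PUT it back unchanged."""
    url = f"{base_url}/api/v1/transport-nodes/{tn_id}"
    payload = send("GET", url)
    json.loads(payload)  # fail early if the response is not valid JSON
    return send("PUT", url, payload)
```

For example, resubmit_transport_node("https://<nsx-ip>", "<tn-id>", make_sender("admin", password)) performs both calls for one host. The payload is sent back byte for byte, so nothing in the Transport Node definition is changed by the script itself.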

Attachments

logical-migration.jar