NSX Transport Nodes (Edge/ESXi host) show "Down/Degraded" due to a DNS issue

Article ID: 407410

Products

VMware NSX

Issue/Introduction

  • The status of the ESXi transport node appears as Degraded
  • The Node Status of the Edges is displayed as Down
  • You may observe the following alarm, raised when the reverse DNS lookup for a Manager node fails:
    Reverse DNS lookup failed for Manager node ######## with IP address ###### and the publish_fqdns flag was set

  • You may also observe an alarm indicating that the transport node’s control plane connection to the Manager node is down:
    The Transport node ###### control plane connection to Manager node ##### is down for at least 3 minutes from the Transport node's point of view.

  • The get controllers command on ESXi or Edge transport nodes may return only the header row, with no controller entries:
    # get controllers
    Controller IP    Port     SSL         Status       Is Physical Master   Session State  Controller FQDN           Failure Reason
  • Alternatively, the controllers are listed as disconnected or not used, with MAINTAINANCE_MODE as the failure reason:

    # get controllers
     Controller IP    Port     SSL         Status       Is Physical Master   Session State  Controller FQDN                     Failure Reason         
    x.x.x.x         1235   enabled      not used            false              null       xxxxxxxxxxxxxxxxxxxxxxxxxxxx        MAINTAINANCE_MODE       
    x.x.x.x         1235   enabled    disconnected           true              down       xxxxxxxxxxxxxxxxxxxxxxxxxxxx        MAINTAINANCE_MODE       
    x.x.x.x         1235   enabled      not used            false              null       xxxxxxxxxxxxxxxxxxxxxxxxxxxx        MAINTAINANCE_MODE

 

  • The following entries appear in /var/log/proton/nsx-api.log on the NSX Manager:

    2025-07-19T10:20:45.672Z ERROR workerTaskExecutor-1-45 ControllerUtils 5094 FABRIC [nsx@6876 comp="nsx-manager" errorCode="MP2119" level="ERROR" subcomp="manager"] Not sending ControllerInfoMsg for controller ClusterNodeConfigModel/######-####-####-####-####### as reverse DNS lookup for its IP <Edge's IP> failed
    2025-07-19T10:20:45.674Z  INFO workerTaskExecutor-1-45 DnsLookupProviderImpl 5094 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] No cached value for key: <Edge's IP> in fqdnToIpMap/ipToFqdnMap, will try to get data from IpAddressUtils
    2025-07-19T10:20:45.674Z  INFO workerTaskExecutor-1-45 Utils 5094 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] getFqdnFromIp(): invoked with Ip Address <Edge's IP>
  • On some Edge nodes, the controller info file (/etc/vmware/nsx/controller-info.xml) may be empty
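
The log symptom listed above can be checked with a short script. The following is a minimal Python sketch, not an NSX tool: it scans a copy of nsx-api.log (for example, one extracted from a support bundle) for the MP2119 reverse DNS error. The regular expression is derived only from the sample line shown in this article, so the wording may differ between NSX versions, and the function names are hypothetical.

```python
import re

# Assumption: pattern taken from the single sample log line in this article.
REVERSE_DNS_FAILURE = re.compile(
    r'errorCode="MP2119".*reverse DNS lookup for its IP (?P<ip>\S+) failed'
)


def find_reverse_dns_failures(lines):
    """Return the IP addresses named in reverse DNS failure log lines."""
    hits = []
    for line in lines:
        match = REVERSE_DNS_FAILURE.search(line)
        if match:
            hits.append(match.group("ip"))
    return hits


def scan_log(path="/var/log/proton/nsx-api.log"):
    """Scan a log file (hypothetical helper) and return the failing IPs."""
    with open(path, errors="replace") as handle:
        return find_reverse_dns_failures(handle)
```

A non-empty result from scan_log() suggests that reverse lookups were failing for the listed addresses at the time the log was written.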






Environment

  • NSX Data Center
  • vSphere ESX

Cause

  • The issue occurs when DNS is misconfigured or unavailable (for example, during a DNS outage). NSX does not handle this condition gracefully, so the transport nodes lose controller connectivity.
  • When a DNS lookup fails, NSX still sends controller messages to the transport nodes, but the controller details affected by the failed lookup are skipped, leaving the message contents unpopulated.

Resolution

 

  • Verify and Fix DNS

    • Resolve any DNS issues in the environment

    • Ensure that both forward (A record) and reverse (PTR record) DNS lookups are properly configured on the DNS server
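
One way to confirm that forward and reverse lookups agree is sketched below in Python. This is an illustrative check, not part of NSX: the function name check_dns and the example FQDN and IP address are hypothetical. It uses the resolver of the machine where it runs (which may also consult a local hosts file) and covers IPv4 only, so run it from a host that uses the same DNS servers as the NSX nodes.

```python
import socket


def check_dns(fqdn, ip, forward=socket.gethostbyname_ex, reverse=socket.gethostbyaddr):
    """Return a list of problems; an empty list means both lookups agree.

    The forward/reverse parameters exist only so the lookups can be
    replaced in tests; by default the system resolver is used.
    """
    problems = []

    # Forward lookup: the FQDN should resolve to the expected IP address.
    try:
        _, _, addresses = forward(fqdn)
        if ip not in addresses:
            problems.append(f"forward lookup of {fqdn} returned {addresses}, expected {ip}")
    except OSError as exc:
        problems.append(f"forward lookup of {fqdn} failed: {exc}")

    # Reverse lookup: the IP address should resolve back to the FQDN.
    try:
        name, aliases, _ = reverse(ip)
        names = {n.lower().rstrip(".") for n in [name, *aliases]}
        if fqdn.lower().rstrip(".") not in names:
            problems.append(f"reverse lookup of {ip} returned {sorted(names)}, expected {fqdn}")
    except OSError as exc:
        problems.append(f"reverse lookup of {ip} failed: {exc}")

    return problems


if __name__ == "__main__":
    # Hypothetical Manager FQDN and IP; replace with real values.
    for problem in check_dns("nsx-mgr-01.example.com", "192.0.2.10"):
        print(problem)
```

Run the check once for each NSX Manager node. Any printed line points to a record that still needs to be fixed before moving on to the recovery steps.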

  • After resolving the DNS issue, recover the transport nodes

    a. Recover ESXi Transport Nodes

    • Restart the management services on all affected ESXi transport nodes to bring them out of the degraded state:
      services.sh restart

    b. Recover Edge Nodes

    1. Restart the local controller service on the standby Edge node:
      restart local-controller

    2. Place the Edge node in maintenance mode for a few seconds.

    3. Exit maintenance mode.

    4. Wait until the Edge node status returns to Healthy.

    5. If an Edge node has an empty controller-info.xml file (/etc/vmware/nsx/controller-info.xml), copy the file from a healthy Edge node into the same directory.

    6. Restart the NSX proxy service on the affected Edge node:
      /etc/init.d/nsx-proxy restart

    7. Wait until all nodes report a Healthy state.
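
To confirm recovery, the get controllers output described in the Issue section can be checked again on each transport node. The Python sketch below parses output saved from the CLI; it is illustrative only, and the names are hypothetical. It assumes that columns are separated by two or more spaces, as in the sample in this article, and that a healthy entry shows a Status of "connected" with a Session State of "up". Both are assumptions that should be verified against the output of your NSX version.

```python
import re

COLUMNS = [
    "ip", "port", "ssl", "status", "is_physical_master",
    "session_state", "fqdn", "failure_reason",
]


def parse_get_controllers(output):
    """Parse saved 'get controllers' output into a list of dicts."""
    rows = []
    for line in output.splitlines():
        # Assumption: columns are separated by at least two spaces.
        fields = re.split(r"\s{2,}", line.strip())
        # Skip the prompt line, the header row, and blank lines.
        if len(fields) < 7 or fields[0] == "Controller IP":
            continue
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def has_connected_controller(rows):
    """Assumed health rule: at least one controller is connected and up."""
    return any(
        row.get("status") == "connected" and row.get("session_state") == "up"
        for row in rows
    )
```

An empty list from parse_get_controllers() corresponds to the "header only" symptom, and a False result from has_connected_controller() indicates that the node has not yet re-established its controller session.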




Additional Information

If the steps in this KB do not resolve the issue, open a support case with Broadcom Support and select NSX as the product.

For a similar issue, refer to the following KB:
https://knowledge.broadcom.com/external/article/424895

Handling Log Bundles for offline review with Broadcom support.