NSX transport nodes show MPA DOWN after upgrade with corrupted FQDN entries in appliance-info.xml

Products

VMware NSX

Issue/Introduction

After an NSX Manager upgrade, transport nodes (ESXi hosts, Edge nodes, or both) report the Management Plane Agent (MPA) connectivity status as DOWN with heartbeat failures. Host preparation fails at 48% with the error "Waiting for Connection to Managers" or "Time out waiting for host to join NSX Manager."

When get managers is run from the NSX CLI on an affected ESXi host, the output shows a corrupted FQDN entry that contains a DNS error message instead of the NSX Manager FQDN:

;; communications error to ##.##.##.##:53: timed out Unable to resolve fqdn

On Edge nodes, the syslog shows nsx-proxy attempting to connect to an invalid hostname containing the DNS error message:

StreamConnection Couldn't resolve 'ssl://;; communications error to ##.##.##.##:53: timed out:1234' (error: 1-Host not found)

The NSX Manager UI shows the transport nodes with connection status DOWN and alarms indicating MPA heartbeat failures.

Steps to validate

SSH to an affected ESXi host and run:

   nsxcli -c get managers

If the output shows a DNS error instead of the NSX Manager FQDN, the host is affected:

   ;; communications error to <ip>:53: timed out Unable to resolve fqdn

Check the appliance-info.xml file for corrupted FQDN entries:

   grep fqdn /etc/vmware/nsx/appliance-info.xml

If the output shows entries like <fqdn>;; communications error to ##.##.##.##:53: timed out</fqdn> instead of valid FQDNs, the transport node is affected by this issue.
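
The check above can be narrowed so that only the corrupted entries are printed, together with their line numbers. The following is a minimal sketch, not a supported tool: the function name list_corrupt_fqdn is hypothetical, and it assumes each <fqdn> element sits on a single line of the file.

```shell
#!/bin/sh
# Print <fqdn> elements whose value carries the DNS error text instead of a
# hostname. Assumption: each <fqdn>...</fqdn> element is on one line.
list_corrupt_fqdn() {
    file=$1
    # -n prefixes every match with its line number, which helps when the
    # entry has to be edited later. grep exits non-zero when nothing matches.
    grep -n '<fqdn>[^<]*communications error[^<]*</fqdn>' "$file"
}

# Example (path taken from this article):
#   list_corrupt_fqdn /etc/vmware/nsx/appliance-info.xml || echo "no corrupted entries"
```

A non-zero exit status means no entry matched the error signature.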

On Edge nodes, check the syslog for similar errors:

   grep 'communications error' /var/log/syslog | grep mpa-proxy-lib

Look for entries indicating corrupted FQDN values:

   ProcessConfig: fqdnv4 = ;; communications error to <ip>:53: timed out
   Couldn't resolve 'ssl://;; communications error to <ip>:53: timed out:1234'
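
The error text also names the DNS server that timed out, which is useful when deciding which name-server entries to remove. A small sketch, assuming IPv4 addresses in the <ip>:53 form shown above; the function name stale_dns_servers is hypothetical.

```shell
#!/bin/sh
# Read log lines on stdin and print each DNS server address that appears in a
# "communications error to <ip>:53" message, once, in sorted order.
# Assumption: IPv4 addresses written as <ip>:53, as in the samples in this article.
stale_dns_servers() {
    grep -oE 'communications error to [0-9]+(\.[0-9]+){3}:53' |
        sed 's/^communications error to //; s/:53$//' |
        sort -u
}

# Example:
#   grep mpa-proxy-lib /var/log/syslog | stale_dns_servers
```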

On the NSX Manager, check nsxapi.log for related errors:

   grep -E "Invalid ip string|Time out waiting for host" /var/log/proton/nsxapi.log

Look for entries such as:

   Invalid ip string '<fqdn>'. Error parsing '<hostname>'
   Time out waiting for host to join NSX Manager

Environment

VMware NSX
VMware vSphere ESXi

Cause

This issue occurs when a stale or unreachable DNS server is configured on the NSX Manager at the time of transport node registration or configuration update.

During processing of a discovery request from a transport node, the NSX Manager's messaging component attempts to resolve FQDNs using the dnsLookupProvider.getFqdnFromIp() method. When the DNS lookup fails due to an unreachable or misconfigured DNS server, the method incorrectly returns the DNS error message (such as ";; communications error to ##.##.##.##:53: timed out") as the FQDN value instead of returning NULL.

This corrupted FQDN string is then sent to the transport nodes via the DiscoveryResponse and persisted in the /etc/vmware/nsx/appliance-info.xml file. When the transport node attempts to establish management plane connectivity, it tries to resolve this invalid hostname string, which fails and causes the MPA connection to remain DOWN.
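
To illustrate why the stored value can never resolve, the sketch below applies a basic hostname syntax check to a string. It is an illustration only and does not represent how NSX Manager validates lookup results; the function name is_valid_fqdn is hypothetical.

```shell
#!/bin/sh
# Return 0 when the argument is a syntactically plausible DNS name:
# dot-separated labels of letters, digits and hyphens, 1 to 63 characters
# each, not starting or ending with a hyphen, at most 253 characters overall.
is_valid_fqdn() {
    name=$1
    [ "${#name}" -le 253 ] || return 1
    printf '%s\n' "$name" |
        grep -Eq '^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*$'
}

# Example:
#   is_valid_fqdn ';; communications error to 10.0.0.1:53: timed out' || echo "not a hostname"
```

The DNS error text fails this check because it contains spaces, semicolons, and colons, none of which are allowed in a hostname.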

Resolution

To resolve this issue, first verify and fix the DNS name-server configuration on all affected components, then manually correct the corrupted FQDN entries in the appliance-info.xml file on each affected transport node.

Step 1: Verify and update DNS name-servers on NSX Manager

Before correcting transport nodes, ensure the NSX Manager has the correct DNS configuration. For detailed instructions, see Updating DNS server details for NSX-T Manager cluster.

SSH to the NSX Manager as admin.
Check the current name-server configuration:

   get name-servers

If any stale or incorrect DNS servers are listed, delete them:

   del name-server <old-dns-ip>

Add the correct DNS server(s):

   set name-servers <correct-dns-ip>

Verify the updated configuration:

   get name-servers

Repeat the steps above on each NSX Manager node in the cluster.

Step 2: Verify and update DNS name-servers on ESXi hosts

SSH to the affected ESXi host.
Check the current DNS server configuration:

   esxcli network ip dns server list

If any stale or incorrect DNS servers are listed, remove them:

   esxcli network ip dns server remove --server=<old-dns-ip>

Add the correct DNS server(s):

   esxcli network ip dns server add --server=<correct-dns-ip>

Verify the updated configuration:

   esxcli network ip dns server list

Repeat the steps above on each affected ESXi host.
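
When many hosts need checking, the comparison between configured and expected DNS servers can be scripted. The sketch below is hypothetical (the function name unexpected_dns_servers is not part of any VMware tool) and assumes the list command prints a single "DNSServers: a, b" line; adjust the parsing if the output on your build differs.

```shell
#!/bin/sh
# Read `esxcli network ip dns server list` output on stdin and print every
# configured server that is NOT among the approved addresses given as arguments.
# Assumption: the output is a single "DNSServers: a, b, c" line.
unexpected_dns_servers() {
    approved=" $* "
    sed -n 's/^[[:space:]]*DNSServers:[[:space:]]*//p' |
        tr ',' '\n' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//; /^$/d' |
        while IFS= read -r server; do
            case "$approved" in
                *" $server "*) ;;              # expected entry, print nothing
                *) printf '%s\n' "$server" ;;  # configured but not approved
            esac
        done
}

# Example (the addresses are placeholders):
#   esxcli network ip dns server list | unexpected_dns_servers 10.0.0.1 10.0.0.2
```

Each address printed is a candidate for the remove command shown in this step.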

Step 3: Verify and update DNS name-servers on Edge nodes

For detailed instructions on modifying Edge DNS configuration via CLI, see Modify NSX Edge DNS configuration information by command line.

SSH to the affected Edge node as admin.
Check the current name-server configuration:

   get name-servers

If any stale or incorrect DNS servers are listed, delete them:

   del name-server <old-dns-ip>

Add the correct DNS server(s):

   set name-servers <correct-dns-ip>

Verify the updated configuration:

   get name-servers

Repeat the steps above on each affected Edge node.

Step 4: Correct appliance-info.xml on ESXi hosts

SSH to the affected ESXi host.
Back up the current appliance-info.xml file:

   cp /etc/vmware/nsx/appliance-info.xml /etc/vmware/nsx/appliance-info.xml.backup

Open the appliance-info.xml file in a text editor:

   vi /etc/vmware/nsx/appliance-info.xml

Locate the corrupted <fqdn> entries that contain the DNS error message. Replace each corrupted entry with the correct NSX Manager FQDN. Example of a corrupted entry:

   <fqdn>;; communications error to ##.##.##.##:53: timed out</fqdn>

Corrected entry:

   <fqdn>nsx-manager1.example.com</fqdn>

Note: The file contains multiple <fqdn> entries, one for each NSX Manager in the cluster. Replace every corrupted entry with the FQDN of the corresponding manager.

Save the file and exit the editor.
Restart the nsx-proxy service:

   /etc/init.d/nsx-proxy restart

Verify connectivity to the NSX Manager:

   esxcli network ip connection list | grep 1234

Established connections on port 1234 to the NSX Manager IPs indicate successful connectivity.

Once connectivity is confirmed and the transport node shows as connected in the NSX Manager UI, remove the backup file:

   rm /etc/vmware/nsx/appliance-info.xml.backup
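
As an alternative to editing the file interactively, a single entry can be replaced by line number. This is a hedged sketch, not a supported procedure: set_fqdn_on_line is a hypothetical helper, it assumes each <fqdn> element sits on one line, and it should only be run after the backup above has been taken.

```shell
#!/bin/sh
# Replace the value of the <fqdn> element on one line of an XML file.
# Hypothetical helper; assumes each <fqdn>...</fqdn> element is on one line.
# Usage: set_fqdn_on_line FILE LINE_NUMBER NEW_FQDN
set_fqdn_on_line() {
    file=$1
    line=$2
    fqdn=$3
    # Accept only digits for the line number and hostname characters for the
    # name, so that neither value can alter the sed expression.
    case $line in ''|*[!0-9]*) echo "bad line number: $line" >&2; return 1 ;; esac
    case $fqdn in ''|*[!A-Za-z0-9.-]*) echo "bad FQDN: $fqdn" >&2; return 1 ;; esac
    scratch="$file.tmp.$$"
    sed "${line}s|<fqdn>[^<]*</fqdn>|<fqdn>${fqdn}</fqdn>|" "$file" > "$scratch" ||
        { rm -f "$scratch"; return 1; }
    # Write back through the original path so ownership and permissions are kept.
    cat "$scratch" > "$file" && rm -f "$scratch"
}

# Example (line number and FQDN are placeholders):
#   set_fqdn_on_line /etc/vmware/nsx/appliance-info.xml 12 nsx-manager1.example.com
```

Take the line number from grep -n output, and confirm which manager each entry belongs to before replacing it.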

Step 5: Correct appliance-info.xml on Edge nodes

SSH to the affected Edge node as admin.
Back up the current appliance-info.xml file:

   cp /etc/vmware/nsx/appliance-info.xml /etc/vmware/nsx/appliance-info.xml.backup

Edit the appliance-info.xml file:

   vi /etc/vmware/nsx/appliance-info.xml

Locate and correct the corrupted <fqdn> entries with the correct NSX Manager FQDNs, as described above for ESXi hosts.
Save the file and exit the editor.
Restart the nsx-proxy service:

   /etc/init.d/nsx-proxy restart

Verify connectivity to the NSX Manager:

   netstat -anp | grep 1234

Established connections on port 1234 indicate successful connectivity.

Once connectivity is confirmed and the Edge node shows as connected in the NSX Manager UI, remove the backup file:

   rm /etc/vmware/nsx/appliance-info.xml.backup
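
A plain grep for 1234 in the connectivity checks of Steps 4 and 5 can also match process IDs, world IDs, or local ports. The sketch below counts only established connections whose remote port is 1234. It assumes the foreign address is the fifth column and the state the sixth, which is the usual layout of netstat-style listings; count_mp_connections is a hypothetical name.

```shell
#!/bin/sh
# Count ESTABLISHED connections whose remote (foreign) port is 1234.
# Reads a connection listing on stdin.
# Assumption: column 5 is the foreign address and column 6 is the state.
count_mp_connections() {
    awk '$5 ~ /:1234$/ && $6 == "ESTABLISHED" { n++ } END { print n + 0 }'
}

# Examples:
#   esxcli network ip connection list | count_mp_connections   # ESXi host
#   netstat -anp | count_mp_connections                        # Edge node
```

A result of 0 after restarting nsx-proxy suggests the transport node still cannot reach any NSX Manager.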

Verification commands summary

Use these commands to verify the configuration on each component:

NSX Manager: get name-servers
ESXi host: esxcli network ip dns server list
ESXi host: grep fqdn /etc/vmware/nsx/appliance-info.xml
ESXi host: nsxcli -c get managers
Edge node: get name-servers
Edge node: grep fqdn /etc/vmware/nsx/appliance-info.xml

Preventive measures

To prevent this issue from recurring:

Verify DNS server configuration on NSX Managers before performing upgrades or adding transport nodes.
Ensure all configured DNS servers are reachable from the NSX Manager nodes:

   nslookup <nsx-manager-fqdn> <dns-server-ip>

Remove any stale or decommissioned DNS server entries from the NSX Manager configuration before making changes to transport nodes.
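
The reachability check above can be repeated for every configured DNS server in one pass. A minimal sketch under two assumptions: the lookup tool exits with a non-zero status when the query times out or fails, and it is run from a shell where nslookup is available. The function name failing_dns_servers and the LOOKUP_CMD override are hypothetical.

```shell
#!/bin/sh
# Print every DNS server that fails to resolve the given name.
# Usage: failing_dns_servers FQDN SERVER [SERVER...]
# LOOKUP_CMD is a hypothetical override (handy for testing); default is nslookup.
failing_dns_servers() {
    fqdn=$1
    shift
    for server in "$@"; do
        # Assumption: the lookup tool exits non-zero on a timeout or failed lookup.
        if ! "${LOOKUP_CMD:-nslookup}" "$fqdn" "$server" >/dev/null 2>&1; then
            printf '%s\n' "$server"
        fi
    done
}

# Example (name and addresses are placeholders):
#   failing_dns_servers nsx-manager1.example.com 10.0.0.1 10.0.0.2
```

Any address printed should be fixed or removed from the NSX Manager configuration before an upgrade or before adding transport nodes.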

If the error persists after following these steps, contact Broadcom Support for further assistance.