After upgrading NSX Manager, transport nodes (ESXi hosts and/or Edge nodes) display MPA connectivity status as DOWN with heartbeat failures. Host preparation fails at 48% with the error "Waiting for Connection to Managers" or "Time out waiting for host to join NSX Manager."
Running get managers from the NSX CLI on an affected ESXi host shows a corrupted FQDN entry that contains a DNS error message instead of the NSX Manager FQDN:
;; communications error to ##.##.##.##:53: timed out Unable to resolve fqdn
On Edge nodes, the syslog shows nsx-proxy attempting to connect to an invalid hostname containing the DNS error message:
StreamConnection Couldn't resolve 'ssl://;; communications error to ##.##.##.##:53: timed out:1234' (error: 1-Host not found)
The NSX Manager UI shows the transport nodes with connection status DOWN and alarms indicating MPA heartbeat failures.
nsxcli -c get managers
If the output shows a DNS error instead of the NSX Manager FQDN, the host is affected:
;; communications error to <ip>:53: timed out Unable to resolve fqdn
grep fqdn /etc/vmware/nsx/appliance-info.xml
If the output shows entries like <fqdn>;; communications error to ##.##.##.##:53: timed out</fqdn> instead of valid FQDNs, the transport node is affected by this issue.
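The file check above can be wrapped in a small helper. This is a minimal sketch, assuming corrupted entries always contain the text "communications error" as in the examples above. The function name count_corrupted_fqdn is hypothetical and not part of NSX.

```shell
#!/bin/sh
# Count lines with an <fqdn> element that holds a DNS error message.
# Assumption: corrupted values contain the text "communications error".
count_corrupted_fqdn() {
    file="$1"
    # grep -c prints the number of matching lines. It exits non-zero
    # when that number is 0, so "|| true" keeps the exit status clean.
    grep -c '<fqdn>.*communications error' "$file" || true
}
```

For example, `count_corrupted_fqdn /etc/vmware/nsx/appliance-info.xml` should print 0 on a healthy transport node and a positive number on an affected one.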
grep 'communications error' /var/log/syslog | grep mpa-proxy-lib
Look for entries indicating corrupted FQDN values:
ProcessConfig: fqdnv4 = ;; communications error to <ip>:53: timed out
Couldn't resolve 'ssl://;; communications error to <ip>:53: timed out:1234'
grep -E "Invalid ip string|Time out waiting for host" /var/log/proton/nsxapi.log
Look for entries such as:
Invalid ip string '<fqdn>'. Error parsing '<hostname>'
Time out waiting for host to join NSX Manager
This issue occurs when a stale or unreachable DNS server is configured on the NSX Manager at the time of transport node registration or configuration update.
During processing of a discovery request from a transport node, the NSX Manager's messaging component attempts to resolve FQDNs using the dnsLookupProvider.getFqdnFromIp() method. When the DNS lookup fails due to an unreachable or misconfigured DNS server, the method incorrectly returns the DNS error message (such as ";; communications error to ##.##.##.##:53: timed out") as the FQDN value instead of returning NULL.
This corrupted FQDN string is then sent to the transport nodes via the DiscoveryResponse and persisted in the /etc/vmware/nsx/appliance-info.xml file. When the transport node attempts to establish management plane connectivity, it tries to resolve this invalid hostname string, which fails and causes the MPA connection to remain DOWN.
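To illustrate why the error text can never work as a hostname, the sketch below tests a string against the usual hostname character set (letters, digits, hyphens, and dots). The function name is_plausible_fqdn is hypothetical, and the check is a simplification for illustration, not the logic NSX itself uses.

```shell
#!/bin/sh
# Return 0 if the argument looks like a hostname, 1 otherwise.
is_plausible_fqdn() {
    case "$1" in
        # Reject empty strings and any character outside letters,
        # digits, dots, and hyphens (spaces, ';' and ':' all fail).
        ''|*[!A-Za-z0-9.-]*) return 1 ;;
        # Reject a leading or trailing dot, a leading hyphen,
        # and empty labels ("..").
        .*|*.|-*|*..*) return 1 ;;
        *) return 0 ;;
    esac
}
```

A value such as nsx-manager1.example.com passes this check, while the DNS error message fails on its first character.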
To resolve this issue, first verify and correct DNS name-server configuration on all affected components, then manually correct the corrupted FQDN entries in the appliance-info.xml file on each affected transport node.
Before correcting transport nodes, ensure the NSX Manager has the correct DNS configuration. For detailed instructions, see Updating DNS server details for NSX-T Manager cluster.
get name-servers
del name-server <old-dns-ip>
set name-servers <correct-dns-ip>
get name-servers
esxcli network ip dns server list
esxcli network ip dns server remove --server=<old-dns-ip>
esxcli network ip dns server add --server=<correct-dns-ip>
esxcli network ip dns server list
For detailed instructions on modifying Edge DNS configuration via CLI, see Modify NSX Edge DNS configuration information by command line.
get name-servers
del name-server <old-dns-ip>
set name-servers <correct-dns-ip>
get name-servers
cp /etc/vmware/nsx/appliance-info.xml /etc/vmware/nsx/appliance-info.xml.backup
vi /etc/vmware/nsx/appliance-info.xml
Locate the <fqdn> entries that contain the DNS error message. Replace each corrupted entry with the correct NSX Manager FQDN.
Example of corrupted entry:
<fqdn>;; communications error to ##.##.##.##:53: timed out</fqdn>
Corrected entry:
<fqdn>nsx-manager1.example.com</fqdn>
Note: There will be multiple <fqdn> entries in the file, one for each NSX Manager in the cluster. Correct all corrupted entries with the appropriate FQDN for each manager.
/etc/init.d/nsx-proxy restart
esxcli network ip connection list | grep 1234
Established connections on port 1234 to the NSX Manager IPs indicate successful connectivity.
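A plain grep for 1234 also matches unrelated rows, such as a local port of 51234. The following sketch narrows the output to established connections whose foreign port is 1234. It assumes awk is available in the shell and that the foreign address and state are the fifth and sixth whitespace-separated fields of each row. The function name established_to_1234 is hypothetical.

```shell
#!/bin/sh
# Read connection-list output on stdin and print only ESTABLISHED
# connections whose foreign address ends in :1234.
# Assumption: field 5 is the foreign address and field 6 is the state.
established_to_1234() {
    awk '$5 ~ /:1234$/ && $6 == "ESTABLISHED"'
}
```

Usage on an ESXi host: `esxcli network ip connection list | established_to_1234`. No output means no established connection to a manager was found.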
rm /etc/vmware/nsx/appliance-info.xml.backup
cp /etc/vmware/nsx/appliance-info.xml /etc/vmware/nsx/appliance-info.xml.backup
vi /etc/vmware/nsx/appliance-info.xml
Replace the corrupted <fqdn> entries with the correct NSX Manager FQDNs, as described above for ESXi hosts.
/etc/init.d/nsx-proxy restart
netstat -anp | grep 1234
Established connections on port 1234 indicate successful connectivity.
rm /etc/vmware/nsx/appliance-info.xml.backup
Use these commands to verify the configuration on each component:
NSX Manager:
get name-servers
ESXi host:
esxcli network ip dns server list
cat /etc/vmware/nsx/appliance-info.xml | grep fqdn
nsxcli -c get managers
Edge node:
get name-servers
grep fqdn /etc/vmware/nsx/appliance-info.xml
To prevent this issue from recurring:
nslookup <nsx-manager-fqdn> <dns-server-ip>
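The lookup above can be repeated for every manager FQDN and DNS server pair with a small wrapper. This is a sketch only: the function name check_dns is hypothetical, and it assumes nslookup exits with a non-zero status when resolution fails, which is typical but worth confirming on the platform in use.

```shell
#!/bin/sh
# Report whether a DNS server can resolve the given FQDN.
# Returns 0 when nslookup succeeds and 1 otherwise.
# Assumption: nslookup exits non-zero on a failed lookup.
check_dns() {
    fqdn="$1"
    server="$2"
    if nslookup "$fqdn" "$server" >/dev/null 2>&1; then
        echo "OK: $server resolves $fqdn"
    else
        echo "FAIL: $server cannot resolve $fqdn"
        return 1
    fi
}
```

For example, `check_dns <nsx-manager-fqdn> <dns-server-ip>` can be run once per configured DNS server before a transport node is registered or reconfigured.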
If the error persists after following these steps, contact Broadcom Support for further assistance.