After upgrading NSX Manager, transport nodes (ESXi hosts and/or Edge nodes) display MPA connectivity status as DOWN with heartbeat failures. Host preparation fails at 48% with the error "Waiting for Connection to Managers" or "Time out waiting for host to join NSX Manager."
When running get managers from the NSX CLI on an affected ESXi host, the output shows a corrupted FQDN entry containing a DNS error message instead of the proper NSX Manager FQDN:
;; communications error to ##.##.##.##:53: timed out Unable to resolve fqdn
On Edge nodes, the syslog shows nsx-proxy attempting to connect to an invalid hostname containing the DNS error message:
StreamConnection Couldn't resolve 'ssl://;; communications error to ##.##.##.##:53: timed out:1234' (error: 1-Host not found)
The NSX Manager UI shows the transport nodes with connection status DOWN and alarms indicating MPA heartbeat failures.
1. SSH to an affected ESXi host and run:
nsxcli -c get managers
If the output shows a DNS error instead of the NSX Manager FQDN, the host is affected:
;; communications error to <ip>:53: timed out Unable to resolve fqdn
a. Check the appliance-info.xml file for corrupted FQDN entries:
cat /etc/vmware/nsx/appliance-info.xml | grep fqdn
If the output shows entries like <fqdn>;; communications error to ##.##.##.##:53: timed out</fqdn> instead of valid FQDNs, the transport node is affected by this issue.
2. On Edge nodes, check both appliance-info.xml and controller-info.xml for corrupted FQDN entries:
grep fqdn /etc/vmware/nsx/appliance-info.xml
grep fqdn /etc/vmware/nsx/controller-info.xml
When an Edge node is affected, either file may contain entries like:
<fqdn>;; communications error to ##.##.##.##:53: timed out</fqdn>
Check the syslog for similar errors:
grep 'communications error' /var/log/syslog | grep mpa-proxy-lib
Look for entries indicating corrupted FQDN values:
ProcessConfig: fqdnv4 = ;; communications error to <ip>:53: timed out
Couldn't resolve 'ssl://;; communications error to <ip>:53: timed out:1234'
3. On the NSX Manager, check nsxapi.log for related errors:
grep -E "Invalid ip string|Time out waiting for host" /var/log/proton/nsxapi.log
Look for entries such as:
Invalid ip string '<fqdn>'. Error parsing '<hostname>'
Time out waiting for host to join NSX Manager
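The file checks above can be collected into one helper that reports every corrupted entry together with its line number. This is a minimal sketch, not a vendor tool: the function name `find_corrupt_fqdn` is hypothetical, and it assumes each `<fqdn>` element sits on a single line, as the grep output shown earlier suggests.

```shell
#!/bin/sh
# Print every <fqdn> entry that holds DNS error text instead of a hostname.
# Output format: file:line-number:matching-line
# Returns 0 if at least one corrupted entry was found, 1 otherwise.
find_corrupt_fqdn() {
    status=1
    for f in "$@"; do
        # Skip files that do not exist on this node type
        # (for example controller-info.xml on an ESXi host).
        [ -r "$f" ] || continue
        # Passing /dev/null as a second file makes grep print the file name.
        if grep -n '<fqdn>[^<]*communications error' "$f" /dev/null; then
            status=0
        fi
    done
    return "$status"
}
```

For example, `find_corrupt_fqdn /etc/vmware/nsx/appliance-info.xml /etc/vmware/nsx/controller-info.xml` lists the lines that need editing; no output and a non-zero return mean neither file contains the error text.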
This issue occurs when a stale or unreachable DNS server is configured on the NSX Manager at the time of transport node registration or configuration update.
During processing of a discovery request from a transport node, the NSX Manager's messaging component attempts to resolve FQDNs using the dnsLookupProvider.getFqdnFromIp() method. When the DNS lookup fails due to an unreachable or misconfigured DNS server, the method incorrectly returns the DNS error message (such as ";; communications error to ##.##.##.##:53: timed out") as the FQDN value instead of returning NULL.
This corrupted FQDN string is then sent to the transport nodes via the DiscoveryResponse and persisted in the /etc/vmware/nsx/appliance-info.xml file on all transport nodes. On Edge nodes, the corrupted FQDN is additionally persisted in /etc/vmware/nsx/controller-info.xml. When the transport node attempts to establish management plane connectivity, it tries to resolve these invalid hostname strings, which fails and causes the MPA connection to remain DOWN.
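The failure described above comes down to a lookup result being used as a hostname without checking that it looks like one. The sketch below illustrates that kind of guard in shell. It is not NSX code: `is_valid_fqdn` is a hypothetical helper, and its pattern is a simplified approximation of hostname syntax (dot-separated labels of letters, digits, and hyphens, ending in an alphabetic top-level label).

```shell
#!/bin/sh
# Return 0 if the argument looks like a fully qualified domain name,
# 1 otherwise. Simplified check: it rejects single-label names and
# top-level labels that are not purely alphabetic.
is_valid_fqdn() {
    # A DNS name is at most 253 characters long.
    [ "${#1}" -le 253 ] || return 1
    printf '%s\n' "$1" |
        grep -Eq '^([A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+[A-Za-z]{2,63}$'
}
```

With this check, `is_valid_fqdn 'nsx-manager1.example.com'` succeeds, while `is_valid_fqdn ';; communications error to 10.0.0.1:53: timed out'` fails, so the error text would be discarded rather than stored as an FQDN.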
To resolve this issue, first verify and correct DNS name-server configuration on all affected components, then manually correct the corrupted FQDN entries in the XML configuration files on each affected transport node.
Before correcting transport nodes, ensure the NSX Manager has the correct DNS configuration. For detailed instructions, see Updating DNS server details for NSX-T Manager cluster.
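On each NSX Manager node, from the NSX CLI:
a. List the configured DNS servers:
get name-servers
b. Remove the stale or unreachable DNS server:
del name-server <old-dns-ip>
c. Add the correct DNS server:
set name-servers <correct-dns-ip>
d. Confirm the change:
get name-servers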
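On each affected ESXi host:
a. List the configured DNS servers:
esxcli network ip dns server list
b. Remove the stale or unreachable DNS server:
esxcli network ip dns server remove --server=<old-dns-ip>
c. Add the correct DNS server:
esxcli network ip dns server add --server=<correct-dns-ip>
d. Confirm the change:
esxcli network ip dns server list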
For detailed instructions on modifying Edge DNS configuration via CLI, see Modify NSX Edge DNS configuration information by command line.
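On each affected Edge node, from the NSX CLI:
a. List the configured DNS servers:
get name-servers
b. Remove the stale or unreachable DNS server:
del name-server <old-dns-ip>
c. Add the correct DNS server:
set name-servers <correct-dns-ip>
d. Confirm the change:
get name-servers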
On each affected ESXi host:
1. SSH to the ESXi host.
2. Back up the appliance-info.xml file:
cp /etc/vmware/nsx/appliance-info.xml /etc/vmware/nsx/appliance-info.xml.backup
3. Open the appliance-info.xml file in a text editor:
vi /etc/vmware/nsx/appliance-info.xml
4. Locate the <fqdn> entries that contain the DNS error message and replace each corrupted entry with the correct NSX Manager FQDN.
Example of corrupted entry:
<fqdn>;; communications error to ##.##.##.##:53: timed out</fqdn>
Corrected entry:
<fqdn>nsx-manager1.example.com</fqdn>
Note: The file contains multiple <fqdn> entries, one for each NSX Manager in the cluster. Correct every corrupted entry with the FQDN of the manager it belongs to.
5. Save the file and exit the editor.
6. Restart the nsx-proxy service:
/etc/init.d/nsx-proxy restart
7. Verify connectivity to the NSX Managers:
esxcli network ip connection list | grep 1234
Established connections on port 1234 to the NSX Manager IPs indicate successful connectivity.
8. After connectivity is confirmed, remove the backup file:
rm /etc/vmware/nsx/appliance-info.xml.backup
On Edge nodes, both /etc/vmware/nsx/appliance-info.xml and /etc/vmware/nsx/controller-info.xml may contain corrupted FQDN entries. Both files must be corrected.
Correct appliance-info.xml:
a. Back up the file:
cp /etc/vmware/nsx/appliance-info.xml /etc/vmware/nsx/appliance-info.xml.backup
b. Edit the file:
vi /etc/vmware/nsx/appliance-info.xml
c. Locate all corrupted <fqdn> entries and replace each with the correct NSX Manager FQDN, as described in the ESXi host procedure above.
d. Save the file and exit the editor.
Correct controller-info.xml:
a. Back up the file:
cp /etc/vmware/nsx/controller-info.xml /etc/vmware/nsx/controller-info.xml.backup
b. Edit the file:
vi /etc/vmware/nsx/controller-info.xml
c. Locate all corrupted <fqdn> entries. Replace each with the correct NSX Manager FQDN.
Example of corrupted entry:
<fqdn>;; communications error to ##.##.##.##:53: timed out</fqdn>
Corrected entry:
<fqdn>nsx-manager1.example.com</fqdn>
Note: Correct all <fqdn> entries in this file with the appropriate FQDN for each NSX Manager in the cluster.
d. Save the file and exit the editor.
Restart the nsx-proxy service:
/etc/init.d/nsx-proxy restart
Verify connectivity to the NSX Managers:
netstat -anp | grep 1234
Established connections on port 1234 indicate successful connectivity.
After connectivity is confirmed, remove the backup files:
rm /etc/vmware/nsx/appliance-info.xml.backup
rm /etc/vmware/nsx/controller-info.xml.backup
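A plain `grep 1234` also matches unrelated lines, such as a local port 51234 or a process ID containing 1234. The sketch below is a stricter filter, offered as an illustration only. The function name `count_mp_connections` is hypothetical, and it assumes the remote address is the fifth whitespace-separated column and the connection state is the sixth, which is the usual layout of both `esxcli network ip connection list` and `netstat -anp` output for TCP rows; confirm this against the output on your own nodes.

```shell
#!/bin/sh
# Read a connection listing on stdin and print how many rows are
# ESTABLISHED connections whose remote port is 1234.
# Assumption: column 5 = remote address, column 6 = state.
count_mp_connections() {
    awk '$5 ~ /:1234$/ && $6 == "ESTABLISHED" { n++ } END { print n + 0 }'
}
```

On an ESXi host this would be used as `esxcli network ip connection list | count_mp_connections`, and on an Edge node as `netstat -anp | count_mp_connections`. A result of 0 after restarting nsx-proxy suggests the node has not yet connected to any NSX Manager.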
| Component | Command |
|---|---|
| NSX Manager | get name-servers |
| ESXi host | esxcli network ip dns server list |
| ESXi host | grep fqdn /etc/vmware/nsx/appliance-info.xml |
| ESXi host | nsxcli -c get managers |
| Edge node | get name-servers |
| Edge node | grep fqdn /etc/vmware/nsx/appliance-info.xml |
| Edge node | grep fqdn /etc/vmware/nsx/controller-info.xml |
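To prevent this issue from recurring, confirm that every DNS server configured on the NSX Manager is reachable and resolves the NSX Manager FQDNs before upgrading or registering transport nodes:
nslookup <nsx-manager-fqdn> <dns-server-ip>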
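When a cluster has several managers, the same lookup can be repeated for each FQDN against a given DNS server. The following is a minimal sketch: `check_dns` is a hypothetical helper, and it assumes that `nslookup` exits with a non-zero status when a lookup fails, which holds for common implementations but is worth confirming on your platform.

```shell
#!/bin/sh
# Usage: check_dns <dns-server-ip> <fqdn> [<fqdn> ...]
# Prints one OK/FAIL line per name and returns 1 if any lookup failed.
check_dns() {
    server=$1
    shift
    rc=0
    for name in "$@"; do
        # Assumption: nslookup returns non-zero on a failed lookup.
        if nslookup "$name" "$server" >/dev/null 2>&1; then
            echo "OK   $name via $server"
        else
            echo "FAIL $name via $server"
            rc=1
        fi
    done
    return "$rc"
}
```

Running `check_dns <dns-server-ip> <nsx-manager-fqdn>` for each configured DNS server before an upgrade highlights a stale or unreachable server while it can still be corrected.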
If the error persists after following these steps, contact Broadcom Support for further assistance.