<DATE>T17:15:10.555Z Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="CRITICAL" eventFeatureName="communication" eventType="manager_fqdn_lookup_failure" eventSev="critical" eventState="On"] DNS lookup failed for Manager node <UUID> with FQDN <MANAGER_FQDN> and the publish_fqdns flag was set.
<DATE>T17:55:19.212Z Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3832996.569242999 secs
<DATE>T17:56:19.273Z Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3833056.6301390156 secs
<DATE>T17:57:19.341Z Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3833116.698533103 secs
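As a rough aid for spotting this symptom in an exported syslog, the sketch below scans log lines for the metric-collector warning and records the longest run time reported for each metric. The function name stuck_metrics, the regular expression, and the one-hour default threshold are illustrative assumptions and are not part of NSX.

```python
import re

# Matches the metric name and the elapsed seconds in the WARNING entries above.
# The pattern is an assumption derived from those sample lines only.
STUCK_RE = re.compile(
    r"Metric (?P<metric>\S+) last execution .*? not complete, "
    r"running for (?P<secs>[0-9.]+) secs"
)


def stuck_metrics(lines, threshold_secs=3600.0):
    """Return {metric name: longest reported run time} for runs at or above the threshold."""
    worst = {}
    for line in lines:
        match = STUCK_RE.search(line)
        if match is None:
            continue  # not a metric-collector "not complete" entry
        secs = float(match.group("secs"))
        if secs >= threshold_secs:
            name = match.group("metric")
            worst[name] = max(secs, worst.get(name, 0.0))
    return worst
```

A run time that keeps growing by about 60 seconds per entry, as in the three lines above, suggests that the same execution never finished rather than that each run is slow.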
# systemctl --no-pager status nsx-sha
├─nsx-sha.service
│ ├─ 1095 /bin/sh /opt/vmware/nsx-netopa/bin/sha_watchdog.sh -s nsx-sha -q 100 -t 1000 -b /var/run/vmware/nsx-sha/watchdog-nsx-sha.BG.PID /opt/vmware/nsx-netopa/bin/nsx-sha /var/run/vmware/nsx-sha/watchdog-nsx-sha.BG.PID
│ ├─ 2220 sleep 1
│ ├─ 9226 /opt/vmware/nsx-netopa/libexec/python-3/bin/python3 /opt/vmware/nsx-netopa/bin/agent.py
│ ├─ 9228 /opt/vmware/nsx-netopa/libexec/python-3/bin/python3 /opt/vmware/nsx-netopa/bin/agent.py
│ ├─ 9086 sudo /usr/bin/host <MANAGER_FQDN>
│ └─ 9087 /usr/bin/host <MANAGER_FQDN>
nslookup and dig, run from the root shell of the Edge, correctly resolve the NSX Manager IP from the FQDN and vice versa.
VMware NSX 4.x
VMware NSX-T 3.x
The issue occurs when the metrics script (manager_fqdn_lookup_failure_status.py), which runs on the Edge, gets stuck and fails to complete. The logs indicate that its execution time keeps increasing each minute.
If the script’s execution time exceeds the check interval, the metric collector can enter a hung state. When this happens, the collector stops triggering new metric collections, leading to FQDN Lookup Failure alarms in the NSX UI that cannot be cleared.
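The hang can be pictured with a minimal sketch. This is not the nsx-sha implementation. It only assumes, based on the "<Future at ... state=running>" text in the log, that each metric run is tracked as a Python future and that a new run is skipped while the previous one is unfinished. The class MetricCollector and its tick method are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class MetricCollector:
    """Hypothetical collector that allows one in-flight run per metric."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._future = None    # future of the most recent run, if any
        self._started = None   # monotonic timestamp of that run's start

    def tick(self, job):
        """Call once per check interval; return True if a new run was started."""
        if self._future is not None and not self._future.done():
            # Previous run is still going: log and skip instead of starting another.
            elapsed = time.monotonic() - self._started
            print(f"last execution {self._future} not complete, running for {elapsed} secs")
            return False
        self._started = time.monotonic()
        self._future = self._pool.submit(job)
        return True

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

Under this assumed design, a job that never returns blocks every later tick, so the metric is never refreshed and an alarm derived from it cannot change state until the process is restarted.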
The metric-collector WARNING entries quoted earlier show the script running indefinitely: the reported run time grows by about 60 seconds with each entry.
This issue is resolved in VMware NSX 3.2.3, available at Broadcom Downloads. The metric collector hang does not occur in NSX 4.0 (including 4.0.1 and 4.0.2) or later.
Workaround #1 - Restart the nsx-sha service:
Restart the nsx-sha service on the affected node, which triggers metric collection again:
/etc/init.d/nsx-sha restart
Workaround #2 - Reboot the Edge:
The Edge node periodically runs /usr/bin/dig to resolve the NSX Manager FQDN. If dig is unsuccessful, it attempts to resolve using the commands below, in sequence:
runuser -m nsx-sha -c "sudo /usr/bin/dig nsx-mngr-01.example.com"
Note: A failure in this manual test does not necessarily indicate a real issue.
Set the nsx-sha log level to debug and review syslog:
/opt/vmware/nsx-netopa/bin/sha-appctl -c set_log_level --level debug
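Because the symptom here is a lookup that never returns rather than one that fails, it can help to run the manual lookup test under a time limit. The following is a minimal sketch. The helper name run_lookup and the 10-second default are assumptions, and the command list passed in is whatever lookup command is being tested (for example the dig invocation shown earlier).

```python
import subprocess


def run_lookup(cmd, timeout_secs=10.0):
    """Run a lookup command and classify the outcome as 'ok', 'failed', or 'timeout'."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_secs
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this exception is raised.
        return "timeout"
    # The exit status only reflects the command itself; inspect the output
    # separately to confirm that an address was actually returned.
    return "ok" if proc.returncode == 0 else "failed"
```

For example, run_lookup(["/usr/bin/dig", "nsx-mngr-01.example.com"]) separates a command that hangs from one that returns, which is the distinction that matters for this issue.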