Log entries in /var/log/syslog on the Edge node show the manager_fqdn_lookup_failure_status.py script (which checks NSX Manager FQDN resolution) running indefinitely:
Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="CRITICAL" eventFeatureName="communication" eventType="manager_fqdn_lookup_failure" eventSev="critical" eventState="On"] DNS lookup failed for Manager node <UUID> with FQDN <MANAGER_FQDN> and the publish_fqdns flag was set.
Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3832996.569242999 secs
Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3833056.6301390156 secs
Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3833116.698533103 secs
# systemctl --no-pager status nsx-sha
├─nsx-sha.service
│ ├─ 1095 /bin/sh /opt/vmware/nsx-netopa/bin/sha_watchdog.sh -s nsx-sha -q 100 -t 1000 -b /var/run/vmware/nsx-sha/watchdog-nsx-sha.BG.PID /opt/vmware/nsx-netopa/bin/nsx-sha /var/run/vmware/nsx-sha/watchdog-nsx-sha.BG.PID
│ ├─ 2220 sleep 1
│ ├─ 9226 /opt/vmware/nsx-netopa/libexec/python-3/bin/python3 /opt/vmware/nsx-netopa/bin/agent.py
│ ├─ 9228 /opt/vmware/nsx-netopa/libexec/python-3/bin/python3 /opt/vmware/nsx-netopa/bin/agent.py
│ ├─ 9086 sudo /usr/bin/host <MANAGER_FQDN>
│ └─ 9087 /usr/bin/host <MANAGER_FQDN>
nslookup and dig from the root shell of the Edge correctly resolve the NSX Manager IP from the FQDN and vice versa.

VMware NSX-T Data Center 3.x
The issue occurs when the metrics script (manager_fqdn_lookup_failure_status.py) that runs on the Edge gets stuck and never completes. The logs show its reported execution time increasing by roughly a minute at each check.
If the script’s execution time exceeds the check interval, the metric collector can enter a hung state. When this happens, the collector stops triggering new metric collections, leading to FQDN Lookup Failure alarms in the NSX UI that cannot be cleared.
Once the metric collector is stuck, the FQDN lookup does not execute again, no further logs related to the FQDN lookup are generated, and the alarm persists.
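For illustration only (this is a simplified sketch, not the actual nsx-sha source), the following Python fragment shows how a periodic collector built on concurrent.futures can wedge in exactly this way: if the previously submitted lookup never finishes, each subsequent tick only logs a "last execution not complete" warning and no new collection is started.

import concurrent.futures
import subprocess
import time

# Hypothetical stand-in for the FQDN lookup metric; the real check lives in
# manager_fqdn_lookup_failure_status.py and may differ.
def collect_fqdn_metric(fqdn):
    # No timeout: if the resolver call hangs, this function never returns.
    subprocess.run(["/usr/bin/host", fqdn], capture_output=True)

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
last_future = None
started_at = None

while True:
    if last_future is not None and not last_future.done():
        # Mirrors the syslog warning: while the previous run is still
        # "running", the collector refuses to start a new one, so a single
        # stuck lookup blocks all further metric collections.
        print("Metric last execution %r not complete, running for %.2f secs"
              % (last_future, time.monotonic() - started_at))
    else:
        started_at = time.monotonic()
        last_future = executor.submit(collect_fqdn_metric, "<MANAGER_FQDN>")
    time.sleep(60)  # check interval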
This issue is resolved in VMware NSX-T Data Center 3.2.3 available at Broadcom Downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.
Workaround 1: Restart the nsx-sha service:
Restart the nsx-sha service on the Edge node to trigger metric collection again:
/etc/init.d/nsx-sha restart
Workaround 2: Reboot the Edge node.
The Edge node periodically runs /usr/bin/dig to resolve the NSX Manager FQDN. If dig is unsuccessful, it attempts to resolve the name using the commands below in sequence:
runuser -m nsx-sha -c "sudo /usr/bin/dig nsx-mngr-01.example.com"
Note: A failure in this manual test does not necessarily indicate a real issue.

To troubleshoot further, enable debug logging for nsx-sha and review syslog:
/opt/vmware/nsx-netopa/bin/sha-appctl -c set_log_level --level debug
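As a general defensive pattern (again an illustrative sketch, not NSX code), wrapping the external resolver call in a timeout keeps a stuck lookup from blocking the caller indefinitely, which is the failure mode described above. The helper resolve_manager_fqdn and the 30-second timeout are hypothetical:

import subprocess

def resolve_manager_fqdn(fqdn, timeout_secs=30):
    # Hypothetical helper: attempt the lookup but never block indefinitely.
    try:
        result = subprocess.run(
            ["/usr/bin/dig", "+short", fqdn],
            capture_output=True, text=True, timeout=timeout_secs,
        )
        return result.returncode == 0 and result.stdout.strip() != ""
    except subprocess.TimeoutExpired:
        # Treat a hung lookup as a failure instead of letting the caller
        # wait on it forever.
        return False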