Log entries in /var/log/syslog on the Edge node show the manager_fqdn_lookup_failure_status.py script (which checks NSX Manager FQDN resolution) running indefinitely:
Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="CRITICAL" eventFeatureName="communication" eventType="manager_fqdn_lookup_failure" eventSev="critical" eventState="On"] DNS lookup failed for Manager node <UUID> with FQDN <MANAGER_FQDN> and the publish_fqdns flag was set.
Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3832996.569242999 secs
Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3833056.6301390156 secs
Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3833116.698533103 secs
# systemctl --no-pager status nsx-sha
├─nsx-sha.service
│ ├─ 1095 /bin/sh /opt/vmware/nsx-netopa/bin/sha_watchdog.sh -s nsx-sha -q 100 -t 1000 -b /var/run/vmware/nsx-sha/watchdog-nsx-sha.BG.PID /opt/vmware/nsx-netopa/bin/nsx-sha /var/run/vmware/nsx-sha/watchdog-nsx-sha.BG.PID
│ ├─ 2220 sleep 1
│ ├─ 9226 /opt/vmware/nsx-netopa/libexec/python-3/bin/python3 /opt/vmware/nsx-netopa/bin/agent.py
│ ├─ 9228 /opt/vmware/nsx-netopa/libexec/python-3/bin/python3 /opt/vmware/nsx-netopa/bin/agent.py
│ ├─ 9086 sudo /usr/bin/host <MANAGER_FQDN>
│ └─ 9087 /usr/bin/host <MANAGER_FQDN>
nslookup and dig from the root shell of the Edge correctly resolve the NSX Manager IP from the FQDN and vice versa.

VMware NSX-T Data Center 3.x
The issue occurs when the metrics script (manager_fqdn_lookup_failure_status.py) that runs on the Edge gets stuck and never completes. The logs show its reported execution time increasing by roughly a minute at each check.
If the script’s execution time exceeds the check interval, the metric collector can enter a hung state. When this happens, the collector stops triggering new metric collections, leading to FQDN Lookup Failure alarms in the NSX UI that cannot be cleared.
Once the metric collector is stuck, the FQDN lookup does not execute again, no further logs related to the FQDN lookup are generated, and the alarm persists.
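For illustration only (this is a simplified sketch, not the actual nsx-sha source), the following Python fragment shows how a periodic collector built on concurrent.futures can wedge in exactly this way: if the previously submitted lookup never finishes, each subsequent tick only logs a "last execution not complete" warning and no new collection is started.

import concurrent.futures
import subprocess
import time

# Hypothetical stand-in for the FQDN lookup metric; the real check lives in
# manager_fqdn_lookup_failure_status.py and may differ.
def collect_fqdn_metric(fqdn):
    # No timeout: if the resolver call hangs, this function never returns.
    subprocess.run(["/usr/bin/host", fqdn], capture_output=True)

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
last_future = None
started_at = None

while True:
    if last_future is not None and not last_future.done():
        # Mirrors the syslog warning: while the previous run is still
        # "running", the collector refuses to start a new one, so a single
        # stuck lookup blocks all further metric collections.
        print("Metric last execution %r not complete, running for %.2f secs"
              % (last_future, time.monotonic() - started_at))
    else:
        started_at = time.monotonic()
        last_future = executor.submit(collect_fqdn_metric, "<MANAGER_FQDN>")
    time.sleep(60)  # check interval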
This issue is resolved in VMware NSX-T Data Center 3.2.3 available at Broadcom Downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.
Workaround 1: Restart the nsx-sha service:
Restart the nsx-sha service on the Edge node to trigger metric collection again:
/etc/init.d/nsx-sha restart
Workaround 2: Reboot the Edge node.
The Edge node periodically runs /usr/bin/dig to resolve the NSX Manager FQDN. If dig is unsuccessful, it attempts to resolve the name using the commands below in sequence:
runuser -m nsx-sha -c "sudo /usr/bin/dig nsx-mngr-01.example.com"
Note: A failure in this manual test does not necessarily indicate a real issue.

To troubleshoot further, enable debug logging for nsx-sha and review syslog:
/opt/vmware/nsx-netopa/bin/sha-appctl -c set_log_level --level debug
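As a general defensive pattern (again an illustrative sketch, not NSX code), wrapping the external resolver call in a timeout keeps a stuck lookup from blocking the caller indefinitely, which is the failure mode described above. The helper resolve_manager_fqdn and the 30-second timeout are hypothetical:

import subprocess

def resolve_manager_fqdn(fqdn, timeout_secs=30):
    # Hypothetical helper: attempt the lookup but never block indefinitely.
    try:
        result = subprocess.run(
            ["/usr/bin/dig", "+short", fqdn],
            capture_output=True, text=True, timeout=timeout_secs,
        )
        return result.returncode == 0 and result.stdout.strip() != ""
    except subprocess.TimeoutExpired:
        # Treat a hung lookup as a failure instead of letting the caller
        # wait on it forever.
        return False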