Manager FQDN Lookup Failure alarm on NSX UI

Article ID: 345419

Products

VMware NSX Networking

Issue/Introduction

When the manager's FQDN publish state is set to true, the /usr/bin/dig command runs periodically to resolve the manager's FQDN.
The metric collector collects the result of this lookup at regular intervals and reports it to the whiteboard. Based on that result, the alarm state is set to True (FQDN lookup failed) or False (FQDN lookup succeeded).

If the FQDN lookup has failed and the metric collector is in a hung state, the alarm appears on the NSX UI and stays there.
The failed FQDN lookup raises the alarm. The metric collector then hangs on a long-running execution, so the FQDN lookup is not executed again and the alarm cannot be cleared.


Symptoms:
  • The Manager FQDN Lookup Failure alarm appears on the NSX UI and cannot be cleared.
  • When the alarm is manually set to Resolved, it soon reappears on the UI in the Open state.
  • The nslookup, host and dig commands resolve the manager's FQDN correctly when run manually. This can be verified with:
nslookup <nsx-manager-FQDN>
dig <nsx-manager-FQDN>
host <nsx-manager-FQDN>

 
NOTE:
The "dig" command was introduced in version 3.2.3. Depending on the code version, the edge nodes can use
/usr/bin/getent hosts or nslookup to resolve the FQDN if dig is not present.
The priority of the commands is: dig, nslookup, then "/usr/bin/getent hosts". "host" is not used from 3.2.3 onwards.

ESXi hosts use nslookup.
 
  • When the node resolves the manager's FQDN, it runs the lookup as the user nsx-sha.
  • The same lookup also works when run manually:
runuser -m nsx-sha -c "sudo /usr/bin/dig nsx-mngr-01.corp.local"
Note: a failure of this command does not by itself indicate a real issue.
The exact command can be verified from syslog after enabling DEBUG level logging for nsx-sha: 
/opt/vmware/nsx-netopa/bin/sha-appctl -c set_log_level --level debug
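
The resolver priority described in the NOTE above (dig, then nslookup, then getent hosts) can be sketched as a small shell function. This is only an illustration of the fallback order, not the actual nsx-sha implementation. The function name pick_resolver is hypothetical, and the sketch assumes that "present" simply means the tool is found on PATH.

```shell
#!/bin/sh
# Hypothetical helper (not the nsx-sha implementation): print the first
# lookup tool found on PATH, in the priority order from the NOTE above.
pick_resolver() {
    if command -v dig >/dev/null 2>&1; then
        echo "dig"
    elif command -v nslookup >/dev/null 2>&1; then
        echo "nslookup"
    elif command -v getent >/dev/null 2>&1; then
        echo "getent hosts"
    else
        echo "no lookup tool found" >&2
        return 1
    fi
}

# Example use (FQDN taken from the article's own example):
#   resolver=$(pick_resolver) && $resolver nsx-mngr-01.corp.local
```

Running the function on a node shows which tool would be tried first under this assumed order, which helps when comparing a manual lookup with what the node is likely to execute.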


Environment

VMware NSX-T

Cause

When the execution time of a metric exceeds its check interval, the metric collector can go into a hung state and may not trigger any further metric collection:
 
syslog.20.gz:2023-10-30T10:30:16.088Z nsx-mngr-01.corp.local NSX 1101 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp=metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at 0x69dec02116d0 state=running> not complete, running for 60.06313226791099 secs

This results in FQDN Lookup Failure alarms on the NSX UI that cannot be cleared.
The log above shows the metric's last execution still in the running state. After this entry, no further metric collector logs for the FQDN lookup appear.
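
To check whether a node has hit this condition, the syslog can be searched for the warning shown above. The following is a minimal sketch: the function name find_fqdn_lookup_hangs is hypothetical, and the matching relies only on the metric name and the "not complete, running for N secs" wording seen in the sample log line. Other versions may log the message differently.

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on stdin and, for each warning that
# the FQDN lookup metric's last execution is still running, print the first
# field of the line (the timestamp, possibly prefixed by a file name) and
# the reported running time in seconds.
find_fqdn_lookup_hangs() {
    awk '/manager-fqdn-lookup-failure-status/ && /not complete/ {
        for (i = 1; i < NF; i++)
            if ($i == "running" && $(i + 1) == "for")
                print $1, $(i + 2)
    }'
}

# Example use (the log directory is an assumption; adjust to the node):
#   zcat -f /var/log/syslog* | find_fqdn_lookup_hangs
```

If the most recent match is old and no later metric collector entries exist for the FQDN lookup, that is consistent with the hung state described in this section.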

Resolution

The metric collector hang issue does not exist in 4.0, including 4.0.1 and 4.0.2.
The fix will also be included in NSX version 3.2.4.

Workaround:
Restart the nsx-sha service on the affected node, which triggers metric collection again:
/etc/init.d/nsx-sha restart

Additional Information

Impact/Risks:
The "Manager FQDN Lookup Failure" alarms remain in the Open state on the NSX UI.