The /usr/bin/dig command runs periodically to resolve the manager's FQDN when the manager's FQDN publish state is set to true.
The metric collector collects the output of this lookup at regular intervals and reports it to the whiteboard. Based on that report, the alarm state is set to True (if the FQDN lookup fails) or False (if the FQDN lookup succeeds).
If the FQDN lookup has failed and the metric-collector is in a hung state, the alarm appears on the NSX UI.
The FQDN lookup failure raises the alarm. A long-running execution then leaves the metric collector hung, so the FQDN lookup does not run again and the alarm cannot be cleared.
- The Manager FQDN Lookup Failure alarm appears on the NSX UI and cannot be cleared.
- When the alarm is manually set to resolved, it soon reappears on the UI in an Open state.
- The nslookup, host and dig commands resolve the manager's FQDN correctly when run manually. This can be verified with:
nslookup <nsx-manager-FQDN>
dig <nsx-manager-FQDN>
host <nsx-manager-FQDN>
NOTE:
The dig command was introduced in NSX-T Data Center 3.2.3. Depending on the code version, the edge nodes can use /usr/bin/getent hosts or nslookup to resolve the FQDN if dig is not present.
The priority of the commands is: dig, nslookup, then /usr/bin/getent hosts. The hosts file is not used in NSX-T Data Center 3.2.3 or higher.
ESXi hosts use nslookup.
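The priority order in the note above can be sketched as a small shell function that picks the first resolver present on the node. This is only an illustration of the stated order, not the actual nsx-sha logic; the function name pick_resolver is hypothetical.

```shell
# Hypothetical helper: print the first resolver command available on this
# node, following the documented priority (dig, nslookup, getent hosts).
# This is an illustration only, not how nsx-sha selects the command.
pick_resolver() {
    if [ -x /usr/bin/dig ]; then
        echo "/usr/bin/dig"
    elif command -v nslookup >/dev/null 2>&1; then
        echo "nslookup"
    elif [ -x /usr/bin/getent ]; then
        echo "/usr/bin/getent hosts"
    else
        # None of the three commands is present.
        return 1
    fi
}

# Example usage (left commented out so that sourcing this file has no side effects):
#   resolver=$(pick_resolver) && $resolver nsx-mngr-01.example.com
```

The variable is left unquoted in the usage line on purpose, so that "/usr/bin/getent hosts" splits into the command and its database argument.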
- When the node tries to resolve the manager's FQDN, it runs the lookup as the nsx-sha user.
- Manual resolution as the same user also works:
runuser -m nsx-sha -c "sudo /usr/bin/dig nsx-mngr-01.example.com"
NOTE: Failure of this command does not mean there is a real issue.
The exact command can be verified from syslog after enabling DEBUG level logging for nsx-sha:
/opt/vmware/nsx-netopa/bin/sha-appctl -c set_log_level --level debug
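Once debug logging is enabled, the logged lookup command can be searched for in syslog. The following is a minimal sketch under two assumptions that are not confirmed here: that the log file is /var/log/syslog, and that the logged line contains the resolver name. The helper name find_lookup_commands is hypothetical.

```shell
# Hypothetical helper: show the most recent log lines that mention one of the
# resolver commands as a whole word. The default path /var/log/syslog is an
# assumption; pass a different file as the first argument if needed.
find_lookup_commands() {
    log_file="${1:-/var/log/syslog}"
    # -w matches whole words only, so "dig" does not match "digest".
    grep -Ew 'dig|nslookup|getent hosts' "$log_file" | tail -n 20
}

# Example usage (commented out):
#   find_lookup_commands /var/log/syslog
```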