Manager FQDN Lookup Failure alarm on NSX UI

Article ID: 345419

Products

VMware NSX

Issue/Introduction

  • NSX UI Alarm

    DNS lookup failed for Manager node {entity_id} with FQDN {appliance_fqdn} and the publish_fqdns flag was set. Manually resolving the alarm does not clear it: the alarm returns to the Open state.

  • Edge log entries in /var/log/syslog

<DATE>T17:15:10.555Z Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="CRITICAL" eventFeatureName="communication" eventType="manager_fqdn_lookup_failure" eventSev="critical" eventState="On"] DNS lookup failed for Manager node <UUID> with FQDN <MANAGER_FQDN> and the publish_fqdns flag was set.

<DATE>T17:55:19.212Z Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3832996.569242999 secs

<DATE>T17:56:19.273Z Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3833056.6301390156 secs
<DATE>T17:57:19.341Z Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3833116.698533103 secs

  • The nsx-sha service on the Edge is stuck resolving the Manager host name. Its process tree shows a host lookup that never exits:

# systemctl --no-pager status nsx-sha

             ├─nsx-sha.service
             │ ├─ 1095 /bin/sh /opt/vmware/nsx-netopa/bin/sha_watchdog.sh -s nsx-sha -q 100 -t 1000 -b /var/run/vmware/nsx-sha/watchdog-nsx-sha.BG.PID /opt/vmware/nsx-netopa/bin/nsx-sha /var/run/vmware/nsx-sha/watchdog-nsx-sha.BG.PID
             │ ├─ 2220 sleep 1
             │ ├─ 9226 /opt/vmware/nsx-netopa/libexec/python-3/bin/python3 /opt/vmware/nsx-netopa/bin/agent.py
             │ ├─ 9228 /opt/vmware/nsx-netopa/libexec/python-3/bin/python3 /opt/vmware/nsx-netopa/bin/agent.py
             │ ├─ 9086 sudo /usr/bin/host <MANAGER_FQDN>
             │ └─ 9087 /usr/bin/host <MANAGER_FQDN>

  • Running nslookup and dig manually from the root shell of the Edge succeeds: forward lookups resolve the NSX Manager FQDN to its IP address, and reverse lookups resolve the IP address back to the FQDN.
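
The stuck lookup seen in the process tree above can also be spotted from a plain process listing. The following is a minimal sketch, assuming a procps-style ps that supports the etimes (elapsed seconds) column. The function name find_stuck_lookups and the 60-second default threshold are illustrative choices, not NSX settings.

```shell
#!/bin/sh
# Hypothetical helper (not part of NSX): reads "PID ELAPSED_SECONDS COMMAND..."
# lines on stdin and prints the PID of every /usr/bin/host lookup that has
# been running longer than the given threshold.
find_stuck_lookups() {
    threshold="${1:-60}"   # assumed threshold in seconds
    awk -v max="$threshold" '
        # $2 is the elapsed time in seconds; the regex is matched against
        # the whole line, so "sudo /usr/bin/host ..." wrappers match as well.
        ($2 + 0) > max && /\/usr\/bin\/host / { print $1 }
    '
}
```

On an Edge root shell this could be fed with `ps -eo pid,etimes,args | find_stuck_lookups 60`. A healthy lookup normally finishes within seconds, so any PID printed here is a candidate for the hung state described in this article.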

Environment

VMware NSX 4.x
VMware NSX-T 3.x

Cause

The issue occurs when the metrics script (manager_fqdn_lookup_failure_status.py), which runs on the Edge, gets stuck and fails to complete. The logs indicate that its execution time keeps increasing each minute.

If the script’s execution time exceeds the check interval, the metric collector can enter a hung state. When this happens, the collector stops triggering new metric collections, leading to FQDN Lookup Failure alarms in the NSX UI that cannot be cleared.

Example log entry showing the script running indefinitely:

syslog.20.gz:2023-10-30T10:30:16.088Z nsx-mngr-01.example.com NSX 1101 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at 0x69dec02116d0 state=running> not complete, running for 60.06313226791099 secs

Once the metric collector is stuck, the FQDN lookup does not run again. No further FQDN lookup log entries are generated, and the alarm persists.
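
The symptom described above can be checked from the Edge syslog: if the reported running time of the metric keeps growing far past the check interval, the collector is likely hung. The following is a rough sketch, assuming the log format shown in this article. The helper name max_fqdn_metric_runtime is hypothetical, and the roughly 60-second interval is inferred from the one-minute spacing of the log entries rather than from NSX documentation.

```shell
#!/bin/sh
# Hypothetical helper (not part of NSX): reads syslog lines on stdin and
# prints the largest "running for N secs" value, truncated to whole seconds,
# reported for the manager FQDN lookup metric. Prints nothing if no such
# line is found.
max_fqdn_metric_runtime() {
    awk '
        /manager-fqdn-lookup-failure-status/ && /not complete, running for/ {
            # The duration is the field that follows the word "for".
            for (i = 1; i < NF; i++)
                if ($i == "for" && ($(i + 1) + 0) > max)
                    max = $(i + 1) + 0
        }
        END { if (max > 0) printf "%d\n", max }
    '
}
```

For example, `max_fqdn_metric_runtime < /var/log/syslog` printing a value in the thousands or millions of seconds, as in the log excerpts in this article, suggests the script never returned.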

Resolution

This issue is resolved in VMware NSX 3.2.3, available at Broadcom Downloads. The metric collector hang does not occur in NSX 4.0 (including 4.0.1 and 4.0.2) or later.

Workaround 1 - Restart the nsx-sha service:

Restart the nsx-sha service on the affected node. This triggers metric collection again:
/etc/init.d/nsx-sha restart

Workaround 2 - Reboot the Edge:

  1. Place the Edge in maintenance mode:
       - System -> Fabric -> Nodes, select the Edge and then Actions -> Enter NSX Maintenance Mode
  2. Reboot the Edge
  3. Exit maintenance mode:
       - Select the Edge and then Actions -> Exit NSX Maintenance Mode

Additional Information

  • The Edge node periodically runs /usr/bin/dig to resolve the NSX Manager FQDN. If dig is unsuccessful, it falls back to the next command in the sequence below:

    dig <nsx-manager-FQDN>
    nslookup <nsx-manager-FQDN>
    host <nsx-manager-FQDN>
     
  • When the Edge node attempts to resolve the FQDN, it runs the command as the nsx-sha user. You can test this manually using the command below:
    runuser -m nsx-sha -c "sudo /usr/bin/dig nsx-mngr-01.example.com"

    Note: A failure in this manual test does not necessarily indicate a real issue.

  • To check the exact command used by the system, enable DEBUG logging for nsx-sha and review syslog:
    /opt/vmware/nsx-netopa/bin/sha-appctl -c set_log_level --level debug
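
The fallback order described above can be reproduced by hand, with a time limit so that a hanging lookup cannot block the shell the way it blocks the metric script. This is only an approximation of the behaviour described in this article, not the actual NSX script. The names try_lookup and resolve_with_fallback are hypothetical, the 5-second limit is arbitrary, and the sketch assumes the coreutils timeout command is available on the node. It runs as the current user rather than as nsx-sha.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the command exits 0 within the time
# limit AND prints something on stdout.
try_lookup() {
    tl_limit="$1"
    shift
    tl_out=$(timeout "$tl_limit" "$@" 2>/dev/null) && [ -n "$tl_out" ]
}

# Tries dig, nslookup and host in that order and prints the name of the
# first tool that resolves the FQDN, or "none" if all of them fail.
resolve_with_fallback() {
    fqdn="$1"
    limit="${2:-5}"   # assumed per-command limit in seconds
    # "dig +short" prints nothing when there is no answer, so the
    # non-empty-output check in try_lookup catches failed dig lookups.
    try_lookup "$limit" dig +short "$fqdn" && { echo dig; return 0; }
    try_lookup "$limit" nslookup "$fqdn" && { echo nslookup; return 0; }
    try_lookup "$limit" host "$fqdn" && { echo host; return 0; }
    echo none
    return 1
}
```

Calling `resolve_with_fallback nsx-mngr-01.example.com` shows which tool, if any, resolves the name within the limit. A result other than dig hints that the earlier tools in the sequence are failing or timing out on this node.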