NSX Alarm DNS lookup failed

Article ID: 383824

Products

VMware NSX

Issue/Introduction

  • The NSX UI raises the following alarm:

    DNS lookup failed for Manager node {entity_id} with FQDN {appliance_fqdn} and the publish_fqdns flag was set.

  • The Edge log /var/log/syslog shows the alarm event, followed by a warning that repeats every minute:

<DATE>T17:15:10.555Z Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="CRITICAL" eventFeatureName="communication" eventType="manager_fqdn_lookup_failure" eventSev="critical" eventState="On"] DNS lookup failed for Manager node <UUID> with FQDN <MANAGER_FQDN> and the publish_fqdns flag was set.

<DATE>T17:55:19.212Z Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3832996.569242999 secs

<DATE>T17:56:19.273Z Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3833056.6301390156 secs
<DATE>T17:57:19.341Z Edge NSX 9226 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="metric-collector"] Metric nsx.communication.manager-fqdn-lookup-failure-status last execution <Future at hex_value state=running> not complete, running for 3833116.698533103 secs

  • The nsx-sha service on the Edge is stuck querying the Manager host name. The service status shows lingering host lookup processes:

# systemctl --no-pager status nsx-sha

             ├─nsx-sha.service
             │ ├─ 1095 /bin/sh /opt/vmware/nsx-netopa/bin/sha_watchdog.sh -s nsx-sha -q 100 -t 1000 -b /var/run/vmware/nsx-sha/watchdog-nsx-sha.BG.PID /opt/vmware/nsx-netopa/bin/nsx-sha /var/run/vmware/nsx-sha/watchdog-nsx-sha.BG.PID
             │ ├─ 2220 sleep 1
             │ ├─ 9226 /opt/vmware/nsx-netopa/libexec/python-3/bin/python3 /opt/vmware/nsx-netopa/bin/agent.py
             │ ├─ 9228 /opt/vmware/nsx-netopa/libexec/python-3/bin/python3 /opt/vmware/nsx-netopa/bin/agent.py
             │ ├─ 9086 sudo /usr/bin/host <MANAGER_FQDN>
             │ └─ 9087 /usr/bin/host <MANAGER_FQDN>

  • Running nslookup and dig manually from the root shell of the Edge correctly resolves the NSX Manager FQDN to its IP address, and the IP address back to the FQDN.
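
The forward and reverse check described in the last symptom can also be scripted. The following is a minimal sketch, not part of NSX, and assumes it is run with Python 3 on a machine that uses the same DNS servers as the Edge. The function name check_forward_and_reverse and its forward and reverse parameters are hypothetical, introduced only for illustration.

```python
import socket


def check_forward_and_reverse(fqdn, forward=socket.gethostbyname_ex,
                              reverse=socket.gethostbyaddr):
    """Resolve fqdn to IPv4 addresses, then resolve each address back to names.

    forward and reverse default to the standard-library resolvers; they are
    parameters only so the function can be exercised without a DNS server.
    """
    _, _, addresses = forward(fqdn)
    results = {}
    for address in addresses:
        try:
            name, aliases, _ = reverse(address)
            results[address] = [name] + list(aliases)
        except OSError:
            # No reverse record, or the resolver returned an error.
            results[address] = []
    return results
```

Calling check_forward_and_reverse with the Manager FQDN returns a dictionary that maps each resolved address to the names found by the reverse lookup. If every address maps back to the expected FQDN, DNS itself is healthy, which matches the symptom above: the alarm comes from the stuck lookup on the Edge, not from a DNS failure.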

Environment

VMware NSX 4.x
VMware NSX-T 3.x

Cause

The metrics script manager_fqdn_lookup_failure_status.py, which runs on the Edge, is stuck and never completes. The "running for" value in the syslog warnings increases by about 60 seconds with each message, one message per minute.
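
To see how long the script has been stuck, the "running for" values can be pulled from the syslog warnings. The sketch below is illustrative only: the regular expression is derived from the three sample lines in this article, and the function name stuck_durations is hypothetical.

```python
import re

# Matches the duration reported by the nsx-sha metric-collector warnings.
RUNNING_FOR = re.compile(
    r"Metric (?P<metric>\S+) last execution .* not complete, "
    r"running for (?P<secs>\d+(?:\.\d+)?) secs"
)


def stuck_durations(lines):
    """Return (metric, seconds) for each 'not complete' warning in lines."""
    found = []
    for line in lines:
        match = RUNNING_FOR.search(line)
        if match:
            found.append((match.group("metric"), float(match.group("secs"))))
    return found


# Message portions of the three warnings shown in this article.
prefix = ("Metric nsx.communication.manager-fqdn-lookup-failure-status "
          "last execution <Future at hex_value state=running> not complete, ")
sample = [
    prefix + "running for 3832996.569242999 secs",
    prefix + "running for 3833056.6301390156 secs",
    prefix + "running for 3833116.698533103 secs",
]

durations = [secs for _, secs in stuck_durations(sample)]
# Difference between consecutive warnings, rounded to whole seconds.
print([round(b - a) for a, b in zip(durations, durations[1:])])  # [60, 60]
# Total time the metric has been running, in days.
print(f"{durations[-1] / 86400:.1f} days")  # 44.4 days
```

For the sample log lines, the metric has been running for roughly 44 days and the value grows by 60 seconds per message, so the lookup is hung rather than slow.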

Resolution

This issue is resolved in VMware NSX 3.2.3, available at Broadcom Downloads.

To work around this issue, reboot the Edge:

1. Place the Edge in maintenance mode.
    Go to System -> Fabric -> Nodes, select the Edge, and then choose Actions -> Enter NSX Maintenance Mode.
2. Reboot the Edge.
3. Exit maintenance mode.
    Select the Edge, and then choose Actions -> Exit NSX Maintenance Mode.
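
To confirm the symptom before the reboot, or to check that it has cleared afterwards, the process table can be scanned for long-running /usr/bin/host lookups like the ones in the nsx-sha status output above. This is a hedged sketch that assumes a procps-style ps supporting the etimes column. The function names and the 60-second threshold are hypothetical and not taken from NSX.

```python
import subprocess


def lingering_host_lookups(ps_output, min_age_secs=60):
    """Return (pid, age_secs, command) for old /usr/bin/host processes.

    ps_output is the text printed by `ps -e -o pid= -o etimes= -o args=`.
    """
    stuck = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        pid, age, command = int(parts[0]), int(parts[1]), parts[2]
        # Matches both `host <fqdn>` and the `sudo host <fqdn>` wrapper.
        if "/usr/bin/host" in command.split() and age >= min_age_secs:
            stuck.append((pid, age, command))
    return stuck


def read_process_table():
    """Collect PID, elapsed seconds, and command line for every process."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "args="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling lingering_host_lookups(read_process_table()) returns an empty list when no stale lookup is present. A host process that has been running for minutes or longer matches the stuck state described in this article.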