NSX Alarm raised by previously orphaned ESXi Transport Node doesn't clear when the Node is reintroduced.

Article ID: 428352

Updated On:

Products

VMware NSX

Issue/Introduction

  • An alarm is raised by an ESXi Transport Node that is subsequently orphaned.
  • The host is reintroduced to NSX without first being unprepared, and it receives a new Transport Node UUID.
  • Log lines similar to the following appear in /var/log/syslog on the NSX Manager.
    They indicate that the alarm is raised against the previous UUID.
    NSX 4000 MONITORING [nsx@6876 alarmId="########-####-####-####-############" alarmState="OPEN" comp="nsx-manager" entId="########-####-####-####-############" eventFeatureName="tep_health" eventSev="MEDIUM" eventState="On" eventType="faulty_tep" level="WARNING" nodeId="########-####-####-####-############" subcomp="monitoring"] TEP:vmk10 of VDS:VDSServicio at Transport node:########-####-####-####-########b1ae (old UUID). Overlay workloads using this TEP will face network outage. Reason: all BFD tunnels from TEP are down.

    NSX 4000 MONITORING [nsx@6876 alarmId="5766f735-f4d2-4a01-bce1-c0c2d5013a72" alarmState="OPEN" comp="nsx-manager" entId="########-####-####-####-############" eventFeatureName="tep_health" eventSev="MEDIUM" eventState="On" eventType="faulty_tep" level="WARNING" nodeId="########-####-####-####-############" subcomp="monitoring"] TEP:vmk11 of VDS:VDSServicio at Transport node:########-####-####-####-########b1ae (old UUID). Overlay workloads using this TEP will face network outage. Reason: all BFD tunnels from TEP are down.
  • There is no data plane impact observed for VMs deployed on the host. 
  • The same alarm with the new Transport Node UUID is not found in the NSX Manager syslog. 

 Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
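 To check which Transport Node UUID an open alarm refers to, the syslog can be filtered on the fields shown in the excerpts above. The following is a minimal sketch, not an official NSX tool: the function name find_tep_alarms and its arguments are hypothetical, and it assumes the log lines keep the eventType="faulty_tep", alarmState="OPEN", and "Transport node:<UUID>" text seen in the examples.

```shell
# Hypothetical helper (not part of NSX): print OPEN faulty_tep alarm lines
# whose message names the given Transport Node UUID.
#   $1 = Transport Node UUID to look for (for example, the old UUID)
#   $2 = log file to search; defaults to /var/log/syslog on the NSX Manager
find_tep_alarms() {
    tn_uuid="$1"
    log_file="${2:-/var/log/syslog}"
    # -F treats each pattern as a fixed string, so quotes and dashes are literal.
    grep -F 'eventType="faulty_tep"' "$log_file" \
        | grep -F 'alarmState="OPEN"' \
        | grep -F "Transport node:${tn_uuid}"
}
```

 In the scenario described here, calling find_tep_alarms with the old UUID would be expected to print matching lines, while calling it with the new UUID would print nothing.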

Cause

The entityId for the alarm is generated from the Transport Node UUID. When the Transport Node UUID changes, all alarms generated from the previous Transport Node UUID should be cleared automatically. However, the cfgAgent process is not restarted when the host is reintroduced, so its internal alarm cache still holds the alarm raised against the older entityId.

Resolution

This is a known issue impacting VMware NSX.

 

Workaround:

  • SSH to the ESXi host as root.
  • Run the command:
    /etc/init.d/nsx-cfgAgent restart
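
The two steps above can also be run from an administrative workstation in a single command. This is a sketch only: the function name restart_cfgagent and the host name in the usage line are placeholders, and it assumes SSH is enabled on the ESXi host and root login is permitted.

```shell
# Hypothetical wrapper: restart the NSX cfgAgent service on one ESXi host over SSH.
#   $1 = ESXi host name or IP address (placeholder; substitute your own)
restart_cfgagent() {
    esxi_host="$1"
    # Runs the same command given in the workaround, remotely as root.
    ssh "root@${esxi_host}" '/etc/init.d/nsx-cfgAgent restart'
}

# Usage (host name is an example only):
#   restart_cfgagent esxi01.example.com
```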

Additional Information

NSX-ALRM-HOST

PR: 3647609

FixedInVersion#:4.2.4