Unable to resolve EAM Status Down alarm in NSX-T

Article ID: 312616

Products

VMware NSX

Issue/Introduction

  • ESX Agent Manager (EAM) Status Down alarms are triggered with no apparent EAM issues.
  • After the alarm is resolved, it returns shortly afterward.
  • An alarm is raised and then resolved by the user. Entries similar to the following are observed on an NSX Manager node in var/log/phonehome-coordinator/phonehome-coordinator.log:
FATAL http-nio-127.0.0.1-7449-exec-3 MonitoringServiceImpl 18560 MONITORING [nsx@6876 alarmId="c#######-####-####-####-#########6c6" alarmState="RESOLVED" comp="nsx-manager" entId="9#######-####-####-####-#########768" errorCode="MP701099" eventFeatureName="endpoint_protection" eventSev="CRITICAL" eventState="Off" eventType="eam_status_down" level="FATAL" nodeId="9#######-####-####-####-#########6c6" subcomp="monitoring"] User resolved.
  • On the same reporting NSX Manager node, a sync request is sent so that the feature side returns the latest status. The following entries are observed in var/log/phonehome-coordinator/phonehome-coordinator.log:
INFO http-nio-127.0.0.1-7449-exec-3 MonitoringFacadeImpl 18560 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="monitoring"] bulkSetAndVerifyAlarmsStatus: setting requires sync for user resolved alarm c#######-####-####-####-#########1e7
  • On a different NSX Manager node, a sync is triggered but fails. The following entries are observed in var/log/phonehome-coordinator/phonehome-coordinator.log:
INFO pool-45-thread-1 MonitoringSyncService 4471 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="monitoring"] Built Sync Request entityId: 9#######-####-####-####-#########768, eventTypeId: 1, featureId: 13, sourceId: proton_eam_service
.
.
.
WARN pool-118-thread-1 MonitoringSyncProcessor 4471 MONITORING [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="monitoring"] initiateSyncRequest: unexpected error invoking sync on feature 13 eventType 1 node 9#######-####-####-####-#########6c6 endpoint 9#######-####-####-####-#########6c6 source proton_eam_service entity 9#######-####-####-####-#########768: java.util.concurrent.TimeoutException
  • In subsequent full syncs the alarm is still present, because the original reporting NSX Manager node still has no record of the alarm being resolved. The following entries are observed in var/log/phonehome-coordinator/phonehome-coordinator.log:
INFO pool-46-thread-13380 FullSyncRequester 4471 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="monitoring"] FullSyncRequester: node 9#######-####-####-####-#########6c6, endpoint 9#######-####-####-####-#########768, result true
.
.
.
FATAL pool-118-thread-1 MonitoringServiceImpl 4471 MONITORING [nsx@6876 alarmId="c#######-####-####-####-#########6c6" alarmState="OPEN" comp="nsx-manager" entId="9#######-####-####-####-#########6c6" errorCode="MP701099" eventFeatureName="endpoint_protection" eventSev="CRITICAL" eventState="On" eventType="eam_status_down" level="FATAL" nodeId="9#######-####-####-####-#########768" subcomp="monitoring"] ESX Agent Manager (EAM) service on compute manager 9#######-####-####-####-#########6c6 is down.


Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
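
The pattern described above (an alarm logged as RESOLVED, a sync that times out, then the same alarm logged as OPEN again) can be checked for with a small script. The following is a minimal sketch, not a supported tool. It assumes the log lines carry key="value" fields in the format shown in the excerpts; the function names alarm_transitions, reopened_after_resolve, and sync_timeouts are hypothetical and chosen only for illustration.

```python
import re
from typing import Iterable, List, Set, Tuple

# Matches key="value" pairs such as alarmState="RESOLVED" in a log line.
# Assumption: fields follow the format shown in the excerpts above.
KV_RE = re.compile(r'(\w+)="([^"]*)"')


def alarm_transitions(lines: Iterable[str],
                      event_type: str = "eam_status_down") -> List[Tuple[str, str]]:
    """Return (alarmId, alarmState) for each line reporting the given event type."""
    result = []
    for line in lines:
        fields = dict(KV_RE.findall(line))
        if fields.get("eventType") == event_type and "alarmState" in fields:
            result.append((fields.get("alarmId", ""), fields["alarmState"]))
    return result


def reopened_after_resolve(transitions: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return alarm IDs that were logged as RESOLVED and later as OPEN again."""
    resolved: Set[str] = set()
    reopened: Set[str] = set()
    for alarm_id, state in transitions:
        if state == "RESOLVED":
            resolved.add(alarm_id)
        elif state == "OPEN" and alarm_id in resolved:
            reopened.add(alarm_id)
    return reopened


def sync_timeouts(lines: Iterable[str]) -> List[str]:
    """Return lines where a monitoring sync request failed with a timeout."""
    return [line for line in lines
            if "initiateSyncRequest" in line and "TimeoutException" in line]
```

To use it, read the lines of phonehome-coordinator.log from each NSX Manager node into a list and pass them to the functions. An alarm ID returned by reopened_after_resolve, combined with a non-empty result from sync_timeouts on another node, is consistent with the symptoms listed in this article, but it does not by itself confirm the cause.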

Environment

VMware NSX
VMware NSX-T Data Center

Cause

EAM is impacted and an alarm is raised while one of the NSX Manager nodes is the clusterEventLeader. The clusterEventLeader role then moves to a different NSX Manager node. After this change the EAM issue is resolved, but the alarm does not clear because of a bug in the alarm framework that is encountered when the leader changes.

Resolution

This issue is resolved in VMware NSX 4.2.0, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.



Workaround:

Restart the proton service on the NSX Manager node that reports the "EAM Status Down" alarm.

Log in to the appliance as root and run the following command:

root@nsx-mngr:~# service proton restart 

Additional Information