Unable to resolve EAM Status Down alarm in NSX-T

Products

VMware NSX

Issue/Introduction

ESX Agent Manager (EAM) Status Down alarms are triggered even though EAM shows no apparent issues.
After the alarm is resolved, it returns shortly afterwards.
When a user resolves the alarm, entries similar to the following are observed on an NSX Manager node in var/log/phonehome-coordinator/phonehome-coordinator.log:

2024-03-05T02:14:26.712Z FATAL http-nio-127.0.0.1-7449-exec-3 MonitoringServiceImpl 18560 MONITORING [nsx@6876 alarmId="cXXXXXX-42a3-0XXX-4XXX-cXXXXXXXX6c6" alarmState="RESOLVED" comp="nsx-manager" entId="9XXXXXX-42a3-0XXX-4XXX-cXXXXXXXX6c6" errorCode="MP701099" eventFeatureName="endpoint_protection" eventSev="CRITICAL" eventState="Off" eventType="eam_status_down" level="FATAL" nodeId="9XXXXXX-42a3-0XXX-4XXX-cXXXXXXXX6c6" subcomp="monitoring"] User resolved.

On the same reporting NSX Manager node (91fd1042... in the above example), a sync is requested so that the feature side returns the latest status. The following is observed in var/log/phonehome-coordinator/phonehome-coordinator.log:

2024-03-05T02:14:26.695Z INFO http-nio-127.0.0.1-7449-exec-3 MonitoringFacadeImpl 18560 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="monitoring"] bulkSetAndVerifyAlarmsStatus: setting requires sync for user resolved alarm cc582d81-d9aa-4ed3-a691-fd84a881f1e7

On a different NSX Manager node, a sync is triggered but fails. The following is observed in var/log/phonehome-coordinator/phonehome-coordinator.log:

2024-03-05T02:22:09.448Z INFO pool-45-thread-1 MonitoringSyncService 4471 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="monitoring"] Built Sync Request entityId: 9XXXXXX-42a3-0XXX-4XXX-cXXXXXXXX768, eventTypeId: 1, featureId: 13, sourceId: proton_eam_service
.
.
.
2024-03-05T02:22:19.448Z WARN pool-118-thread-1 MonitoringSyncProcessor 4471 MONITORING [nsx@6876 comp="nsx-manager" level="WARNING" subcomp="monitoring"] initiateSyncRequest: unexpected error invoking sync on feature 13 eventType 1 node 9XXXXXX-42a3-0XXX-4XXX-cXXXXXXXX6c6 endpoint 9XXXXXX-42a3-0XXX-4XXX-cXXXXXXXX6c6 source proton_eam_service entity 9XXXXXX-42a3-0XXX-4XXX-cXXXXXXXX6c6: java.util.concurrent.TimeoutException
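
As a rough aid for spotting this failure in a collected log, the following Python sketch extracts the fields from initiateSyncRequest lines that end in a TimeoutException. The function name find_sync_timeouts and the regular expression are illustrative assumptions based only on the sample line above; they are not part of NSX, and the message format may differ between releases.

```python
import re

# Pattern modelled on the sample WARN line above; the exact message
# format is an assumption and may vary between NSX releases.
SYNC_ERROR = re.compile(
    r"^(?P<timestamp>\S+) .*initiateSyncRequest: unexpected error invoking sync "
    r"on feature (?P<feature>\d+) eventType (?P<event_type>\d+) "
    r"node (?P<node>\S+) endpoint (?P<endpoint>\S+) "
    r"source (?P<source>\S+) entity (?P<entity>[^:\s]+): (?P<error>\S+)"
)


def find_sync_timeouts(lines):
    """Return one dict per sync request that failed with a TimeoutException."""
    failures = []
    for line in lines:
        match = SYNC_ERROR.match(line)
        # Keep only failures whose exception class is a TimeoutException.
        if match and match.group("error").endswith("TimeoutException"):
            failures.append(match.groupdict())
    return failures
```

The function takes any iterable of log lines, so it can be fed an open file handle for a copy of phonehome-coordinator.log taken from the node where the sync was triggered.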

In subsequent full syncs the alarm is still present, because the original reporting NSX Manager node still has no record of the alarm being resolved. The following is observed in var/log/phonehome-coordinator/phonehome-coordinator.log:

2024-03-05T03:53:01.274Z INFO pool-46-thread-13380 FullSyncRequester 4471 MONITORING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="monitoring"] FullSyncRequester: node 91fd1042-42a3-06ff-40ea-c884ce15d6c6, endpoint c0b8dee6-8d2f-4887-8c34-445c617ef7c1, result true
.
.
.
2024-03-05T03:53:09.211Z FATAL pool-118-thread-1 MonitoringServiceImpl 4471 MONITORING [nsx@6876 alarmId="7eaea370-6e8f-4457-83bb-03202cb17133" alarmState="OPEN" comp="nsx-manager" entId="9XXXXXX-42a3-0XXX-4XXX-cXXXXXXXX6c6" errorCode="MP701099" eventFeatureName="endpoint_protection" eventSev="CRITICAL" eventState="On" eventType="eam_status_down" level="FATAL" nodeId="91fd1042-42a3-06ff-40ea-c884ce15d6c6" subcomp="monitoring"] ESX Agent Manager (EAM) service on compute manager 9XXXXXX-42a3-0XXX-4XXX-cXXXXXXXX6c6 is down.
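
The resolve-then-reopen pattern shown in these entries can be checked for with a short Python sketch. It reads the key="value" pairs from each eam_status_down alarm line and reports every entId whose alarm was RESOLVED and later logged as OPEN again. The function names eam_alarm_events and reopened_entities are hypothetical helpers, and the sketch assumes the lines are supplied in chronological order and follow the format of the samples above.

```python
import re

# Matches key="value" pairs inside the [nsx@6876 ...] structured data block.
KEY_VALUE = re.compile(r'(\w+)="([^"]*)"')


def eam_alarm_events(lines):
    """Return (timestamp, entId, alarmState) for each eam_status_down alarm line."""
    events = []
    for line in lines:
        fields = dict(KEY_VALUE.findall(line))
        if fields.get("eventType") != "eam_status_down" or "alarmState" not in fields:
            continue
        # The timestamp is the first space-separated token of the line.
        timestamp = line.split(" ", 1)[0]
        events.append((timestamp, fields.get("entId"), fields["alarmState"]))
    return events


def reopened_entities(events):
    """Return entIds whose alarm was RESOLVED and later reported OPEN again."""
    resolved = set()
    reopened = []
    for _, ent_id, state in events:
        if state == "RESOLVED":
            resolved.add(ent_id)
        elif state == "OPEN" and ent_id in resolved and ent_id not in reopened:
            reopened.append(ent_id)
    return reopened
```

Grouping is done on entId rather than alarmId because, in the samples above, the reopened alarm carries a different alarmId from the one that was resolved.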

Environment

VMware NSX-T
VMware NSX-T Data Center

Cause

EAM experiences an impact, and an alarm is raised while one of the NSX Manager nodes is the clusterEventLeader. The clusterEventLeader then changes to a different node. After the change, the EAM issue is resolved, but the alarm does not clear because of a bug in the alarm framework that is encountered when the leader changes.

Resolution

This is a known issue impacting NSX.

Workaround:
Restart the proton service on the NSX Manager node reporting the alarm using the following command:
# service proton restart