False positive Edge service status changed alarm reported on NSX manager UI



Article ID: 433852



Products

VMware NSX

Issue/Introduction

NSX Edge nodes may raise an "edge_service_status_changed" alarm indicating a transition from "STARTED to STOPPED".

The Edge /var/log/syslog contains the alarm trigger events, showing the service status changing from "STARTED to STOPPED":

YYYY-MM-DD-T-HH-mm-ss Edge NSX 2465 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING" eventFeatureName="infrastructure_service" eventType="edge_service_status_changed" eventSev="warning" eventState="On" entId="0dec3851-####-####-####-6a8b34f64c85"] The service local-controller changed from STARTED to STOPPED.

YYYY-MM-DD-T-HH-mm-ss Edge NSX 2465 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING" eventFeatureName="infrastructure_service" eventType="edge_service_status_changed" eventSev="warning" eventState="On" entId="5df5a1d5-####-####-####-a75c9651bff4"] The service nsd changed from STARTED to STOPPED.

YYYY-MM-DD-T-HH-mm-ss Edge NSX 2465 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING" eventFeatureName="infrastructure_service" eventType="edge_service_status_changed" eventSev="warning" eventState="On" entId="0e2e2ee3-####-####-####-af60f0f0f28e"] The service dataplane changed from STARTED to STOPPED.

YYYY-MM-DD-T-HH-mm-ss Edge NSX 2465 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING" eventFeatureName="infrastructure_service" eventType="edge_service_status_changed" eventSev="warning" eventState="On" entId="5f0d6336-####-####-####-b8f3f24e5153"] The service router-config changed from STARTED to STOPPED.

YYYY-MM-DD-T-HH-mm-ss Edge NSX 2465 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING" eventFeatureName="infrastructure_service" eventType="edge_service_status_changed" eventSev="warning" eventState="On" entId="dbac7a8d-####-####-####-7a40cd94744c"] The service nestdb changed from STARTED to STOPPED.

YYYY-MM-DD-T-HH-mm-ss Edge NSX 2465 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING" eventFeatureName="infrastructure_service" eventType="edge_service_status_changed" eventSev="warning" eventState="On" entId="acb9e1ec-####-####-####-05f72e472491"] The service dispatcher changed from STARTED to STOPPED.

No corresponding syslog events exist for a transition from "STOPPED to STARTED".
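The check described above (a STOPPED event with no later STARTED event for the same service) can be sketched as a small log scan. This is an illustrative helper, not an NSX tool: the function name `unmatched_stops` and the regular expression are assumptions based only on the message format shown in the syslog excerpts in this article.

```python
import re

# Pattern derived from the syslog excerpts in this article (an assumption,
# not a documented NSX log schema).
TRANSITION_RE = re.compile(
    r'eventType="edge_service_status_changed".*?'
    r'The service (?P<service>\S+) changed from (?P<old>[A-Z]+) to (?P<new>[A-Z]+)\.'
)


def unmatched_stops(lines):
    """Return the services whose most recent logged transition ended in STOPPED."""
    last_state = {}
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match:
            # Later lines overwrite earlier ones, so only the newest state is kept.
            last_state[match.group("service")] = match.group("new")
    return sorted(name for name, state in last_state.items() if state == "STOPPED")


if __name__ == "__main__":
    # Hypothetical usage: scan a copy of the Edge syslog.
    with open("/var/log/syslog", encoding="utf-8", errors="replace") as handle:
        for name in unmatched_stops(handle):
            print(f"{name}: STOPPED logged with no later STARTED event")
```

A service listed by this scan still needs to be confirmed against the live service state on the Edge node before the alarm is treated as a false positive.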

Local verification via >get services on the affected Edge node confirms all services remain in a running state.

No core dumps are generated, confirming that no service has crashed (>get core-dumps).

The alarm appears to be a false-positive where logs indicate services are STOPPED without an actual change in service state.

Environment

VMware NSX

Cause

The current alarm implementation expects systemctl to return valid output at all times. If the systemctl command itself fails to execute, a false-positive alarm is raised even though the service remains running and healthy.

Resolution

This issue will be fixed in VMware NSX 4.2.4 and VMware NSX 9.1.1.

Workaround: Run the > get services command from the admin CLI of the affected Edge node. If all services are reported as running, the alarm is a false positive and can be manually resolved/cleared within the NSX Manager UI.

The fix involves a logic update to the monitoring agent's alarm implementation. The agent now validates the specific systemctl exit codes and only triggers an edge_service_status_changed alarm if the service is explicitly reported as inactive or crashed. For any other failures in the execution of the systemctl command (such as timeouts or shell errors), the alarm will not be raised, effectively eliminating the false-positive transitions from STARTED to STOPPED.
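The decision logic described above can be illustrated with a minimal sketch. It is not the NSX monitoring agent's code: the function names (`classify`, `probe`, `should_raise_alarm`), the UNKNOWN state, and the choice of `systemctl is-active` as the probe are all assumptions made for illustration. The sketch relies only on the standard behavior of `systemctl is-active`, which prints the unit state and exits with 0 when the unit is active.

```python
import subprocess


def classify(returncode, stdout):
    """Map a `systemctl is-active` result to STARTED, STOPPED, or UNKNOWN."""
    state = stdout.strip()
    if returncode == 0 and state == "active":
        return "STARTED"
    if state in ("inactive", "failed"):
        # The service is explicitly reported as not running.
        return "STOPPED"
    # Empty or unexpected output: the probe failed, not the service.
    return "UNKNOWN"


def probe(service, timeout=5.0):
    """Run the probe; a timeout or execution error yields UNKNOWN, never STOPPED."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", service],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return "UNKNOWN"
    return classify(result.returncode, result.stdout)


def should_raise_alarm(previous, current):
    """Raise only on an explicit STARTED -> STOPPED transition."""
    return previous == "STARTED" and current == "STOPPED"
```

The key design choice is the third state: treating a failed probe as UNKNOWN rather than STOPPED means a transient command failure cannot by itself produce a STARTED to STOPPED transition.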