NSX-T alarms report "Service Status Unknown" intermittently for multiple ESXi hosts

Products

VMware NSX

Issue/Introduction

Intermittent "Service Status Unknown" alarms about nsx-opsagent and nsx-cfgagent such as: "The service nsx-opsagent has been unresponsive for 10 seconds"
On the NSX-T manager, following events are reported regarding the ESXi hosts in the /var/log/phonehome-coordinator logs:

YYYY-MM-DDTHH:MM:SS FATAL pool-118-thread-1 MonitoringServiceImpl 4817 MONITORING [nsx@6876 alarmId="########-####-####-####-##########" alarmState="OPEN" comp="nsx-manager" entId="########-####-####-####-############" errorCode="MP701099" eventFeatureName="infrastructure_service" eventSev="CRITICAL" eventState="On" eventType="service_status_unknown" level="FATAL" nodeId="########-####-####-####-############" subcomp="monitoring"] The service nsx-opsagent has been unresponsive for 10 seconds.

YYYY-MM-DDTHH:MM:SS FATAL pool-118-thread-1 MonitoringServiceImpl 4817 MONITORING [nsx@6876 alarmId="########-####-####-####-########ccca" alarmState="OPEN" comp="nsx-manager" entId="########-####-####-####-############" errorCode="MP701099" eventFeatureName="infrastructure_service" eventSev="CRITICAL" eventState="On" eventType="service_status_unknown" level="FATAL" nodeId="########-####-####-####-############" subcomp="monitoring"] The service nsx-cfgagent has been unresponsive for 10 seconds

Following event logs are reported on the impacted ESXi hosts, in the /var/run/log/nsx-syslog logs:

YYYY-MM-DDTHH:MM:SS nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="metric-nsx.daemon.daemon-health"] Error in echo Daemon:nsx-opsAgent, error: RPC Call Failed with CallStatus(code=TIMEOUT), trace:Traceback (most recent call last): File "/usr/lib/vmware/netopa/lib/python/sha/contrib/metric/daemon_health.py", line 226, in echo rsp = daemon_rpc_client.Echo(req, DEFAULT_TIMEOUT) File "/opt/vmware/nsx-nestdb/python/vmware/nsx/rpc/client/client.py", line 244, in call callback=SyncCallContext(timeout, client_context)) File "/usr/lib/vmware/nsx-com

YYYY-MM-DDTHH:MM:SS nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="fork-monitor"] keep-alive check received resp {'stats': [{'peak_pending_req': 4, 'resp_error': 0, 'req_error': 0, 'req_dropped_no_resource': 0, 'pending_req': 0, 'req': 1700, 'resp': 1699}, {0: {'req': 523, 'resp': 523}, 1: {'req': 509, 'resp': 509}, 2: {'req': 504, 'resp': 504}, 3: {'req': 85, 'resp': 85}, 4: {'req': 79, 'resp': 79}}], 'timestamp': 1261980.771895807, 'type': 1, 'executor': 3, 'seq': 1699} for req{'timestamp': 1261974.2128648979, 'timeout': 4, 'check': True, 'type': 1, 'seq': 1699}^@

YYYY-MM-DDTHH:MM:SS nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO"] PSM:nsx-opsagent ts is 1261979 interval is 42^@

YYYY-MM-DDTHH:MM:SS nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="CRITICAL" eventState="On" eventFeatureName="infrastructure_service" eventType="service_status_unknown" eventSev="critical"] The service nsx-opsagent has been unresponsive for 10 seconds.^@

YYYY-MM-DDTHH:MM:SS nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="metric-nsx.daemon.daemon-health"] Succeed in destroying rpc client:nsx-opsAgent^@

YYYY-MM-DDTHH:MM:SS nsx-exporter: NSX 2104636 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2104662" level="INFO"] Flow count diffs in actual (21966) vs in FLOW_GET_NUMRECORDS (27900)

YYYY-MM-DDTHH:MM:SS nsx-opsagent[2104462]: NSX 2104462 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="2104473" level="INFO"] RpcConnection[1971 Connected on tcp://127.0.0.1:4098 0] Closing (network error)

YYYY-MM-DDTHH:MM:SS nsx-opsagent[2104462]: NSX 2104462 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="2104462" level="ERROR" errorCode="RPC103"] RpcCall:Server:Unary[vmware.nsx.daemon.DaemonHealth/Echo, 0x0000, LOCAL_ERROR] Is in error state (COMMUNICATION_ERROR, status is not reported to the Client)

.....

YYYY-MM-DDTHH:MM:SS nsx-opsagent[2104462]: NSX 2104462 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="2104462" level="ERROR" errorCode="RPC104"] RpcCall:Server:Unary[vmware.nsx.daemon.DaemonHealth/Echo, 0x0000, ERROR] Can't send a message and complete the call (with status SUCCESS) in error state

YYYY-MM-DDTHH:MM:SS nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO"] PSM:nsx-cfgagent ts is 1262030 interval is 18^@

YYYY-MM-DDTHH:MM:SS nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="metric-nsx.daemon.daemon-health"] Error in echo Daemon:nsx-cfgAgent, error: RPC Call Failed with CallStatus(code=TIMEOUT), trace:Traceback (most recent call last): File "/usr/lib/vmware/netopa/lib/python/sha/contrib/metric/daemon_health.py", line 226, in echo rsp = daemon_rpc_client.Echo(req, DEFAULT_TIMEOUT) File "/opt/vmware/nsx-nestdb/python/vmware/nsx/rpc/client/client.py", line 244, in call callback=SyncCallContext(timeout, client_context)) File "/usr/lib/vmware/nsx-common/lib/python/google/protobuf/service_reflection.py", line 267, in <lambda> self._StubMethod(

YYYY-MM-DDTHH:MM:SS nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="fork-monitor"] keep_alive req sent: {'timestamp': 1262048.165605505, 'timeout': 4, 'check': True, 'type': 1, 'seq': 1701}^@

Environment

VMware NSX
VMware NSX-T Data Center

Cause

The alarms are reported according to the heartbeat received by the SHA daemon from the monitored services.
Occasionally, a service main thread could be blocked for some unknown reason and the timer to trigger the heartbeat transmission could not be scheduled. In this case, the SHA daemon will consider the related service down and post an alarm. Once the timer is scheduled again, the alarm is cleared.
When the above behavior persists, the alarms will go on and off intermittently.

Resolution

This is a cosmetic issue and this issue is resolved in VMware NSX 4.0.0.1, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB

Workaround:

To confirm the alarms triggered match this known cosmetic issue the following steps can be taken.

Check the ESXi host logs in /var/run/log/syslog.log:

If the opsAgent service was NOT restarted then the issue is only cosmetic.
If the opsAgent service was stopped / restarted then there might be a real issue.

To see if the NSX opsAgent was actually stopped, check /var/run/log/syslog.log for the presence of logging similar to:

YYYY-MM-DDTHH:MM:SS NSX: opsAgent stop
YYYY-MM-DDTHH:MM:SS NSX: Shutting down opsAgent watchdog
YYYY-MM-DDTHH:MM:SS watchdog-opsAgent: Watchdog for opsAgent is now 1055850
YYYY-MM-DDTHH:MM:SS watchdog-opsAgent: Terminating watchdog process with PID 1055850
YYYY-MM-DDTHH:MM:SS watchdog-opsAgent: [1055850] Signal received: exiting the watchdog
YYYY-MM-DDTHH:MM:SS NSX: Shutting down opsAgent service
YYYY-MM-DDTHH:MM:SS NSX: Unable to shutdown opsAgent service
YYYY-MM-DDTHH:MM:SS NSX: Terminating opsAgent service

Similarly if the NSX opsAgent was started:

YYYY-MM-DDTHH:MM:SS NSX: opsAgent start
YYYY-MM-DDTHH:MM:SS NSX: /var/log/nsx exists
YYYY-MM-DDTHH:MM:SS NSX: opsAgent start watchdog
YYYY-MM-DDTHH:MM:SS watchdog-opsAgent: [6349499] Begin '/usr/lib64/vmware/nsx-opsagent/bin/opsAgent', mn-uptime = 60, max-quick-failures = 100, max-total-failures = 1000, bg_pid_file = '', reboot-flag = '0'
YYYY-MM-DDTHH:MM:SS watchdog-opsAgent: Executing '/usr/lib64/vmware/nsx-opsagent/bin/opsAgent'
YYYY-MM-DDTHH:MM:SS NSX: opsAgent started
YYYY-MM-DDTHH:MM:SS opsAgent: Use default syslog
YYYY-MM-DDTHH:MM:SS opsAgent: NSX 6349509 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="framewok" tid="6349509" level="INFO"] NSX OpsAgent start...

Additional Information

For more information regarding the NSX version 4.0.0.1, please go through the release notes: NSX 4.0.0.1 release note