NSX-T alarms report "Service Status Unknown" intermittently for multiple ESXi hosts
search cancel

NSX-T alarms report "Service Status Unknown" intermittently for multiple ESXi hosts

book

Article ID: 319129

calendar_today

Updated On:

Products

VMware NSX

Issue/Introduction

  • Intermittent "Service Status Unknown" alarms about nsx-opsagent and nsx-cfgagent such as: "The service nsx-opsagent has been unresponsive for 10 seconds"
  • On the NSX-T manager, in the /var/log/phonehome-coordinator logs:
2021-11-14T03:53:24.811Z FATAL pool-118-thread-1 MonitoringServiceImpl 4817 MONITORING [nsx@6876 alarmId="########-####-####-####-########f83b" alarmState="OPEN" comp="nsx-manager" entId="########-####-####-####-########04e7" errorCode="MP701099" eventFeatureName="infrastructure_service" eventSev="CRITICAL" eventState="On" eventType="service_status_unknown" level="FATAL" nodeId="########-####-####-####-########7773" subcomp="monitoring"] The service nsx-opsagent has been unresponsive for 10 seconds.

2021-11-14T03:56:04.805Z FATAL pool-118-thread-1 MonitoringServiceImpl 4817 MONITORING [nsx@6876 alarmId="########-####-####-####-########ccca" alarmState="OPEN" comp="nsx-manager" entId="########-####-####-####-########04c5" errorCode="MP701099" eventFeatureName="infrastructure_service" eventSev="CRITICAL" eventState="On" eventType="service_status_unknown" level="FATAL" nodeId="########-####-####-####-########7773" subcomp="monitoring"] The service nsx-cfgagent has been unresponsive for 10 seconds.


On the ESXi hosts, in the nsx-syslog logs:

2021-11-14T03:53:17Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="metric-nsx.daemon.daemon-health"] Error in echo Daemon:nsx-opsAgent, error: RPC Call
 Failed with CallStatus(code=TIMEOUT), trace:Traceback (most recent call last):   File "/usr/lib/vmware/netopa/lib/python/sha/contrib/metric/daemon_health.py", line 226, in echo     rsp = daemon_rpc_client.Echo(req, DEFAULT_TIMEOUT)   File "/opt/vmware/nsx-nestdb/python/vmware/nsx/rpc/client/client.py", line 244, in call     callback=SyncCallContext(timeout, client_context))   File "/usr/lib/vmware/nsx-com
2021-11-14T03:53:17Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="fork-monitor"] keep-alive check received resp {'stats': [{'peak_pending_req': 4, 're
sp_error': 0, 'req_error': 0, 'req_dropped_no_resource': 0, 'pending_req': 0, 'req': 1700, 'resp': 1699}, {0: {'req': 523, 'resp': 523}, 1: {'req': 509, 'resp': 509}, 2: {'req': 504, 'resp': 504}, 3: {'req':
85, 'resp': 85}, 4: {'req': 79, 'resp': 79}}], 'timestamp': 1261980.771895807, 'type': 1, 'executor': 3, 'seq': 1699} for req{'timestamp': 1261974.2128648979, 'timeout': 4, 'check': True, 'type': 1, 'seq': 16
99}^@

2021-11-14T03:53:17Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO"] PSM:nsx-opsagent ts is 1261979 interval is 42^@
2021-11-14T03:53:17Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="CRITICAL" eventState="On" eventFeatureName="infrastructure_service" eventType="service_status_unkn
own" eventSev="critical"] The service nsx-opsagent has been unresponsive for 10 seconds.^@
2021-11-14T03:53:17Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="metric-nsx.daemon.daemon-health"] Succeed in destroying rpc client:nsx-opsAgent^@
2021-11-14T03:53:18Z nsx-exporter: NSX 2104636 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2104662" level="INFO"] Flow count diffs in actual (21966) vs in FLOW_GET_NUMRECORDS (27900)

2021-11-14T03:53:20Z nsx-opsagent[2104462]: NSX 2104462 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="2104473" level="INFO"] RpcConnection[1971 Connected on tcp://127.0.0.1:4098 0] Closing (network error)
2021-11-14T03:53:20Z nsx-opsagent[2104462]: NSX 2104462 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="2104462" level="ERROR" errorCode="RPC103"] RpcCall:Server:Unary[vmware.nsx.daemon.DaemonHealth/Echo, 0x0000, LOCAL_ERROR] Is in error state (COMMUNICATION_ERROR, status is not reported to the Client)

2021-11-14T03:53:20Z nsx-opsagent[2104462]: NSX 2104462 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="2104462" level="ERROR" errorCode="RPC104"] RpcCall:Server:Unary[vmware.nsx.daemon.DaemonHealth/Echo, 0x0000, ERROR] Can't send a message and complete the call (with status SUCCESS) in error state

2021-11-14T03:53:22Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO"] PSM:nsx-cfgagent ts is 1262030 interval is 18^@
2021-11-14T03:53:23Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="metric-nsx.daemon.daemon-health"] Error in echo Daemon:nsx-cfgAgent, error: RPC Call Failed with CallStatus(code=TIMEOUT), trace:Traceback (most recent call last):   File "/usr/lib/vmware/netopa/lib/python/sha/contrib/metric/daemon_health.py", line 226, in echo     rsp = daemon_rpc_client.Echo(req, DEFAULT_TIMEOUT)   File "/opt/vmware/nsx-nestdb/python/vmware/nsx/rpc/client/client.py", line 244, in call     callback=SyncCallContext(timeout, client_context))   File "/usr/lib/vmware/nsx-common/lib/python/google/protobuf/service_reflection.py", line 267, in <lambda>     self._StubMethod(
2021-11-14T03:53:24Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="fork-monitor"] keep_alive req sent: {'timestamp': 1262048.165605505, 'timeout': 4, 'check': True, 'type': 1, 'seq': 1701}^@



Environment

VMware NSX
VMware NSX-T Data Center

Cause

The alarms are reported according to the heartbeat received by the SHA daemon from the monitored services.
Occasionally, a service main thread could be blocked for some unknown reason and the timer to trigger the heartbeat transmission could not be scheduled. In this case, the SHA daemon will consider the related service down and post an alarm. Once the timer is scheduled again, the alarm is cleared.
When the above behavior persists, the alarms will go on and off intermittently.

Resolution

This is a cosmetic issue and this issue is resolved in VMware NSX 4.0.0.1, available at Broadcom downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB

Workaround:
To confirm the alarms triggered match this known cosmetic issue the following steps can be taken.

Check the ESXi host logs in /var/run/log/syslog.log:

1) If the opsAgent service was NOT restarted then the issue is only cosmetic.
2) If the opsAgent service was stopped / restarted then there might be a real issue.

To see if the NSX opsAgent was actually stopped, check /var/run/log/syslog.log for the presence of logging similar to:

2022-05-31T14:41:57Z NSX: opsAgent stop
2022-05-31T14:41:57Z NSX: Shutting down opsAgent watchdog
2022-05-31T14:41:57Z watchdog-opsAgent: Watchdog for opsAgent is now 1055850
2022-05-31T14:41:57Z watchdog-opsAgent: Terminating watchdog process with PID 1055850
2022-05-31T14:41:57Z watchdog-opsAgent: [1055850] Signal received: exiting the watchdog
2022-05-31T14:41:57Z NSX: Shutting down opsAgent service
2022-05-31T14:42:03Z NSX: Unable to shutdown opsAgent service
2022-05-31T14:42:03Z NSX: Terminating opsAgent service

Similarly if the NSX opsAgent was started:

2022-05-31T14:42:50Z NSX: opsAgent start
2022-05-31T14:42:50Z NSX: /var/log/nsx exists
2022-05-31T14:42:50Z NSX: opsAgent start watchdog
2022-05-31T14:42:50Z watchdog-opsAgent: [6349499] Begin '/usr/lib64/vmware/nsx-opsagent/bin/opsAgent', mn-uptime = 60, max-quick-failures = 100, max-total-failures = 1000, bg_pid_file = '', reboot-flag = '0'
2022-05-31T14:42:50Z watchdog-opsAgent: Executing '/usr/lib64/vmware/nsx-opsagent/bin/opsAgent'
2022-05-31T14:42:50Z NSX: opsAgent started
2022-05-31T14:42:51Z opsAgent: Use default syslog
2022-05-31T14:42:51Z opsAgent: NSX 6349509 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="framewok" tid="6349509" level="INFO"] NSX OpsAgent start...