- Intermittent "Service Status Unknown" alarms about nsx-opsagent and nsx-cfgagent such as: "The service nsx-opsagent has been unresponsive for 10 seconds"
- On the NSX-T manager, in the /var/log/phonehome-coordinator logs:
2021-11-14T03:53:24.811Z FATAL pool-118-thread-1 MonitoringServiceImpl 4817 MONITORING [nsx@6876 alarmId="########-####-####-####-########f83b" alarmState="OPEN" comp="nsx-manager" entId="########-####-####-####-########04e7" errorCode="MP701099" eventFeatureName="infrastructure_service" eventSev="CRITICAL" eventState="On" eventType="service_status_unknown" level="FATAL" nodeId="########-####-####-####-########7773" subcomp="monitoring"] The service nsx-opsagent has been unresponsive for 10 seconds.
2021-11-14T03:56:04.805Z FATAL pool-118-thread-1 MonitoringServiceImpl 4817 MONITORING [nsx@6876 alarmId="########-####-####-####-########ccca" alarmState="OPEN" comp="nsx-manager" entId="########-####-####-####-########04c5" errorCode="MP701099" eventFeatureName="infrastructure_service" eventSev="CRITICAL" eventState="On" eventType="service_status_unknown" level="FATAL" nodeId="########-####-####-####-########7773" subcomp="monitoring"] The service nsx-cfgagent has been unresponsive for 10 seconds.
On the ESXi hosts, in the nsx-syslog logs:
2021-11-14T03:53:17Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="metric-nsx.daemon.daemon-health"] Error in echo Daemon:nsx-opsAgent, error: RPC Call
Failed with CallStatus(code=TIMEOUT), trace:Traceback (most recent call last): File "/usr/lib/vmware/netopa/lib/python/sha/contrib/metric/daemon_health.py", line 226, in echo rsp = daemon_rpc_client.Echo(req, DEFAULT_TIMEOUT) File "/opt/vmware/nsx-nestdb/python/vmware/nsx/rpc/client/client.py", line 244, in call callback=SyncCallContext(timeout, client_context)) File "/usr/lib/vmware/nsx-com
2021-11-14T03:53:17Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="fork-monitor"] keep-alive check received resp {'stats': [{'peak_pending_req': 4, 're
sp_error': 0, 'req_error': 0, 'req_dropped_no_resource': 0, 'pending_req': 0, 'req': 1700, 'resp': 1699}, {0: {'req': 523, 'resp': 523}, 1: {'req': 509, 'resp': 509}, 2: {'req': 504, 'resp': 504}, 3: {'req':
85, 'resp': 85}, 4: {'req': 79, 'resp': 79}}], 'timestamp': 1261980.771895807, 'type': 1, 'executor': 3, 'seq': 1699} for req{'timestamp': 1261974.2128648979, 'timeout': 4, 'check': True, 'type': 1, 'seq': 16
99}^@
2021-11-14T03:53:17Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO"] PSM:nsx-opsagent ts is 1261979 interval is 42^@
2021-11-14T03:53:17Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="CRITICAL" eventState="On" eventFeatureName="infrastructure_service" eventType="service_status_unkn
own" eventSev="critical"] The service nsx-opsagent has been unresponsive for 10 seconds.^@
2021-11-14T03:53:17Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="metric-nsx.daemon.daemon-health"] Succeed in destroying rpc client:nsx-opsAgent^@
2021-11-14T03:53:18Z nsx-exporter: NSX 2104636 - [nsx@6876 comp="nsx-esx" subcomp="agg-service" tid="2104662" level="INFO"] Flow count diffs in actual (21966) vs in FLOW_GET_NUMRECORDS (27900)
2021-11-14T03:53:20Z nsx-opsagent[2104462]: NSX 2104462 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="2104473" level="INFO"] RpcConnection[1971 Connected on tcp://127.0.0.1:4098 0] Closing (network error)
2021-11-14T03:53:20Z nsx-opsagent[2104462]: NSX 2104462 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="2104462" level="ERROR" errorCode="RPC103"] RpcCall:Server:Unary[vmware.nsx.daemon.DaemonHealth/Echo, 0x0000, LOCAL_ERROR] Is in error state (COMMUNICATION_ERROR, status is not reported to the Client)
2021-11-14T03:53:20Z nsx-opsagent[2104462]: NSX 2104462 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsx-rpc" tid="2104462" level="ERROR" errorCode="RPC104"] RpcCall:Server:Unary[vmware.nsx.daemon.DaemonHealth/Echo, 0x0000, ERROR] Can't send a message and complete the call (with status SUCCESS) in error state
2021-11-14T03:53:22Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO"] PSM:nsx-cfgagent ts is 1262030 interval is 18^@
2021-11-14T03:53:23Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="metric-nsx.daemon.daemon-health"] Error in echo Daemon:nsx-cfgAgent, error: RPC Call Failed with CallStatus(code=TIMEOUT), trace:Traceback (most recent call last): File "/usr/lib/vmware/netopa/lib/python/sha/contrib/metric/daemon_health.py", line 226, in echo rsp = daemon_rpc_client.Echo(req, DEFAULT_TIMEOUT) File "/opt/vmware/nsx-nestdb/python/vmware/nsx/rpc/client/client.py", line 244, in call callback=SyncCallContext(timeout, client_context)) File "/usr/lib/vmware/nsx-common/lib/python/google/protobuf/service_reflection.py", line 267, in <lambda> self._StubMethod(
2021-11-14T03:53:24Z nsx-sha: NSX 9543684 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="INFO" s2comp="fork-monitor"] keep_alive req sent: {'timestamp': 1262048.165605505, 'timeout': 4, 'check': True, 'type': 1, 'seq': 1701}^@