Application has crashed alarm for NSX Edge node and "core.resolver-execut.<timestamp>.gz" core dumps found

Article ID: 373772

Products

VMware NSX

Issue/Introduction

  • Alarms similar to the following appear in the NSX UI:
Application on NSX node <node> has crashed. The number of core files found is 1. Collect the Support Bundle including core dump files and contact VMware Support team. Recommended Action Collect Support Bundle for NSX node <nsx manager> using NSX Manager UI or API.
  • In /var/log/syslog.log on the NSX appliance node (Unified Appliance, Edge, etc.), messages similar to the following are logged:
2023-05-19T02:50:34.898Z local-manager NSX 85581 MONITORING [nsx@6876 alarmId="#######-8c4c-47aa-85a9-#########" alarmState="OPEN" comp="nsx-manager" entId="340cd33e-####-####-####-ff3b6fc90faf" errorCode="MP701099" eventFeatureName="infrastructure_service" eventSev="CRITICAL" eventState="On" eventType="application_crashed" level="FATAL" nodeId="d1be0142-####-####-####-d5ae7b37180b" subcomp="monitoring"] Application on NSX node local-manager has crashed. The number of core files found is 1. Collect the Support Bundle including core dump files and contact VMware Support team.
  • In /var/core on the NSX Edge reported in the alarm, files named "core.resolver-execut.<timestamp>.gz" are present:
-rw-r--r--  1 root root  21M XXX  3 13:50 core.resolver-execut.XXXXXX22639.1253.991.6.gz
-rw-r--r--  1 root root  21M XXX  3 15:14 core.resolver-execut.XXXXXX7679.3203641.991.6.gz
  • In /var/log/syslog.log on the NSX Edge reported in the alarm, a log entry containing "failed to connect to all addresses" appears at around the same time as the core.resolver-execut.<timestamp>.gz file was generated:
2024-06-03T13:50:29.234Z ###-###-edge-a.######.####.com NSX 1253 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="tsdb-sender-napp"] Failed to send one msg timestamp: 1717421738#012entity: TIER0#012entity_id: "#######-4601-419b-a687-############"#012node_id: "#######-d25c-4a3c-9c65-##########"#012nsx_site_id: "#######-434e-4c51-b3af-##########"#012gfw {#012  obj_id: "#######-bb10-48f8-97d4-##########"#012  number_of_sessions: 0#012  number_of_bytes: 48986264#012}#012 from plugin #######-7846-417e-bf8d-##########:#012 <_InactiveRpcError of RPC that terminated with:#012#011status = StatusCode.UNAVAILABLE#012#011details = "failed to connect to all addresses"#012#011debug_error_string = "{"created":"@1717422629.233920228","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1717422629.233918134","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"#012>#012 Traceback (most recent call last):#012  File "/opt/vmware/nsx-netopa/lib/python/sha/core/channel/provider/tsdb_provider.py", line 671, in send_metrics#012    response = self._metric_stub.MetricsUpdate(msg, timeout=transmit_timeout,#012  File "/opt/vmware/nsx-netopa/lib/python/grpc/_channel.py", line 946, in __call__#012    return _end_unary_response_blocking(state, call, False, None)#012  File "/opt/vmware/nsx-netopa/lib/python/grpc/_channel.py", line 849, in _end_unary_response_blocking#012    raise _InactiveRpcError(state)#012grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:#012#011status = StatusCode.UNAVAILABLE#012#011details = "failed to connect to all addresses"#012#011debug_error_string = "{"created":"@1717422629.233920228","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1717422629.233918134","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"#012>
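
The symptom checks above can be scripted when several Edge nodes need to be reviewed. The following is a minimal sketch, not a supported VMware tool: the function names are hypothetical, the core file pattern is inferred only from the file names shown in this article, and the log filter assumes the entry carries subcomp="nsx-sha" as in the sample. It also expands the #012 (newline) and #011 (tab) escapes that syslog writes for control characters, which makes the embedded Python traceback readable.

```python
import re
from pathlib import Path

# Assumption: core dump names follow the pattern shown in this article.
CORE_PATTERN = re.compile(r"^core\.resolver-execut\..+\.gz$")
GRPC_ERROR = "failed to connect to all addresses"


def is_resolver_core(filename):
    """Return True if the file name looks like a resolver core dump."""
    return CORE_PATTERN.match(filename) is not None


def list_resolver_cores(core_dir="/var/core"):
    """Return the sorted resolver core dump names found in core_dir."""
    directory = Path(core_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir() if is_resolver_core(p.name))


def unescape_syslog(line):
    """Expand the octal escapes syslog uses for newline and tab."""
    return line.replace("#012", "\n").replace("#011", "\t")


def grpc_failures(lines):
    """Yield readable copies of nsx-sha log lines that report the gRPC connect failure."""
    for line in lines:
        if 'subcomp="nsx-sha"' in line and GRPC_ERROR in line:
            yield unescape_syslog(line.rstrip("\n"))
```

On an Edge node, list_resolver_cores() could be called with the default directory, and grpc_failures() could be fed the lines of an opened /var/log/syslog.log; the timestamps of the two results can then be compared by hand.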

Environment

VMware NSX 4.x

Cause

The resolver-execute process is an Edge DNS process that uses gRPC. A bug in gRPC can cause this process to crash during network interruptions, such as network-related maintenance or outages.

Resolution

This issue is fixed in VMware NSX 4.2.1 and later releases.

Note: Watchdog automatically restarts the resolver process after a crash, which serves as a workaround until the fixed release is installed.
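
To check whether a given node already runs a release that contains the fix, the version comparison can be sketched as follows. This is an illustrative helper, not part of NSX: the function names are hypothetical, and it assumes the version is available as a dotted numeric string whose first three components are major, minor, and maintenance numbers (any trailing components, such as a build number, are ignored).

```python
# Assumption: 4.2.1 is the first fixed release, as stated in this article.
FIXED_IN = (4, 2, 1)


def parse_version(version):
    """Split a dotted numeric version string into a tuple of integers."""
    return tuple(int(part) for part in version.strip().split("."))


def has_resolver_fix(version):
    """Return True if the version is at or above the first fixed release."""
    parts = parse_version(version)
    # Pad short versions so that "4.2" compares as 4.2.0.
    padded = parts + (0,) * (3 - len(parts))
    return padded[:3] >= FIXED_IN
```

For example, has_resolver_fix("4.1.2.3") evaluates to False and has_resolver_fix("4.2.1") evaluates to True.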

Additional Information

See KB 345792 for steps to clear the core dump files and the related VMware NSX alarm once the files are no longer needed.