Application has crashed on NSX node

Products

VMware NSX

Issue/Introduction

Event ID: infrastructure_service.application_crashed
Alarm Description:

Purpose: This alarm notifies the user that a node has reported an application crash. The alarm description identifies the node by its hostname or ID.

Impact: Services have crashed and the appliance generated core or heap dump files.

Alarms similar to the following appear in the NSX UI:

Application on NSX node <node> has crashed. The number of core files found is 1. Collect the Support Bundle including core dump files and contact VMware Support team.
Recommended Action: Collect Support Bundle for NSX node <nsx manager> using NSX Manager UI or API.

Messages in /var/log/syslog.log on the NSX appliance node (Unified Appliance, Edge, etc.) are similar to:

2023-05-19T02:50:34.898Z local-manager NSX 85581 MONITORING [nsx@6876 alarmId="e44e47ae-####-####-####-7a1#####d7ee" alarmState="OPEN" comp="nsx-manager" entId="####-####-####-####-####" errorCode="MP701099" eventFeatureName="infrastructure_service" eventSev="CRITICAL" eventState="On" eventType="application_crashed" level="FATAL" nodeId="####-####-####-####-d#####b" subcomp="monitoring"] Application on NSX node local-manager has crashed. The number of core files found is 1. Collect the Support Bundle including core dump files and contact VMware Support team.
If the node is an ESXi host transport node, similar messages can be found in /var/log/nsx-syslog.log:

2023-05-18T10:07:31Z nsx-sha: NSX 268653 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="CRITICAL" eventFeatureName="infrastructure_service" eventType="application_crashed" eventSev="critical" eventState="On" entId="####-####-####-####-####"] Application on NSX node has crashed. The number of core files found is 1. Collect the Support Bundle including core dump files and contact VMware Support team.
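
When searching large log files for this alarm, it can help to extract the key="value" fields from the [nsx@6876 ...] block rather than matching on free text. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, the sample line is shortened from the ESXi excerpt above, and the parser assumes no field value contains a closing square bracket.

```python
import re

# Matches key="value" pairs inside the [nsx@6876 ...] structured-data block.
PAIR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_nsx_alarm_line(line):
    """Return the structured-data fields of one log line as a dict.

    Returns None when the line has no [nsx@6876 ...] block.
    Assumption: no field value contains a ']' character.
    """
    start = line.find("[nsx@6876")
    if start == -1:
        return None
    end = line.find("]", start)
    if end == -1:
        return None
    fields = dict(PAIR_RE.findall(line[start:end]))
    # Everything after the block is the human-readable alarm text.
    fields["message"] = line[end + 1:].strip()
    return fields


def is_application_crash(fields):
    """True when the parsed fields describe the application_crashed event."""
    return (
        fields is not None
        and fields.get("eventFeatureName") == "infrastructure_service"
        and fields.get("eventType") == "application_crashed"
    )


# Shortened copy of the ESXi log excerpt, used only for illustration.
sample = (
    '2023-05-18T10:07:31Z nsx-sha: NSX 268653 - [nsx@6876 comp="nsx-esx" '
    'subcomp="nsx-sha" level="CRITICAL" eventFeatureName="infrastructure_service" '
    'eventType="application_crashed" eventSev="critical" eventState="On"] '
    'Application on NSX node has crashed. The number of core files found is 1.'
)

fields = parse_nsx_alarm_line(sample)
print(is_application_crash(fields), fields["comp"], fields["eventSev"])
```

Filtering on eventFeatureName and eventType together mirrors the Event ID infrastructure_service.application_crashed, so unrelated alarms from the same component are skipped.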

Validation:

Verify the presence of a core dump file from the CLI:

nsxcli> get core-dumps
Directory: /var/log/core
20762624     May 18 2023 11:44:13 UTC  core.nginx.1559278043.gz
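
If the output of get core-dumps has been captured to a text file, the full paths needed for later commands can be assembled from it. The sketch below is illustrative only: the function name is hypothetical, the layout is assumed to match the example in this article (a Directory: line followed by one line per file), and treating the first column as a size in bytes is an assumption.

```python
import re

# One entry line: <number> <Mon DD YYYY HH:MM:SS UTC> <file name>
ENTRY_RE = re.compile(
    r"^\s*(\d+)\s+(\w{3}\s+\d{1,2}\s+\d{4}\s+\d{2}:\d{2}:\d{2}\s+UTC)\s+(\S+)\s*$"
)


def parse_core_dumps(output):
    """Parse captured 'get core-dumps' output into a list of dicts.

    Assumption: the first column is the file size in bytes.
    Lines that do not match the expected layout are ignored.
    """
    directory = None
    entries = []
    for line in output.splitlines():
        if line.startswith("Directory:"):
            directory = line.split(":", 1)[1].strip()
            continue
        match = ENTRY_RE.match(line)
        if match and directory:
            size, timestamp, name = match.groups()
            entries.append(
                {
                    "path": f"{directory}/{name}",
                    "size_bytes": int(size),
                    "timestamp": timestamp,
                }
            )
    return entries


sample_output = """Directory: /var/log/core
20762624     May 18 2023 11:44:13 UTC  core.nginx.1559278043.gz
"""

for entry in parse_core_dumps(sample_output):
    print(entry["path"], entry["size_bytes"])
```

The path field is the form that the del core-dump and copy core-dump commands expect, since both take the full path of the core file.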

On an NSX appliance node, verify the associated service is running:

nsxcli> get service <service-name>
or
nsxcli> get services

Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values may vary depending on the environment.

Environment

VMware NSX versions 4.1.1.0 and 4.1.2.4

Cause

Services have crashed, and the system generated the respective core dump files.

All NSX services are configured to be auto-restarted in the event of a crash.

Depending on which application has crashed, other services that depend on it may not be functioning correctly.

Verify the status of the services that have crashed to confirm that they are running. In many cases, the alarms are first noticed after upgrading the NSX environment and did not appear prior to the upgrade. In these cases, a core dump may have been present for a long time without any issues having been noticed or any intervention steps taken.

On an NSX Manager, core files can be generated in either /var/log/core/ or /image/core/. There can also be external causes, such as network cable failures that contribute to network redundancy issues and vSAN connectivity problems.

Resolution

This alarm is no longer present in NSX 4.2.1 and above.

To resolve the alarm, delete the core dump files from the respective nodes. This activity has no impact on production.

NSX Appliance Manager and Edge

All core files can be deleted with one command from the admin shell:

admin> del core-dump all

Alternatively, the files can be deleted one by one:

admin> get core-dumps
Directory: /var/log/core
#######     May 18 2023 11:44:13 UTC  core.nginx.##########.gz

admin> del core-dump /var/log/core/core.nginx.##########.gz

ESXi host

Execute the following commands in the root shell console of the affected ESXi host:

For NSX version 4.1 or below:

root# rm -f /var/core/*

For NSX version 4.1.1 or above:

root# nsxcli -c del core-dump all
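
When cleaning up many hosts, the choice between the two ESXi commands above can be made from the NSX version string. This is a minimal sketch, not a supported tool: the function name is hypothetical, and it assumes that "4.1 or below" means any version lower than 4.1.1 and that version strings contain only dot-separated numbers.

```python
def esxi_cleanup_command(nsx_version):
    """Return the ESXi core dump cleanup command for a given NSX version.

    Assumption: versions lower than 4.1.1 use rm, and 4.1.1 or above
    use nsxcli, as described in this article.
    Raises ValueError if a version component is not numeric.
    """
    parts = tuple(int(p) for p in nsx_version.split("."))
    # Pad to three components so that "4.1" compares as (4, 1, 0).
    major_minor_patch = (parts + (0, 0, 0))[:3]
    if major_minor_patch >= (4, 1, 1):
        return "nsxcli -c del core-dump all"
    return "rm -f /var/core/*"


for version in ("4.1.0.2", "4.1.1.0", "4.1.2.4"):
    print(version, "->", esxi_cleanup_command(version))
```

Comparing tuples of integers avoids the ordering mistakes that plain string comparison would make, for example treating "4.10" as lower than "4.2".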

Additional Information

If contacting Broadcom Support for this issue, provide the text of the alarm(s) from the NSX UI as well as the log files and core dump(s). Before deleting any core files, collect the latest support bundle from the nodes where the application crashed alarm is observed, selecting the option to include core dumps/sensitive information. Refer to Collect Support Bundles for details on how to collect the support bundle with core and audit logs.

In NSX version 4.1.1 or above, the core dump files can also be removed as part of the collection of a support bundle, with the command: get support-bundle

nsxcli> get support-bundle file support-bundle.tgz all remove-core-files

If needed, individual core dump files can be copied to a remote location from NSX appliance nodes with the admin CLI command: copy core-dump
Note that the full path of the core file must be given, as shown in the output of the admin CLI command: get core-dumps
Replace the path and filename with your values.

nsxcli>  get core-dumps
Directory: /var/log/core
########     May 18 2023 11:44:13 UTC  core.nginx.##########.gz

nsxcli> copy core-dump /var/log/core/core.nginx.##########.gz url scp://root@<Remote location IP address>/tmp/
root@<Remote location IP address>'s password:

The following articles detail known application crash issues: