Application on NSX node has crashed alarm

Products

VMware NSX

Issue/Introduction

Event ID: infrastructure_service.application_crashed
Alarm Description:

Purpose: This alarm notifies the user that a node, identified by its hostname or ID in the alarm description, has reported an application crash.
Impact: One or more services have crashed, and the appliance has generated the corresponding core or heap dump files.
- Alarms similar to the following appear in the NSX UI:
  Application on NSX node <node> has crashed. The number of core files found is 1. Collect the Support Bundle including core dump files and contact VMware Support team.
  Recommended Action: Collect Support Bundle for NSX node <nsx manager> using NSX Manager UI or API.
- Messages similar to the following appear in /var/log/syslog.log on the NSX appliance node (Unified Appliance, Edge, etc.):
  2023-05-19T02:50:34.898Z local-manager NSX 85581 MONITORING [nsx@6876 alarmId="e44e47ae-####-####-####-7a159b72d7ee" alarmState="OPEN" comp="nsx-manager" entId="340cd33e-####-####-####-ff3b6fc90faf" errorCode="MP701099" eventFeatureName="infrastructure_service" eventSev="CRITICAL" eventState="On" eventType="application_crashed" level="FATAL" nodeId="d1be0142-####-####-####-d5ae7b37180b" subcomp="monitoring"] Application on NSX node local-manager has crashed. The number of core files found is 1. Collect the Support Bundle including core dump files and contact VMware Support team.
- If the node is an ESXi host transport node, the same messages can be found in /var/log/nsx-syslog.log:
  2023-05-18T10:07:31Z nsx-sha: NSX 268653 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="CRITICAL" eventFeatureName="infrastructure_service" eventType="application_crashed" eventSev="critical" eventState="On" entId="76a85727-####-####-####-a064668252f0"] Application on NSX node has crashed. The number of core files found is 1. Collect the Support Bundle including core dump files and contact VMware Support team.

Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values will vary.
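
When searching large log files for this alarm, it can help to match on the structured fields rather than the free-text message. The following is a minimal Python sketch, not an NSX tool: it assumes the key="value" layout inside the [nsx@6876 ...] block shown in the excerpts above, and the function names parse_nsx_event and is_application_crash are hypothetical.

```python
import re

# Assumption: structured data appears as [nsx@6876 key="value" ...],
# as in the example log lines above.
_BLOCK = re.compile(r'\[nsx@6876 ([^\]]*)\]')
_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_nsx_event(line):
    """Return the structured fields of one log line as a dict, or None if absent."""
    block = _BLOCK.search(line)
    if block is None:
        return None
    return dict(_PAIR.findall(block.group(1)))


def is_application_crash(line):
    """True if the line carries the application_crashed event of infrastructure_service."""
    fields = parse_nsx_event(line)
    return (
        bool(fields)
        and fields.get("eventFeatureName") == "infrastructure_service"
        and fields.get("eventType") == "application_crashed"
    )
```

For example, filtering the lines of a copied log file with is_application_crash leaves only the entries for this alarm, and parse_nsx_event then exposes fields such as comp, entId, or nodeId where they are present.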

Environment

VMware NSX 4.x

This alarm is no longer present in NSX 4.2.1 and above.

Cause

Services have crashed, and the system generated the corresponding core dump files. All NSX services are configured to restart automatically after a crash. Depending on which application crashed, services that depend on it may not function correctly, so it is recommended to verify that the crashed services are running again. On an NSX Manager, core files can be generated in either /var/log/core/ or /image/core/.

On an NSX appliance node, verify the status of a service from the CLI as below:
nsxcli> get service <service-name>
or
nsxcli> get services
An application crash should generate a core or heap dump on the NSX node, which can be verified from the CLI as below:
nsxcli> get core-dumps
Directory: /var/log/core
20762624 May 18 2023 11:44:13 UTC core.nginx.1559278043.gz

Note: In the above example output, the service nginx crashed and the system generated a core dump file.

nsx_manager1> get core-dumps
Directory: /image/core
123456 Aug 30 2024 18:00:04 UTC proxy_oom.hprof
Note: In the above example output, the proxy service crashed with an out-of-memory error and the system generated a heap dump file.
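
If the output of get core-dumps has been saved to a text file, the listing can be turned into full paths for later use with the copy and delete commands. The following is a minimal Python sketch under stated assumptions: it expects the layout shown in the two examples above (a "Directory:" line followed by entries of size, timestamp, and filename), and the names CoreDump and parse_core_dumps are hypothetical, not part of NSX.

```python
import re
from typing import NamedTuple


class CoreDump(NamedTuple):
    directory: str
    size_bytes: int
    timestamp: str
    filename: str

    @property
    def path(self):
        # Full path, as required by the copy and delete commands.
        return f"{self.directory}/{self.filename}"


# Assumed entry layout: <size> <Mon> <day> <year> <HH:MM:SS> <TZ> <filename>
_ENTRY = re.compile(
    r'^(\d+)\s+(\w{3}\s+\d{1,2}\s+\d{4}\s+\d{2}:\d{2}:\d{2}\s+\w+)\s+(\S+)$'
)


def parse_core_dumps(output):
    """Parse saved 'get core-dumps' output into a list of CoreDump entries."""
    dumps = []
    directory = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Directory:"):
            directory = line.split(":", 1)[1].strip()
            continue
        match = _ENTRY.match(line)
        if match and directory is not None:
            dumps.append(
                CoreDump(directory, int(match.group(1)), match.group(2), match.group(3))
            )
    return dumps
```

Lines that do not match the assumed layout, such as the CLI prompt itself, are skipped rather than raising an error.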

Resolution

Recommended Action:

NSX services are configured to auto-restart after experiencing a crash. Alarms are generated to draw attention to such crashes so that users can ensure their environment is running properly.
In many cases, the alarms are noticed after upgrading the NSX environment and did not appear prior to the upgrade. In these cases, a core dump may have been present for a long time even without any issues having been noticed or any intervention steps taken.
In some cases, an application crash may cause dependent services to not function correctly, so it is recommended to verify the service status and confirm that all related services are running. Generally, service issues are not expected.
Though services will normally auto-restart without any additional problems, if repeated crashes are noticed or there are indications of service issues present, customers should engage Broadcom support for verification or additional analysis.

In order to report application crash issues, use the steps below:

1. Collect the latest support bundle, with the options for core dumps and audit logs selected, from the nodes where the application crashed alarm is observed. Please refer to Collect Support Bundles for details on how to collect the support bundle with core and audit logs.
2. Individual core dump files can be copied to a remote location from NSX appliance nodes with the admin CLI command: copy core-dump
  Note that the full path must be given for the core file, as shown in the output of the admin CLI command: get core-dumps
  Replace the path and filename with your values.
  
  nsxcli> get core-dumps
  Directory: /var/log/core
  20762624 May 18 2023 11:44:13 UTC core.nginx.1559278043.gz
  nsxcli> copy core-dump /var/log/core/core.nginx.1559278043.gz url scp://root@192.168.210.200/tmp/
  root@192.168.210.200's password:
3. If contacting Broadcom Support for this issue, provide the text of the alarm(s) from the NSX UI as well as the log files and core dump(s).
4. After collecting the support-bundle, the application crashed alarm can be resolved by removing the core dump files from the respective nodes.
  1. On NSX appliance nodes, core and heap dump files can be removed with the command: del core-dump
    Note that the full path must be given for the core file, as shown in the output of the command: get core-dumps
    Replace the path and filename with your values.
    
    nsxcli> get core-dumps
    Directory: /var/log/core
    20762624 May 18 2023 11:44:13 UTC core.nginx.1559278043.gz
    nsxcli> del core-dump /var/log/core/core.nginx.1559278043.gz
    or
    nsxcli> del core-dump all
    
    Using the "all" option above will delete core files from all locations where they may be generated on that appliance. For example, on the NSX Manager it will remove them from both /var/log/core/ and /image/core/.
    
    In NSX version 4.1.1 or above, the core dump files can also be removed as part of the collection of a support bundle, with the command: get support-bundle
    
    nsxcli> get support-bundle file support-bundle.tgz all remove-core-files
  2. On ESXi host transport nodes, use the following commands, depending on the NSX version, to remove core dump files:
    1. For NSX version 4.1 or below:
      Run the command below in the shell console of the ESXi host:
      
      root> rm -rf /var/core
    2. For NSX version 4.1.1 or above:
      Run one of the commands below in the NSX CLI of the ESXi host:
      
      nsxcli> del core-dump all
      or
      nsxcli> del core-dump <core-dump-file>
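
The copy and cleanup steps above can be sketched as small helpers that only build the command text to paste into the CLI; they do not connect to or run anything on a node. This is an illustrative Python sketch: the function names are hypothetical, and the version check treats any release below 4.1.1 (including 4.1.0.x builds) as the "4.1 or below" case, which is an interpretation of the version split described above.

```python
def copy_core_dump_command(core_path, scp_url):
    """Build the appliance CLI command that copies one core dump to a remote URL."""
    return f"copy core-dump {core_path} url {scp_url}"


def delete_core_dump_command(core_path=None):
    """Build the CLI command that deletes one core dump, or all when no path is given."""
    return f"del core-dump {core_path}" if core_path else "del core-dump all"


def esxi_cleanup_command(nsx_version):
    """Return (console, command) for removing core dumps on an ESXi transport node.

    Assumption: versions are dotted integers such as "4.1" or "4.1.1";
    only the first three components are compared.
    """
    parts = tuple(int(p) for p in nsx_version.split(".")[:3])
    parts = parts + (0,) * (3 - len(parts))  # pad "4.1" to (4, 1, 0)
    if parts >= (4, 1, 1):
        return ("nsxcli", "del core-dump all")
    return ("shell", "rm -rf /var/core")
```

Because the helpers return strings, the generated command can be reviewed before it is run, which matters for the destructive delete commands. Collect the support bundle first, as described in step 4.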

Maintenance window required for remediation? No

Additional Information

The following articles detail known core dump issues and the steps to identify which service crash generated a core dump.