NSX management service may crash when Health Check is enabled

Article ID: 318496

Products

VMware NSX

Issue/Introduction

  • NSX 3.x/4.x.
  • NSX Health Check is enabled.

    System -> Fabric -> Transport Zones -> Health Configuration
  • The NSX UI may raise an alarm "Application Crashed".
  • On the NSX Manager, /var/log/proton/proton-tomcat-wrapper.log shows that the JVM ran out of memory and a proton heap dump was generated:
    INFO   | jvm 1    | <TIMESTAMP> | java.lang.OutOfMemoryError: Java heap space
    STATUS | wrapper  | <TIMESTAMP> | The JVM has run out of memory.  Requesting thread dump.
    STATUS | wrapper  | <TIMESTAMP> | Dumping JVM state.
    INFO   | jvm 1    | <TIMESTAMP> | Dumping heap to /image/core/proton_oom.hprof ...

    ...
  • Proton out-of-memory (OOM) heap dumps are present under the /image directory on the NSX Manager:

    /image/core/proton_oom.hprof


Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
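
The symptoms above can be checked against a copy of the wrapper log with a short script. The following is a minimal sketch, not an official diagnostic tool: the function names scan_wrapper_log and scan_wrapper_log_file are hypothetical, and the matched strings are taken only from the example excerpt, so they may need adjusting for your environment.

```python
import re

# Signature strings taken from the example log excerpt in this article.
OOM_MARKER = "java.lang.OutOfMemoryError: Java heap space"
HEAP_DUMP_RE = re.compile(r"Dumping heap to (\S+)")


def scan_wrapper_log(lines):
    """Return (oom_count, heap_dump_paths) for an iterable of log lines."""
    oom_count = 0
    dump_paths = []
    for line in lines:
        if OOM_MARKER in line:
            oom_count += 1
        match = HEAP_DUMP_RE.search(line)
        # Record each distinct heap dump path once, in order of appearance.
        if match and match.group(1) not in dump_paths:
            dump_paths.append(match.group(1))
    return oom_count, dump_paths


def scan_wrapper_log_file(path="/var/log/proton/proton-tomcat-wrapper.log"):
    """Scan a wrapper log on disk (hypothetical helper; path from this article)."""
    with open(path, errors="replace") as handle:
        return scan_wrapper_log(handle)
```

A non-zero count together with a reported heap dump path such as /image/core/proton_oom.hprof is consistent with the issue described here, but it does not by itself confirm that Health Check is the cause.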

Environment

VMware NSX 4.1.0
VMware NSX-T Data Center 3.x

Cause

In large environments with a very high number of Transport Nodes, Health Check may consume a large amount of memory, causing the proton service to run out of Java heap space, crash, and restart.

Resolution

This issue is resolved in VMware NSX-T Data Center 3.2.4 and VMware NSX 4.1.2, available at Broadcom Downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.
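
To compare a deployed version against the fixed releases named above, a check along the following lines could be used. This is a sketch under stated assumptions: the helper names parse_version and has_fix are hypothetical, the version string is assumed to begin with a major.minor.patch triple, and later releases on the same release line are assumed to carry the fix. Versions outside 3.x and 4.x return None because this article does not cover them.

```python
import re

# First fixed release per major release line, as stated in the Resolution section.
FIXED_IN = {3: (3, 2, 4), 4: (4, 1, 2)}


def parse_version(text):
    """Extract the leading major.minor.patch triple from a version string."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    if not match:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(version_text):
    """Return True or False for 3.x and 4.x, or None for other release lines."""
    version = parse_version(version_text)
    fixed = FIXED_IN.get(version[0])
    if fixed is None:
        return None  # Assumption: only 3.x and 4.x are covered by this article.
    return version >= fixed
```

For example, has_fix("4.1.0") evaluates to False and has_fix("3.2.4") evaluates to True.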


Workaround

Disable Health Check

  1. Log in to the NSX UI.
  2. Go to System -> Fabric -> Transport Zones -> Health Configuration.
  3. Click Edit and set Health Check to Disabled.
  4. Click Save.

Additional Information

Impact/Risks:
Functional impact may not be observed because the proton service restarts automatically. However, if this crash occurs during a Manager upgrade, it may cause the upgrade to fail.

If this crash occurred in the past but an infrastructure alarm for it is still being reported in the NSX UI, refer to the following KB to resolve those alarms: