NSX Manager cluster intermittently degraded due to Proton or Compactor running Out Of Memory

Article ID: 377593

Products

VMware NSX
VMware NSX-T Data Center

Issue/Introduction

  • The environment has been upgraded from NSX-T 3.0/3.1 to 3.2.x or NSX 4.x.
    • This issue may also be observed if the manager nodes were recently rebooted.
  • The NSX Manager Proton wrapper log /var/log/proton/proton-tomcat-wrapper.log may show out of memory messages:

21122:STATUS | wrapper | [TIMESTAMP] | The JVM has run out of memory.  Requesting thread dump.
21128:STATUS | wrapper | [TIMESTAMP] | The JVM has run out of memory.  Requesting thread dump.
21137:STATUS | wrapper | [TIMESTAMP] | The JVM has run out of memory.  Requesting thread dump.
21143:STATUS | wrapper | [TIMESTAMP] | The JVM has run out of memory.  Requesting thread dump.
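
To quickly confirm these entries, a search similar to the following can be used (a sketch, assuming the default log location; zgrep also reads any rotated, gzipped copies):

zgrep -an "The JVM has run out of memory" /var/log/proton/proton-tomcat-wrapper.log*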

  • The Corfu compactor log /var/log/corfu/corfu-compactor-audit.log may show out of memory errors:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="gzip -f /image/core/compactor_oom.hprof"
#   Executing /bin/sh -c "gzip -f /image/core/compactor_oom.hprof"...
Aborting due to java.lang.OutOfMemoryError: Java heap space
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  INVALID (0xe0000000) at pc=0x0000000000000000, pid=14350, tid=[TID]
#  fatal error: OutOfMemory encountered: Java heap space
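
A similar search can be used for the compactor log (again a sketch, assuming the default log location and rotation):

zgrep -a "OutOfMemoryError" /var/log/corfu/corfu-compactor-audit.log*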

  • Compactor logs show that the ApiTracker table (UUID ########-####-####-####-#######4297a) contains a large number of entries (more than 500,000). The example below shows 5 million entries.

find . -iname "corfu-compactor-audit.log" | xargs zgrep -a "completed checkpoint for" | grep "4297a" | tail
[TIMESTAMP] | INFO  |              Cmpt-chkpter-9000 |   o.c.runtime.CheckpointWriter | appendCheckpoint: completed checkpoint for ########-####-####-####-#######4297a, entries(5000000), cpSize([SIZE]) bytes at snapshot Token(epoch=[EPOCH], sequence=[SEQ .No]) in [TIME TO PROCESS] ms
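
To list only the entry counts reported for this table over time, the command above can be extended (a sketch, assuming GNU grep with -o support and that it is run from the directory containing the compactor logs):

find . -iname "corfu-compactor-audit.log*" | xargs zgrep -a "completed checkpoint for" | grep "4297a" | grep -o "entries([0-9]*)" | tail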

  • In the NSX Manager log /var/log/proton/nsxapi.log, note the absence of the following messages when running this grep:

zgrep -a "atches for handling deletion of APIs. Total No. of r" nsxapi.*

INFO RealizationServiceMaintenanceExecutor-0 RealizationServiceMaintenanceManager 2083488 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Created '1' batches for handling deletion of APIs. Total No. of requests = '59', Batch-size = '100'.
INFO RealizationServiceMaintenanceExecutor-0 RealizationServiceMaintenanceManager 2083488 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Created '1' batches for handling deletion of APIs. Total No. of requests = '13', Batch-size = '100'.

  • A core dump was generated with the file name proton_oom.hprof.gz:

nsx_manager1> get core-dumps
Directory: /image/core
123456     [TIMESTAMP]  proton_oom.hprof.gz

  • There is an open alarm: "Application on NSX node has crashed".

Environment

VMware NSX-T Data Center 3.2.x
VMware NSX 4.x

Cause

The upgrade caused invalid data to be added to the EntityDeletionMarker table.

As a result of this invalid data, the maintenance job that routinely cleans up the ApiTracker table fails. This failure is seen as the absence of the following log messages in /var/log/proton/nsxapi.log:

Created '1' batches for handling deletion of APIs. Total No. of requests = '13', Batch-size = '100'

Resolution

If you believe you have encountered this issue, please open a support case with Broadcom Support and refer to this KB article.

For more information, see Creating and managing Broadcom support cases.

Additional Information

If you are contacting Broadcom support about this issue, please provide the following:

  • Log bundles from all NSX Manager nodes involved.

Handling Log Bundles for offline review with Broadcom support
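
Support bundles can be collected from the NSX UI under System > Support Bundle or, as a sketch that assumes the support-bundle command is available in your NSX CLI release, per node from the admin CLI:

nsx_manager1> get support-bundle file manager1_support.tgz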

Removing the core dump to resolve the alarm:
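
As a sketch only, and assuming Broadcom Support has confirmed that the dump is no longer needed for analysis, the file reported by get core-dumps can be deleted from the root shell of the affected manager, after which the alarm can be resolved from the NSX UI:

rm /image/core/proton_oom.hprof.gz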