NSX Manager cluster degraded and UI inaccessible/Compactor running Out Of Memory

Article ID: 378047

Products

VMware NSX

Issue/Introduction

  • You are unable to log in to the web interface of one or more NSX Managers directly (not through the VIP).
  • You are unable to log in through the VIP address of the NSX management cluster.
  • In vSphere, NSX-backed VMs cannot be migrated with vMotion or powered on.
  • You get an alert such as:
Some appliance components are not functioning properly.
Suberror : 15
Error code: 101
  • Logs to verify:
    • /var/log/cbm/tanuki.log

root@nsxmgr01:/var/log/cbm# grep "out of memory" tanuki.log

STATUS | wrapper | 2023/09/19 22:44:08 | The JVM has run out of memory. Requesting thread dump.
STATUS | wrapper | 2023/09/19 22:44:08 | The JVM has run out of memory. Restart JVM (Ignoring, already restarting).

    • /var/log/corfu/corfu-compactor-audit.log

grep -i "completed checkpoint for a822bed3-beb0-378a-9eca-3e3b462be3d4" /var/log/corfu/corfu-compactor-audit.log

2023-09-18T00:29:04.439Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for xxxxxxxx-beb0-378a-9eca-xxxxxxxxxxxx, entries(517175), cpSize(352347166) bytes at snapshot Token(epoch=1657, sequence=4250380630) in 437696 ms

2023-09-18T01:39:30.320Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for xxxxxxxx-beb0-378a-9eca-xxxxxxxxxxxx, entries(517355), cpSize(352388946) bytes at snapshot Token(epoch=1657, sequence=4250699297) in 437930 ms

  • The logs above show that the table with UUID xxxxxxxx-beb0-378a-9eca-xxxxxxxxxxxx, which is the Entity Barrier table, holds a very large number of entries (more than 517,000 in this example).

  • Usage of the /config partition on one of the managers grows significantly (65% in this case).
    • The output of get cluster status shows DB_SYNCING for DATASTORE.
  • /image/core/*.hprof files are created because corfu/compactor runs out of memory.
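
To see which tables dominate the compactor checkpoints, the "completed checkpoint" lines shown above can be summarized per table. The following is a minimal sketch, not a supported tool: the function name top_checkpoints and its arguments are hypothetical, and it assumes the log lines follow the exact format of the samples in this article.

```shell
#!/bin/sh
# Hypothetical helper (not part of NSX): list completed checkpoints,
# largest entry count first.
# Output columns: entries, cpSize in bytes, table UUID.
# Assumes the "completed checkpoint for <uuid>, entries(N), cpSize(M) bytes"
# line format shown in the samples above.
top_checkpoints() {
    log_file="$1"
    limit="${2:-10}"   # number of rows to print, defaults to 10
    grep "completed checkpoint for" "$log_file" |
        sed -n 's/.*completed checkpoint for \([^,]*\), entries(\([0-9]*\)), cpSize(\([0-9]*\)) bytes.*/\2 \3 \1/p' |
        sort -rn |
        head -n "$limit"
}

# Example invocation, using the log path from this article:
# top_checkpoints /var/log/corfu/corfu-compactor-audit.log 5
```

A table that repeatedly appears at the top with hundreds of thousands of entries, as the Entity Barrier table does in the sample output, is consistent with the symptoms described here.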

Environment

VMware NSX-T Data Center 3.x

VMware NSX

Cause

Stale entries that reference deleted objects exist in the Entity Barrier table. Because these entries make the table very large, corfu and the compactor run out of memory.

Resolution

- This issue is resolved in NSX 4.1.1 and later, which include a code fix that automatically detects stale entries and removes them from the Entity Barrier table.

For a workaround, please open a new case with the VMware by Broadcom Global Support team and reference this KB article.
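
To check whether a given NSX version string is at or above the fixed release (4.1.1), a version comparison can be scripted. This is a minimal sketch under stated assumptions: the function name is_fixed_version is hypothetical, the version string is passed in by hand, and the comparison relies on the -V (version sort) option of GNU sort.

```shell
#!/bin/sh
# Hypothetical helper (not part of NSX): succeeds when the given version
# is 4.1.1 or later. Assumes GNU sort with the -V (version sort) option.
is_fixed_version() {
    fixed="4.1.1"
    # The lower of the two versions sorts first; if that is the fixed
    # release, the given version is equal to or newer than it.
    lowest=$(printf '%s\n%s\n' "$fixed" "$1" | sort -V | head -n 1)
    [ "$lowest" = "$fixed" ]
}

# Example invocation with a manually supplied version string:
# is_fixed_version 3.2.3 && echo "fix included" || echo "fix not included"
```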

Additional Information

If you are contacting Broadcom support about this issue, please provide the following: