NSX-T UI becomes unavailable with Error code 101 when using AD with IDFW

Products

VMware vDefend Firewall
VMware NSX

Issue/Introduction

The NSX-T manager UI becomes unavailable displaying the following error:

Some appliance components are not functioning properly. 
Component health: SEARCH:UNKNOWN, MANAGER:UNKNOWN, NODE_MGMT:UP, UI:UP.
Error code: 101

In the NSX-T Manager CLI, logged in as admin, the following command fails:

get cluster status

On the NSX-T Managers, as the root user, you see one or more recent core dumps:

ls -l /image/core/

total 1191400
-rw------- 1 nsx-cbm nsx-cbm 45579417 Mar 8 15:06 cbm_oom.hprof.gz
-rw------- 1 root root 230343252 Mar 8 20:17 compactor_oom.hprof.gz
-rw------- 1 corfu corfu 944060040 Mar 9 14:14 corfu_oom.hprof.gz

In the root ('/') partition of the NSX-T Manager, you see a large number of files with names of the form: hs_err_pid####.log
- #### represents the PID of the process and will be different on your setup.
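
To gauge how many of these files have accumulated, a minimal Python sketch such as the following could be used. It assumes Python 3 is available wherever you run it, and the function name find_hs_err_logs is hypothetical, not part of NSX-T.

```python
import re
from pathlib import Path

# File name pattern taken from the article: hs_err_pid<PID>.log
HS_ERR = re.compile(r"^hs_err_pid(\d+)\.log$")


def find_hs_err_logs(directory):
    """Return {pid: path} for hs_err_pid files found directly in `directory`."""
    found = {}
    for entry in Path(directory).iterdir():
        match = HS_ERR.match(entry.name)
        if match and entry.is_file():
            found[int(match.group(1))] = entry
    return found
```

For example, len(find_hs_err_logs("/")) gives the number of such files in the root partition, and the keys are the PIDs of the processes that wrote them.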
The layout may not be complete for all managers, as seen in the following file:

/config/corfu/LAYOUT_CURRENT.ds
  "sequencers": [
    "192.168.1.131:9000",
    "192.168.1.133:9000",
    "192.168.1.132:9000"
  ],
  "segments": [
    {
      "replicationMode": "CHAIN_REPLICATION",
      "start": 0,
      "end": 40397089,
      "stripes": [
        {
          "logServers": [
            "192.168.1.131:9000"
          ]
        }
      ]
    },
    {
      "replicationMode": "CHAIN_REPLICATION",
      "start": 40397089,
      "end": 40397196,
      "stripes": [
        {
          "logServers": [
            "192.168.1.131:9000",
            "192.168.1.133:9000"
          ]
        }
      ]
    },
    {
      "replicationMode": "CHAIN_REPLICATION",
      "start": 40397196,
      "end": 40397804,
      "stripes": [
        {
          "logServers": [
            "192.168.1.131:9000",
            "192.168.1.133:9000"
          ]
        }
      ]
    },
    {
      "replicationMode": "CHAIN_REPLICATION",
      "start": 40397804,
      "end": -1,
      "stripes": [
        {
          "logServers": [
            "192.168.1.131:9000",
            "192.168.1.133:9000",
            "192.168.1.132:9000

From the above file, you can see that the following managers do not have the complete database synced to them:

Manager 192.168.1.133 is missing from replication:
"start": 0,
"end": 40397089,
Manager 192.168.1.132 is missing from replication:
"start": 0,
"end": 40397089,
And replication:
"start": 40397089,
"end": 40397196,
And replication:
"start": 40397196,
"end": 40397804,
Manager 192.168.1.131 is the only one with a complete database.
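
The segment-by-segment comparison above can be sketched in Python. This is an illustration only: it assumes the layout has been loaded into a dictionary with the same "sequencers" and "segments" keys shown in the excerpt, and that the "sequencers" list names every manager in the cluster. The helper names segment and missing_replicas are hypothetical.

```python
def missing_replicas(layout):
    """Map each server to the (start, end) segments that do not list it as a log server."""
    all_servers = set(layout["sequencers"])  # assumption: lists every manager
    gaps = {server: [] for server in sorted(all_servers)}
    for seg in layout["segments"]:
        present = set()
        for stripe in seg["stripes"]:
            present.update(stripe["logServers"])
        for server in sorted(all_servers - present):
            gaps[server].append((seg["start"], seg["end"]))
    return gaps


def segment(start, end, *servers):
    """Hypothetical helper that builds one segment in the shape shown in the excerpt."""
    return {"replicationMode": "CHAIN_REPLICATION", "start": start, "end": end,
            "stripes": [{"logServers": list(servers)}]}


# Values copied from the LAYOUT_CURRENT.ds excerpt above
M131, M132, M133 = "192.168.1.131:9000", "192.168.1.132:9000", "192.168.1.133:9000"
layout = {
    "sequencers": [M131, M133, M132],
    "segments": [
        segment(0, 40397089, M131),
        segment(40397089, 40397196, M131, M133),
        segment(40397196, 40397804, M131, M133),
        segment(40397804, -1, M131, M133, M132),
    ],
}

for server, segments in missing_replicas(layout).items():
    print(server, segments if segments else "complete")
```

A server with an empty list appears in every segment and therefore holds the complete database.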

In /var/log/corfu/corfu-compactor-audit.log, you see:

corfu-compactor-audit.9.log:2022-03-02T18:54:23.170Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 24113825 (exclusive).

corfu-compactor-audit.9.log:2022-03-02T19:09:22.956Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 24163751 (exclusive).

corfu-compactor-audit.log:2022-03-07T15:53:56.666Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).

...

corfu-compactor-audit.log:2022-03-09T12:57:25.040Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).

corfu-compactor-audit.log:2022-03-09T13:57:11.401Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).

corfu-compactor-audit.log:2022-03-09T14:42:42.964Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).

corfu-compactor-audit.log:2022-03-09T14:59:07.533Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).

corfu-compactor-audit.log:2022-03-09T15:15:43.943Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(4s), log address up to 28012247 (exclusive).

The above log entries show that the compactor is still running but is no longer trimming: the log address stays at 28012247 from 2022-03-07 onward instead of increasing.
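
The same check can be sketched in Python for a longer log. The sample lines are abbreviated copies of the entries above, the regular expression is based only on the line format shown in this article, and the function names are hypothetical.

```python
import re

# Matches the timestamp and the trim address in "UFO Trim completed" lines
TRIM = re.compile(
    r"(\d{4}-\d{2}-\d{2}T[\d:.]+Z).*UFO Trim completed.*log address up to (\d+)"
)


def trim_addresses(lines):
    """Return [(timestamp, address)] for every trim-completed line, in input order."""
    result = []
    for line in lines:
        match = TRIM.search(line)
        if match:
            result.append((match.group(1), int(match.group(2))))
    return result


def stalled_since(trims):
    """Return the timestamp at which the trailing run of identical addresses began.

    Returns None when the last two addresses differ or fewer than two entries exist.
    """
    if len(trims) < 2 or trims[-1][1] != trims[-2][1]:
        return None
    last_address = trims[-1][1]
    first_timestamp = trims[-1][0]
    for timestamp, address in reversed(trims):
        if address != last_address:
            break
        first_timestamp = timestamp
    return first_timestamp


# Abbreviated sample lines taken from the log excerpt above
sample = [
    "2022-03-02T18:54:23.170Z  INFO main UfoCompactor UFO Trim completed, elapsed(0s), log address up to 24113825 (exclusive).",
    "2022-03-02T19:09:22.956Z  INFO main UfoCompactor UFO Trim completed, elapsed(0s), log address up to 24163751 (exclusive).",
    "2022-03-07T15:53:56.666Z  INFO main UfoCompactor UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).",
    "2022-03-09T12:57:25.040Z  INFO main UfoCompactor UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).",
    "2022-03-09T15:15:43.943Z  INFO main UfoCompactor UFO Trim completed, elapsed(4s), log address up to 28012247 (exclusive).",
]
print(stalled_since(trim_addresses(sample)))
```

A non-empty result means the trim address has not moved since that timestamp, which matches the symptom described in this article.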
In the same log, the last completed checkpoints for table 3c54c60e-####-####-####-f03724af9649 contain a very large number of entries, are large in size, and took a long time to complete (about 15 and 77 minutes in the entries below):

2022-03-03T15:29:29.765Z  INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 3c54c60e-####-####-####-f03724af9649, entries(621841), cpSize(213995471) bytes at snapshot Token(epoch=13, sequence=27948473) in 934074 ms

2022-03-03T16:57:20.892Z  INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 3c54c60e-####-####-####-f03724af9649, entries(619376), cpSize(213145572) bytes at snapshot Token(epoch=13, sequence=28017532) in 4606838 ms
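
To put these numbers in more readable units, a minimal Python sketch like the one below could be used. The regular expression is based only on the two checkpoint lines shown here, and parse_checkpoint is a hypothetical helper name.

```python
import re

# Fields as they appear in the "completed checkpoint" lines above
CHECKPOINT = re.compile(
    r"completed checkpoint for (\S+), entries\((\d+)\), cpSize\((\d+)\) bytes"
    r".* in (\d+) ms"
)


def parse_checkpoint(line):
    """Return table id, entry count, size in MiB and duration in minutes, or None."""
    match = CHECKPOINT.search(line)
    if match is None:
        return None
    table, entries, size_bytes, millis = match.groups()
    return {
        "table": table,
        "entries": int(entries),
        "size_mib": int(size_bytes) / 2**20,
        "minutes": int(millis) / 60000,
    }


# Second checkpoint line from the excerpt above
line = (
    "2022-03-03T16:57:20.892Z  INFO main CheckpointWriter - appendCheckpoint: "
    "completed checkpoint for 3c54c60e-####-####-####-f03724af9649, "
    "entries(619376), cpSize(213145572) bytes at snapshot "
    "Token(epoch=13, sequence=28017532) in 4606838 ms"
)
info = parse_checkpoint(line)
print(f"{info['entries']} entries, {info['size_mib']:.0f} MiB, {info['minutes']:.1f} min")
```

For this line the checkpoint covers 619376 entries, slightly over 203 MiB, and took roughly 77 minutes.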

You can then see the corfu compactor service crash with an out-of-memory error while checkpointing the LoginLogoutEvent table:

2022-03-03T17:33:09.152Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] Starting checkpoint namespace: nsx, tableName: LoginLogoutEvent

2022-03-03T17:33:09.152Z INFO main MultiCheckpointWriter - appendCheckpoints: appending checkpoints for 1 maps

2022-03-03T17:33:09.164Z  INFO main CheckpointWriter - appendCheckpoint: Started checkpoint for 3c54c60e-####-####-####-f03724af9649 at snapshot Token(epoch=13, sequence=28250089)

......
Aborting due to java.lang.OutOfMemoryError: Java heap space
......
Aborted (core dumped)

2022-03-03T17:47:22.761Z  INFO Runner - Failed to run compactor tool: Command 'MALLOC_TRIM_THRESHOLD_=1310720 nice -n -10 java -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/var/log/corfu/compactor-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+UseStringDeduplication -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/image/core/compactor_oom.hprof -XX:OnOutOfMemoryError="gzip -f /image/core/compactor_oom.hprof" -XX:+CrashOnOutOfMemoryError -Xms963m -Xmx963m -Djava.io.tmpdir=/image/corfu-tools/temp -Djdk.nio.maxCachedBufferSize=1048576 -Dio.netty.recycler.maxCapacityPerThread=0 -DlogFilePrefix=/var/log/corfu/corfu-compactor-audit -Dlog4j.configurationFile=/opt/vmware/ufo-tools/corfu-compactor-log4j2.xml -Dcorfu-property-file-path=/opt/vmware/cbm/etc/ufo-factory.properties -cp "/opt/vmware/ufo-tools/*" com.vmware.nsx.platform.ufo.UfoCompactorMain -hostname 10.1.1.132 -hostname 10.1.1.133 -hostname 10.1.1.131 -port 9000 -trim -useDistributedLock -lockCorfuHostname 10.1.1.131 -lockCorfuPort 9000 -bulkReadSize 50' returned non-zero exit status 134.
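
The exit status 134 in the entry above is consistent with the "Aborted (core dumped)" message. By shell convention, a status above 128 means the process was terminated by signal number (status - 128), and 134 - 128 = 6, which is SIGABRT on Linux. A minimal Python sketch of that decoding follows; describe_exit_status is a hypothetical helper name.

```python
import signal


def describe_exit_status(status):
    """Describe a shell-style exit status; values above 128 usually encode a signal."""
    if status > 128:
        try:
            return f"terminated by {signal.Signals(status - 128).name}"
        except ValueError:
            pass  # no signal with that number on this platform
    return f"exited with status {status}"


print(describe_exit_status(134))
```

This matches the -XX:+CrashOnOutOfMemoryError option in the command line above, which makes the JVM abort and produce a core dump when it runs out of heap.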

Other services also run out of memory, as seen in tanuki.log:

grep "| java.lang.OutOfMemoryError: Java heap space" tanuki.log | head
INFO | jvm 1 | 2022/03/04 20:45:02 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 2 | 2022/03/07 03:01:18 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 3 | 2022/03/07 12:02:34 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 4 | 2022/03/07 18:21:28 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 5 | 2022/03/07 19:44:54 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 6 | 2022/03/07 20:47:07 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 7 | 2022/03/08 03:36:20 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 8 | 2022/03/08 05:35:02 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 9 | 2022/03/08 08:10:50 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 10 | 2022/03/08 11:18:34 | java.lang.OutOfMemoryError: Java heap space
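
To see how often the out-of-memory errors occur, the grep output above can be grouped by day. The following Python sketch assumes the pipe-separated line format shown in this excerpt (level, JVM number, timestamp, message); oom_per_day is a hypothetical helper name.

```python
from collections import Counter


def oom_per_day(lines):
    """Count OutOfMemoryError lines per date (YYYY/MM/DD) in tanuki.log-style output."""
    counts = Counter()
    for line in lines:
        if "java.lang.OutOfMemoryError" not in line:
            continue
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 4:
            date = fields[2].split()[0]  # timestamp field is "YYYY/MM/DD HH:MM:SS"
            counts[date] += 1
    return counts


# Lines copied from the grep output above
sample = [
    "INFO | jvm 1 | 2022/03/04 20:45:02 | java.lang.OutOfMemoryError: Java heap space",
    "INFO | jvm 2 | 2022/03/07 03:01:18 | java.lang.OutOfMemoryError: Java heap space",
    "INFO | jvm 3 | 2022/03/07 12:02:34 | java.lang.OutOfMemoryError: Java heap space",
    "INFO | jvm 4 | 2022/03/07 18:21:28 | java.lang.OutOfMemoryError: Java heap space",
    "INFO | jvm 5 | 2022/03/07 19:44:54 | java.lang.OutOfMemoryError: Java heap space",
    "INFO | jvm 6 | 2022/03/07 20:47:07 | java.lang.OutOfMemoryError: Java heap space",
    "INFO | jvm 7 | 2022/03/08 03:36:20 | java.lang.OutOfMemoryError: Java heap space",
    "INFO | jvm 8 | 2022/03/08 05:35:02 | java.lang.OutOfMemoryError: Java heap space",
    "INFO | jvm 9 | 2022/03/08 08:10:50 | java.lang.OutOfMemoryError: Java heap space",
    "INFO | jvm 10 | 2022/03/08 11:18:34 | java.lang.OutOfMemoryError: Java heap space",
]
for day, count in sorted(oom_per_day(sample).items()):
    print(day, count)
```

In this excerpt the errors go from one on 2022/03/04 to several per day on 2022/03/07 and 2022/03/08.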

Environment

VMware NSX-T Data Center 3.x

You are using IDFW and have log scraping configured with AD and Aria Operations for Logs.

Cause

This issue can occur when IDFW is configured with AD log scraping and there are too many login and logout events.
The mechanism used to clean up the table that stores these events cannot keep up with them, and eventually the table grows too big for the corfu compactor service to checkpoint, which causes the corfu compactor service to crash.
This issue can also cause other services to crash, as seen in the tanuki log above, because of the amount of memory the corfu compactor service consumes while trying to complete the compaction.

Resolution

This issue is resolved in VMware NSX-T Data Center 3.2.2.0
This issue is resolved in VMware NSX-T Data Center 3.2.3.0
This issue is resolved in VMware NSX 4.0.1.1

Workaround:
It is possible to configure the IDFW cleaner to run more often and clean up these entries, thus reducing the retention time of the events in the corfu table.
If you believe you have encountered this issue, please open a support request and reference this KB.