NSX-T UI becomes unavailable with Error code 101 when using AD with IDFW

Article ID: 322569

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • You have NSX-T 3.2.x deployed.
  • You are using Identity Firewall (IDFW) and have log scraping configured with AD and Log Insight.
  • The NSX-T Manager UI becomes unavailable and displays the following error:
Some appliance components are not functioning properly. 
Component health: SEARCH:UNKNOWN, MANAGER:UNKNOWN, NODE_MGMT:UP, UI:UP.
Error code: 101
  • In the NSX-T Manager CLI, as admin, the following command fails:
get cluster status
  • On the NSX-T Managers, as the root user, you see one or more recent core dumps:
ls -l /image/core/
total 1191400
-rw------- 1 nsx-cbm nsx-cbm  45579417 Mar  8 15:06 cbm_oom.hprof.gz
-rw------- 1 root    root    230343252 Mar  8 20:17 compactor_oom.hprof.gz
-rw------- 1 corfu   corfu   944060040 Mar  9 14:14 corfu_oom.hprof.gz
  • In the root '/' partition of the NSX-T Manager, you see a large number of files with names of the form: hs_err_pidXXXX.log
    • XXXX is the PID of the process and will be different in your environment.
  • The Corfu layout may not be complete for all managers, as seen in the file:
/config/corfu/LAYOUT_CURRENT.ds
  "sequencers": [
    "192.168.1.131:9000",
    "192.168.1.133:9000",
    "192.168.1.132:9000"
  ],
  "segments": [
    {
      "replicationMode": "CHAIN_REPLICATION",
      "start": 0,
      "end": 40397089,
      "stripes": [
        {
          "logServers": [
            "192.168.1.131:9000"
          ]
        }
      ]
    },
    {
      "replicationMode": "CHAIN_REPLICATION",
      "start": 40397089,
      "end": 40397196,
      "stripes": [
        {
          "logServers": [
            "192.168.1.131:9000",
            "192.168.1.133:9000"
          ]
        }
      ]
    },
    {
      "replicationMode": "CHAIN_REPLICATION",
      "start": 40397196,
      "end": 40397804,
      "stripes": [
        {
          "logServers": [
            "192.168.1.131:9000",
            "192.168.1.133:9000"
          ]
        }
      ]
    },
    {
      "replicationMode": "CHAIN_REPLICATION",
      "start": 40397804,
      "end": -1,
      "stripes": [
        {
          "logServers": [
            "192.168.1.131:9000",
            "192.168.1.133:9000",
            "192.168.1.132:9000"
  • From the above file, we see that the following managers do not have the complete database synced to them:
Manager 192.168.1.133 is missing from replication:
      "start": 0,
      "end": 40397089,
Manager 192.168.1.132 is missing from replication:
      "start": 0,
      "end": 40397089,
And replication:
      "start": 40397089,
      "end": 40397196,
And replication:
      "start": 40397196,
      "end": 40397804,
Manager 192.168.1.131 is the only one with a complete database.
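The same check can be scripted. The following is a minimal sketch, not a supported tool: it assumes the layout file parses as JSON, that the "sequencers" list names every manager that should hold the data, and it uses only the fields visible in the excerpt above. The helper names (load_layout, missing_log_servers, _segment) are hypothetical.

```python
import json


def load_layout(path="/config/corfu/LAYOUT_CURRENT.ds"):
    """Read the layout file, assuming it is plain JSON as the excerpt suggests."""
    with open(path) as handle:
        return json.load(handle)


def missing_log_servers(layout):
    """Return (start, end, missing) for each segment lacking an expected node.

    Assumption: every node listed under "sequencers" should also appear
    as a log server in every segment.
    """
    expected = set(layout["sequencers"])
    gaps = []
    for segment in layout["segments"]:
        present = set()
        for stripe in segment["stripes"]:
            present.update(stripe["logServers"])
        missing = sorted(expected - present)
        if missing:
            gaps.append((segment["start"], segment["end"], missing))
    return gaps


def _segment(start, end, servers):
    # Hypothetical helper: builds one segment with a single stripe.
    return {"start": start, "end": end, "stripes": [{"logServers": servers}]}


# Sample data copied from the layout excerpt above.
N131, N132, N133 = ("192.168.1.131:9000", "192.168.1.132:9000",
                    "192.168.1.133:9000")
SAMPLE_LAYOUT = {
    "sequencers": [N131, N133, N132],
    "segments": [
        _segment(0, 40397089, [N131]),
        _segment(40397089, 40397196, [N131, N133]),
        _segment(40397196, 40397804, [N131, N133]),
        _segment(40397804, -1, [N131, N133, N132]),
    ],
}

for start, end, missing in missing_log_servers(SAMPLE_LAYOUT):
    print(f"segment {start}..{end}: missing {', '.join(missing)}")
```

On a manager, the sample data could be replaced with load_layout(), provided the file is readable and valid JSON. A segment with "end": -1 is the open, current segment.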
  • In /var/log/corfu-compactor-audit.log, we see:
corfu-compactor-audit.9.log:2022-03-02T18:54:23.170Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 24113825 (exclusive).
corfu-compactor-audit.9.log:2022-03-02T19:09:22.956Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 24163751 (exclusive).
corfu-compactor-audit.log:2022-03-07T15:53:56.666Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
...
corfu-compactor-audit.log:2022-03-09T12:57:25.040Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T13:57:11.401Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T14:42:42.964Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T14:59:07.533Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T15:15:43.943Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(4s), log address up to 28012247 (exclusive).
  • The above log entries imply that the compactor is running but not trimming, because the log address is no longer increasing.
  • In the same log, we see that the last completed checkpoints for table 3c54c60e-5a89-3f7c-9f1f-f03724af9649 contain a very large number of entries, are large in size, and took a long time to complete:
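A stalled trim address can also be detected programmatically. The sketch below is illustrative only: it assumes the "UFO Trim completed" line format shown in the excerpt, and the function names (trim_addresses, stalled_since) are hypothetical. The sample lines are shortened copies of the entries above.

```python
import re

# Matches the timestamp and the trim address in a "UFO Trim completed" line.
TRIM_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z).*"
    r"UFO Trim completed.*log address up to (\d+)"
)


def trim_addresses(lines):
    """Yield (timestamp, address) for every trim-completed line, in order."""
    for line in lines:
        match = TRIM_RE.search(line)
        if match:
            yield match.group(1), int(match.group(2))


def stalled_since(lines):
    """Describe the trailing run of identical trim addresses.

    Returns (first_timestamp, address, run_length), or None when no trim
    lines are found. A run_length above 1 means the address did not move
    between consecutive compactor runs.
    """
    points = list(trim_addresses(lines))
    if not points:
        return None
    last_address = points[-1][1]
    run_length = 0
    first_timestamp = None
    for timestamp, address in reversed(points):
        if address != last_address:
            break
        run_length += 1
        first_timestamp = timestamp
    return first_timestamp, last_address, run_length


# Shortened sample lines taken from the excerpt above.
SAMPLE_LINES = [
    "corfu-compactor-audit.9.log:2022-03-02T18:54:23.170Z  INFO main UfoCompactor"
    " - - UFO Trim completed, elapsed(0s), log address up to 24113825 (exclusive).",
    "corfu-compactor-audit.9.log:2022-03-02T19:09:22.956Z  INFO main UfoCompactor"
    " - - UFO Trim completed, elapsed(0s), log address up to 24163751 (exclusive).",
    "corfu-compactor-audit.log:2022-03-07T15:53:56.666Z  INFO main UfoCompactor"
    " - - UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).",
    "corfu-compactor-audit.log:2022-03-09T15:15:43.943Z  INFO main UfoCompactor"
    " - - UFO Trim completed, elapsed(4s), log address up to 28012247 (exclusive).",
]

print(stalled_since(SAMPLE_LINES))
```

In the sample, the address advances between the first two runs and then stays at 28012247 from 2022-03-07 onward.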
2022-03-03T15:29:29.765Z  INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649, entries(621841), cpSize(213995471) bytes at snapshot Token(epoch=13, sequence=27948473) in 934074 ms
2022-03-03T16:57:20.892Z  INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649, entries(619376), cpSize(213145572) bytes at snapshot Token(epoch=13, sequence=28017532) in 4606838 ms
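To put these numbers in perspective, the two checkpoints above are roughly 204 MiB and 203 MiB in size and took about 15.6 and 76.8 minutes, while the compactor JVM in this example runs with a 963 MB heap (-Xmx963m in the command line shown further below). The following sketch converts such lines into readable units. It assumes the "completed checkpoint" line format shown above, and parse_checkpoint is a hypothetical helper name.

```python
import re

# Extracts table ID, entry count, size in bytes and duration in ms.
CHECKPOINT_RE = re.compile(
    r"completed checkpoint for (?P<table>[0-9a-f-]+), "
    r"entries\((?P<entries>\d+)\), cpSize\((?P<size>\d+)\) bytes "
    r".* in (?P<ms>\d+) ms"
)


def parse_checkpoint(line):
    """Return a dict describing one completed checkpoint, or None."""
    match = CHECKPOINT_RE.search(line)
    if match is None:
        return None
    return {
        "table": match["table"],
        "entries": int(match["entries"]),
        "size_mib": int(match["size"]) / 2**20,
        "minutes": int(match["ms"]) / 60000,
    }


# Sample lines copied from the excerpt above.
SAMPLE_CHECKPOINTS = [
    "2022-03-03T15:29:29.765Z  INFO main CheckpointWriter - appendCheckpoint: "
    "completed checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649, "
    "entries(621841), cpSize(213995471) bytes at snapshot "
    "Token(epoch=13, sequence=27948473) in 934074 ms",
    "2022-03-03T16:57:20.892Z  INFO main CheckpointWriter - appendCheckpoint: "
    "completed checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649, "
    "entries(619376), cpSize(213145572) bytes at snapshot "
    "Token(epoch=13, sequence=28017532) in 4606838 ms",
]

for line in SAMPLE_CHECKPOINTS:
    info = parse_checkpoint(line)
    print(f"{info['table']}: {info['entries']} entries, "
          f"{info['size_mib']:.1f} MiB, {info['minutes']:.1f} min")
```

A table whose checkpoint duration keeps growing between runs is a candidate for the unbounded growth described in this article.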
  • We can see the corfu compactor service crashing:
2022-03-03T17:33:09.152Z  INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] Starting checkpoint namespace: nsx, tableName: LoginLogoutEvent
2022-03-03T17:33:09.152Z  INFO main MultiCheckpointWriter - appendCheckpoints: appending checkpoints for 1 maps
2022-03-03T17:33:09.164Z  INFO main CheckpointWriter - appendCheckpoint: Started checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649 at snapshot Token(epoch=13, sequence=28250089)
......
Aborting due to java.lang.OutOfMemoryError: Java heap space
...... 
Aborted (core dumped)
2022-03-03T17:47:22.761Z  INFO Runner - Failed to run compactor tool: Command 'MALLOC_TRIM_THRESHOLD_=1310720 nice -n -10 java -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/var/log/corfu/compactor-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+UseStringDeduplication -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/image/core/compactor_oom.hprof -XX:OnOutOfMemoryError="gzip -f /image/core/compactor_oom.hprof" -XX:+CrashOnOutOfMemoryError -Xms963m -Xmx963m -Djava.io.tmpdir=/image/corfu-tools/temp -Djdk.nio.maxCachedBufferSize=1048576 -Dio.netty.recycler.maxCapacityPerThread=0 -DlogFilePrefix=/var/log/corfu/corfu-compactor-audit -Dlog4j.configurationFile=/opt/vmware/ufo-tools/corfu-compactor-log4j2.xml -Dcorfu-property-file-path=/opt/vmware/cbm/etc/ufo-factory.properties -cp "/opt/vmware/ufo-tools/*" com.vmware.nsx.platform.ufo.UfoCompactorMain -hostname 10.1.1.132 -hostname 10.1.1.133 -hostname 10.1.1.131 -port 9000 -trim -useDistributedLock -lockCorfuHostname 10.1.1.131 -lockCorfuPort 9000 -bulkReadSize 50' returned non-zero exit status 134.
  • And other services are also running out of memory:
grep "| java.lang.OutOfMemoryError: Java heap space" tanuki.log | head
INFO   | jvm 1    | 2022/03/04 20:45:02 | java.lang.OutOfMemoryError: Java heap space
INFO   | jvm 2    | 2022/03/07 03:01:18 | java.lang.OutOfMemoryError: Java heap space
INFO   | jvm 3    | 2022/03/07 12:02:34 | java.lang.OutOfMemoryError: Java heap space
INFO   | jvm 4    | 2022/03/07 18:21:28 | java.lang.OutOfMemoryError: Java heap space
INFO   | jvm 5    | 2022/03/07 19:44:54 | java.lang.OutOfMemoryError: Java heap space
INFO   | jvm 6    | 2022/03/07 20:47:07 | java.lang.OutOfMemoryError: Java heap space
INFO   | jvm 7    | 2022/03/08 03:36:20 | java.lang.OutOfMemoryError: Java heap space
INFO   | jvm 8    | 2022/03/08 05:35:02 | java.lang.OutOfMemoryError: Java heap space
INFO   | jvm 9    | 2022/03/08 08:10:50 | java.lang.OutOfMemoryError: Java heap space
INFO   | jvm 10   | 2022/03/08 11:18:34 | java.lang.OutOfMemoryError: Java heap space
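To see how often the out-of-memory condition recurs, the matching tanuki.log lines can be grouped by day. This is a minimal sketch that assumes the pipe-separated line format shown above; oom_per_day is a hypothetical helper name, and the sample lines are copied from the grep output.

```python
from collections import Counter

OOM_MARKER = "| java.lang.OutOfMemoryError: Java heap space"


def oom_per_day(lines):
    """Count heap-space OutOfMemoryError lines per calendar day."""
    counts = Counter()
    for line in lines:
        if OOM_MARKER not in line:
            continue
        # Fields: level | "jvm N" | "YYYY/MM/DD HH:MM:SS" | message
        fields = [field.strip() for field in line.split("|")]
        day = fields[2].split()[0]
        counts[day] += 1
    return dict(sorted(counts.items()))


SAMPLE_TANUKI = [
    "INFO   | jvm 1    | 2022/03/04 20:45:02 | java.lang.OutOfMemoryError: Java heap space",
    "INFO   | jvm 2    | 2022/03/07 03:01:18 | java.lang.OutOfMemoryError: Java heap space",
    "INFO   | jvm 3    | 2022/03/07 12:02:34 | java.lang.OutOfMemoryError: Java heap space",
    "INFO   | jvm 4    | 2022/03/07 18:21:28 | java.lang.OutOfMemoryError: Java heap space",
    "INFO   | jvm 5    | 2022/03/07 19:44:54 | java.lang.OutOfMemoryError: Java heap space",
    "INFO   | jvm 6    | 2022/03/07 20:47:07 | java.lang.OutOfMemoryError: Java heap space",
    "INFO   | jvm 7    | 2022/03/08 03:36:20 | java.lang.OutOfMemoryError: Java heap space",
    "INFO   | jvm 8    | 2022/03/08 05:35:02 | java.lang.OutOfMemoryError: Java heap space",
    "INFO   | jvm 9    | 2022/03/08 08:10:50 | java.lang.OutOfMemoryError: Java heap space",
    "INFO   | jvm 10   | 2022/03/08 11:18:34 | java.lang.OutOfMemoryError: Java heap space",
]

for day, count in oom_per_day(SAMPLE_TANUKI).items():
    print(day, count)
```

In the sample, the errors rise from one on 2022/03/04 to five on 2022/03/07 and four on 2022/03/08 within the lines shown. The increasing "jvm N" counter suggests that the service JVM was relaunched after each failure.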


Environment

VMware NSX-T Data Center 3.x
VMware NSX-T Data Center

Cause

This issue can occur when IDFW is configured with AD log scraping and there are too many login and logout events.
The mechanism that cleans up the LoginLogoutEvent table cannot keep up with the incoming events, and the table eventually grows too big for the corfu compactor service to checkpoint, which causes the corfu compactor service to crash.
This issue can also cause other services to crash, as seen in the tanuki log above, because of the amount of memory the corfu compactor service consumes while trying to complete the compaction.

Resolution

This is a known issue impacting NSX-T Data Center.

Workaround:
It is possible to make the IDFW cleaner run more often so that it cleans up these entries sooner, which reduces the retention time of the events in the corfu table.
If you believe you have encountered this issue, please open a support request and reference this KB.