Symptoms:
- You have NSX-T 3.2.x deployed.
- You are using IDFW and have log scraping configured with AD and Log Insight.
- The NSX-T manager UI becomes unavailable and displays the following error:
Some appliance components are not functioning properly.
Component health: SEARCH:UNKNOWN, MANAGER:UNKNOWN, NODE_MGMT:UP, UI:UP.
Error code: 101
- In the NSX-T manager CLI, as admin, the following command fails:
get cluster status
- On the NSX-T managers, as the root user, you see one or more recent core dumps:
ls -l /image/core/
total 1191400
-rw------- 1 nsx-cbm nsx-cbm 45579417 Mar 8 15:06 cbm_oom.hprof.gz
-rw------- 1 root root 230343252 Mar 8 20:17 compactor_oom.hprof.gz
-rw------- 1 corfu corfu 944060040 Mar 9 14:14 corfu_oom.hprof.gz
- In the NSX-T manager root '/' partition, you see a large number of files with names starting with: hs_err_pidXXXX.log
- XXXX represents the PID of the process and will be different on your setup.
- The Corfu layout may not include every manager in every segment, as seen in the file:
/config/corfu/LAYOUT_CURRENT.ds
"sequencers": [
"192.168.1.131:9000",
"192.168.1.133:9000",
"192.168.1.132:9000"
],
"segments": [
{
"replicationMode": "CHAIN_REPLICATION",
"start": 0,
"end": 40397089,
"stripes": [
{
"logServers": [
"192.168.1.131:9000"
]
}
]
},
{
"replicationMode": "CHAIN_REPLICATION",
"start": 40397089,
"end": 40397196,
"stripes": [
{
"logServers": [
"192.168.1.131:9000",
"192.168.1.133:9000"
]
}
]
},
{
"replicationMode": "CHAIN_REPLICATION",
"start": 40397196,
"end": 40397804,
"stripes": [
{
"logServers": [
"192.168.1.131:9000",
"192.168.1.133:9000"
]
}
]
},
{
"replicationMode": "CHAIN_REPLICATION",
"start": 40397804,
"end": -1,
"stripes": [
{
"logServers": [
"192.168.1.131:9000",
"192.168.1.133:9000",
"192.168.1.132:9000"
- From the above file, we see that the managers below do not have the complete database synced to them:
Manager 192.168.1.133 is missing from replication segment:
"start": 0,
"end": 40397089,
Manager 192.168.1.132 is missing from replication segments:
"start": 0,
"end": 40397089,
"start": 40397089,
"end": 40397196,
"start": 40397196,
"end": 40397804,
Manager 192.168.1.131 is the only one with a complete database.
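The segment comparison above can also be done programmatically. The following is a minimal Python sketch, assuming the layout file is plain JSON in the same shape as the excerpt; the function name missing_segments and the trimmed SAMPLE_LAYOUT are illustrative and are not part of NSX-T.

```python
import json

# Trimmed sample in the same shape as the excerpt above. On a manager, the
# text would be read from /config/corfu/LAYOUT_CURRENT.ds instead (assumption:
# that file parses as plain JSON).
SAMPLE_LAYOUT = """
{
  "segments": [
    {"replicationMode": "CHAIN_REPLICATION", "start": 0, "end": 40397089,
     "stripes": [{"logServers": ["192.168.1.131:9000"]}]},
    {"replicationMode": "CHAIN_REPLICATION", "start": 40397089, "end": 40397196,
     "stripes": [{"logServers": ["192.168.1.131:9000", "192.168.1.133:9000"]}]},
    {"replicationMode": "CHAIN_REPLICATION", "start": 40397196, "end": 40397804,
     "stripes": [{"logServers": ["192.168.1.131:9000", "192.168.1.133:9000"]}]},
    {"replicationMode": "CHAIN_REPLICATION", "start": 40397804, "end": -1,
     "stripes": [{"logServers": ["192.168.1.131:9000", "192.168.1.133:9000",
                                 "192.168.1.132:9000"]}]}
  ]
}
"""


def servers_in(segment):
    """Return the set of log servers listed in any stripe of one segment."""
    return {s for stripe in segment["stripes"] for s in stripe["logServers"]}


def missing_segments(layout):
    """Map each log server to the (start, end) segments it is absent from."""
    segments = layout["segments"]
    # Any server that appears in at least one segment is expected in all of them.
    all_servers = set().union(*(servers_in(seg) for seg in segments))
    gaps = {}
    for seg in segments:
        for server in sorted(all_servers - servers_in(seg)):
            gaps.setdefault(server, []).append((seg["start"], seg["end"]))
    return gaps


GAPS = missing_segments(json.loads(SAMPLE_LAYOUT))
for server, ranges in sorted(GAPS.items()):
    print(server, "is missing from segments", ranges)
```

A server that never appears in the output is present in every segment, which is the condition for holding a complete copy of the database.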
- In /var/log/corfu/corfu-compactor-audit.log, we see:
corfu-compactor-audit.9.log:2022-03-02T18:54:23.170Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 24113825 (exclusive).
corfu-compactor-audit.9.log:2022-03-02T19:09:22.956Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 24163751 (exclusive).
corfu-compactor-audit.log:2022-03-07T15:53:56.666Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
...
corfu-compactor-audit.log:2022-03-09T12:57:25.040Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T13:57:11.401Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T14:42:42.964Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T14:59:07.533Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T15:15:43.943Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(4s), log address up to 28012247 (exclusive).
- The above log entries show that the compactor is running but not trimming: the log address stays at 28012247 instead of increasing.
- In the same log, we see that the last completed checkpoints for table 3c54c60e-5a89-3f7c-9f1f-f03724af9649 contain a very large number of entries (more than 600,000), are large in size (roughly 214 MB) and took a long time to complete (about 16 minutes and about 77 minutes):
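To spot a stalled trim without reading every line, the address can be extracted from each "UFO Trim completed" entry and compared across runs. This is a minimal Python sketch, assuming the log format shown above; the helper names trim_addresses and stalled_since are hypothetical, and the sample reuses four of the lines quoted above.

```python
import re

# Four of the entries quoted above, including the grep file-name prefix.
SAMPLE_LOG = """\
corfu-compactor-audit.9.log:2022-03-02T19:09:22.956Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 24163751 (exclusive).
corfu-compactor-audit.log:2022-03-07T15:53:56.666Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T12:57:25.040Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(0s), log address up to 28012247 (exclusive).
corfu-compactor-audit.log:2022-03-09T15:15:43.943Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] UFO Trim completed, elapsed(4s), log address up to 28012247 (exclusive).
"""

# Captures the ISO timestamp and the trim address of a completed trim.
TRIM_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T[\d:.]+Z) .*UFO Trim completed.*log address up to (\d+)"
)


def trim_addresses(lines):
    """Return (timestamp, address) for every 'UFO Trim completed' line."""
    points = []
    for line in lines:
        match = TRIM_RE.search(line)
        if match:
            points.append((match.group(1), int(match.group(2))))
    return points


def stalled_since(points):
    """Return the timestamp at which the final address was first seen,
    or None if the most recent trim still advanced the address."""
    if len(points) < 2 or points[-1][1] != points[-2][1]:
        return None
    last_address = points[-1][1]
    since = points[-1][0]
    for timestamp, address in reversed(points):
        if address != last_address:
            break
        since = timestamp
    return since


POINTS = trim_addresses(SAMPLE_LOG.splitlines())
print("trim address unchanged since:", stalled_since(POINTS))
```

The sketch assumes the lines are supplied in chronological order, as in the excerpt above.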
2022-03-03T15:29:29.765Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649, entries(621841), cpSize(213995471) bytes at snapshot Token(epoch=13, sequence=27948473) in 934074 ms
2022-03-03T16:57:20.892Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649, entries(619376), cpSize(213145572) bytes at snapshot Token(epoch=13, sequence=28017532) in 4606838 ms
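The raw byte and millisecond figures in these entries are easier to judge once converted. The following is a minimal Python sketch, assuming the CheckpointWriter line format shown above; parse_checkpoints is a hypothetical helper name, and the sample is the two lines quoted above.

```python
import re

# The two checkpoint-completion lines quoted above.
SAMPLE_LOG = """\
2022-03-03T15:29:29.765Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649, entries(621841), cpSize(213995471) bytes at snapshot Token(epoch=13, sequence=27948473) in 934074 ms
2022-03-03T16:57:20.892Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649, entries(619376), cpSize(213145572) bytes at snapshot Token(epoch=13, sequence=28017532) in 4606838 ms
"""

# Captures table UUID, entry count, checkpoint size in bytes, duration in ms.
CHECKPOINT_RE = re.compile(
    r"completed checkpoint for ([0-9a-f-]+), entries\((\d+)\), "
    r"cpSize\((\d+)\) bytes .* in (\d+) ms"
)


def parse_checkpoints(lines):
    """Return one dict per completed checkpoint with human-friendly units."""
    results = []
    for line in lines:
        match = CHECKPOINT_RE.search(line)
        if not match:
            continue
        table, entries, size_bytes, millis = match.groups()
        results.append({
            "table": table,
            "entries": int(entries),
            "size_mb": int(size_bytes) / 1_000_000,  # decimal megabytes
            "minutes": int(millis) / 60_000,
        })
    return results


CHECKPOINTS = parse_checkpoints(SAMPLE_LOG.splitlines())
for cp in CHECKPOINTS:
    print(f'{cp["table"]}: {cp["entries"]} entries, '
          f'{cp["size_mb"]:.0f} MB, {cp["minutes"]:.1f} min')
```

Sorting the result by "minutes" or "size_mb" across a whole log is one way to find the tables that dominate checkpoint time.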
- We can see the Corfu compactor service crashing with an out-of-memory error:
2022-03-03T17:33:09.152Z INFO main UfoCompactor - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="corfu-compactor"] Starting checkpoint namespace: nsx, tableName: LoginLogoutEvent
2022-03-03T17:33:09.152Z INFO main MultiCheckpointWriter - appendCheckpoints: appending checkpoints for 1 maps
2022-03-03T17:33:09.164Z INFO main CheckpointWriter - appendCheckpoint: Started checkpoint for 3c54c60e-5a89-3f7c-9f1f-f03724af9649 at snapshot Token(epoch=13, sequence=28250089)
......
Aborting due to java.lang.OutOfMemoryError: Java heap space
......
Aborted (core dumped)
2022-03-03T17:47:22.761Z INFO Runner - Failed to run compactor tool: Command 'MALLOC_TRIM_THRESHOLD_=1310720 nice -n -10 java -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/var/log/corfu/compactor-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+UseStringDeduplication -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/image/core/compactor_oom.hprof -XX:OnOutOfMemoryError="gzip -f /image/core/compactor_oom.hprof" -XX:+CrashOnOutOfMemoryError -Xms963m -Xmx963m -Djava.io.tmpdir=/image/corfu-tools/temp -Djdk.nio.maxCachedBufferSize=1048576 -Dio.netty.recycler.maxCapacityPerThread=0 -DlogFilePrefix=/var/log/corfu/corfu-compactor-audit -Dlog4j.configurationFile=/opt/vmware/ufo-tools/corfu-compactor-log4j2.xml -Dcorfu-property-file-path=/opt/vmware/cbm/etc/ufo-factory.properties -cp "/opt/vmware/ufo-tools/*" com.vmware.nsx.platform.ufo.UfoCompactorMain -hostname 10.1.1.132 -hostname 10.1.1.133 -hostname 10.1.1.131 -port 9000 -trim -useDistributedLock -lockCorfuHostname 10.1.1.131 -lockCorfuPort 9000 -bulkReadSize 50' returned non-zero exit status 134.
- Other services are also running out of memory:
grep "| java.lang.OutOfMemoryError: Java heap space" tanuki.log | head
INFO | jvm 1 | 2022/03/04 20:45:02 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 2 | 2022/03/07 03:01:18 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 3 | 2022/03/07 12:02:34 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 4 | 2022/03/07 18:21:28 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 5 | 2022/03/07 19:44:54 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 6 | 2022/03/07 20:47:07 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 7 | 2022/03/08 03:36:20 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 8 | 2022/03/08 05:35:02 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 9 | 2022/03/08 08:10:50 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 10 | 2022/03/08 11:18:34 | java.lang.OutOfMemoryError: Java heap space
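To gauge how often this happens, the matches can be grouped by day. This is a minimal Python sketch, assuming the tanuki.log line format shown above and assuming the number after "jvm" identifies the JVM instance launched by the service wrapper; summarize_oom is a hypothetical helper name, and the sample is the grep output quoted above.

```python
import re
from collections import Counter

# The grep output quoted above.
SAMPLE_LOG = """\
INFO | jvm 1 | 2022/03/04 20:45:02 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 2 | 2022/03/07 03:01:18 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 3 | 2022/03/07 12:02:34 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 4 | 2022/03/07 18:21:28 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 5 | 2022/03/07 19:44:54 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 6 | 2022/03/07 20:47:07 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 7 | 2022/03/08 03:36:20 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 8 | 2022/03/08 05:35:02 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 9 | 2022/03/08 08:10:50 | java.lang.OutOfMemoryError: Java heap space
INFO | jvm 10 | 2022/03/08 11:18:34 | java.lang.OutOfMemoryError: Java heap space
"""

# Captures the JVM instance number and the date of a heap-space OOM line.
OOM_RE = re.compile(
    r"\|\s*jvm\s+(\d+)\s*\|\s*(\d{4}/\d{2}/\d{2})\s+[\d:]+\s*\|\s*"
    r"java\.lang\.OutOfMemoryError: Java heap space"
)


def summarize_oom(lines):
    """Return (OOM count per day, set of JVM instance numbers that hit OOM)."""
    per_day = Counter()
    instances = set()
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            instances.add(int(match.group(1)))
            per_day[match.group(2)] += 1
    return per_day, instances


PER_DAY, INSTANCES = summarize_oom(SAMPLE_LOG.splitlines())
for day in sorted(PER_DAY):
    print(day, PER_DAY[day], "heap-space OOM error(s)")
print(len(INSTANCES), "distinct JVM instances affected")
```

In the sample, every matching line carries a different JVM number, which is consistent with the service being restarted after each out-of-memory failure.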