NSX-T UI inaccessible due to roaring bit map

Products

VMware NSX

Issue/Introduction

Corfu becomes unresponsive.
The Corfu compactor process repeatedly runs out of memory, as seen in /var/log/corfu/corfu-compactor-audit.log:

2021-04-29 16:06:57.480368: Runner: Failed to run compactor tool: Command 'nice -n -10 java -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/var/log/corfu/compactor-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+UseStringDeduplication -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/image/core -XX:+CrashOnOutOfMemoryError -Xms1931m -Xmx1931m -Djdk.nio.maxCachedBufferSize=1048576 -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.configurationFile=/opt/vmware/corfu-tools/corfu-compactor-log4j2.xml -cp "/opt/vmware/corfu-tools/corfu-compactor-1.0-jar-with-dependencies.jar:/opt/vmware/policy-tomcat/webapps/policy/WEB-INF/lib/*" com.vmware.nsx.management.tools.corfu.CorfuCompactorMain -hostname ###### -port #### -namespace nsx-policy-manager -useDistributedLock' returned non-zero exit status 134

Additional out of memory errors seen in the /var/log/corfu/corfu-compactor-audit.log resemble:

2021-06-10T15:04:12.277Z ERROR main FrameworkCorfuCompactor - - [nsx@6876 comp="nsx-manager" errorCode="MP1" level="ERROR" subcomp="corfu-compactor"] Checkpoint failed for framework data with namespace nsx-policy-manager
java.lang.OutOfMemoryError: Java heap space
    at java.util.TreeMap.put(TreeMap.java:577) ~[?:1.8.0_212]
    at java.util.TreeSet.add(TreeSet.java:255) ~[?:1.8.0_212]
    at org.corfudb.runtime.view.stream.AddressMapStreamView$$Lambda$303/498104228.accept(Unknown Source) ~[?:?]
    at org.roaringbitmap.longlong.Roaring64NavigableMap$2.accept(Roaring64NavigableMap.java:456) ~[RoaringBitmap-0.7.36.jar:?]
    at org.roaringbitmap.RunContainer.forEach(RunContainer.java:2510) ~[RoaringBitmap-0.7.36.jar:?]
    at org.roaringbitmap.RoaringBitmap.forEach(RoaringBitmap.java:1609) ~[RoaringBitmap-0.7.36.jar:?]
    at org.roaringbitmap.longlong.Roaring64NavigableMap.forEach(Roaring64NavigableMap.java:452) ~[RoaringBitmap-0.7.36.jar:?]

The /image/ partition may be 100% full, and /image/core/ contains a large number of *.hprof files.
Log entries like the following appear in /var/log/corfu/corfu.9000.*.log just before the Corfu database becomes unresponsive and the compactor process crashes:

2021-04-29T15:47:50.366Z | DEBUG | LogUnit-16 | o.c.i.LogUnitServer | log write: type: DATA, address: Token(epoch=242, sequence=2147483646), streams: {########-####-####-####-########3645}
2021-04-29T15:47:50.366Z | DEBUG | LogUnit-16 | o.c.i.LogUnitServer | log write: type: DATA, address: Token(epoch=242, sequence=2147483647), streams: {########-####-####-####-########3646}
2021-04-29T15:47:50.396Z | DEBUG | LogUnit-9 | o.c.i.LogUnitServer | log write: type: DATA, address: Token(epoch=242, sequence=2147483648), streams: {########-####-####-####-########3647}

Note: The sequence number in these log entries is the bitmap integer. In the entries above it increases from 2147483646 in the first entry to 2147483648 in the last entry, passing Integer.MAX_VALUE (2147483647).
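To check a copy of the Corfu log for this pattern, the sequence values can be extracted and compared against Integer.MAX_VALUE. The following is a minimal Python sketch, not a supported tool. The function name, the window size and the sample text (copied from the entries above) are illustrative assumptions.

```python
import re

INT32_MAX = 2_147_483_647  # Java Integer.MAX_VALUE

# Matches the Token(...) field as it appears in the log excerpts in this article.
TOKEN_RE = re.compile(r"Token\(epoch=(\d+), sequence=(\d+)\s*\)")

# Sample input copied from the log entries shown above.
SAMPLE = """\
2021-04-29T15:47:50.366Z | DEBUG | LogUnit-16 | o.c.i.LogUnitServer | log write: type: DATA, address: Token(epoch=242, sequence=2147483646), streams: {########-####-####-####-########3645}
2021-04-29T15:47:50.366Z | DEBUG | LogUnit-16 | o.c.i.LogUnitServer | log write: type: DATA, address: Token(epoch=242, sequence=2147483647), streams: {########-####-####-####-########3646}
2021-04-29T15:47:50.396Z | DEBUG | LogUnit-9 | o.c.i.LogUnitServer | log write: type: DATA, address: Token(epoch=242, sequence=2147483648), streams: {########-####-####-####-########3647}
"""


def sequences_near_limit(text, window=3):
    """Return (epoch, sequence) pairs whose sequence lies within
    `window` of Integer.MAX_VALUE, in the order they appear in `text`.

    `window` is a hypothetical tuning knob, not a value from the product.
    """
    hits = []
    for match in TOKEN_RE.finditer(text):
        epoch = int(match.group(1))
        sequence = int(match.group(2))
        if abs(sequence - INT32_MAX) <= window:
            hits.append((epoch, sequence))
    return hits


def crossed_limit(text):
    """True if any sequence number in `text` is greater than Integer.MAX_VALUE."""
    return any(int(m.group(2)) > INT32_MAX for m in TOKEN_RE.finditer(text))
```

Applied to the sample above, sequences_near_limit(SAMPLE) returns the three entries for epoch 242, and crossed_limit(SAMPLE) is true because the last entry has sequence 2147483648.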

Environment

VMware NSX Data Center 3.1.2 and lower

Cause

The /image/core/*.hprof files are created because the compactor process repeatedly runs out of memory. Each time it does, it writes a heap dump file (*.hprof) to the /image/core/ directory.
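The accumulation of heap dumps can be summarized with a short read-only check. The sketch below is an illustration in Python and assumes it runs somewhere with read access to the directory. The function name and the report keys are hypothetical, and only the /image and /image/core paths come from this article. It does not delete anything.

```python
import glob
import os
import shutil


def heap_dump_report(core_dir="/image/core", mount="/image"):
    """Count *.hprof files in `core_dir` and report how full `mount` is.

    Read-only: files are listed and sized, never modified or removed.
    """
    dumps = glob.glob(os.path.join(core_dir, "*.hprof"))
    usage = shutil.disk_usage(mount)
    percent_used = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "hprof_count": len(dumps),
        "hprof_bytes": sum(os.path.getsize(path) for path in dumps),
        "percent_used": round(percent_used, 1),
    }
```

Calling heap_dump_report() with its defaults inspects /image/core and /image. A high hprof_count together with percent_used close to 100 matches the symptoms described in this article.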
A software issue results in an integer overflow, which causes a huge bitmap to be returned. Reading this huge bitmap then exhausts the Java heap and produces the out-of-memory error.
This can happen when the sequence number is written to the same address space 3 consecutive times before reaching Integer.MAX_VALUE (2147483647).
For example, the checkpoints below are written for the same table ########-####-####-####-########43ee at sequence numbers 2147483645 and 2147483646, and then at 2147483647 (Integer.MAX_VALUE):

INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-########43ee, entries(1), cpSize(1164) bytes at snapshot Token(epoch=101, sequence=2147483645) in 76 ms

INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-########43ee, entries(1), cpSize(1164) bytes at snapshot Token(epoch=101, sequence=2147483646) in 76 ms

INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-########43ee, entries(1), cpSize(1164) bytes at snapshot Token(epoch=101, sequence=2147483647) in 76 ms

Note: The writes do not have to be for the same table. What matters is that they enter the same address space in consecutive order.
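The effect of passing Integer.MAX_VALUE can be illustrated generically. The sketch below is not Corfu code, and this article does not state where in Corfu the overflow occurs. It only shows, in Python, what happens when a value is narrowed to a signed 32-bit integer in the way a Java int would hold it. The helper name to_int32 is hypothetical.

```python
INT32_MAX = 2**31 - 1  # 2147483647, Java Integer.MAX_VALUE


def to_int32(value):
    """Narrow `value` to a signed 32-bit two's-complement integer,
    mimicking how a Java int wraps around on overflow."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    if value > INT32_MAX:
        value -= 0x100000000  # reinterpret the top bit as the sign bit
    return value


# The sequence numbers from the Corfu log entries in this article.
for sequence in (2147483646, 2147483647, 2147483648):
    print(sequence, "->", to_int32(sequence))
```

The first two values are unchanged, while 2147483648 wraps to -2147483648. Any size or range calculation that relies on the narrowed value is therefore badly wrong once the sequence passes 2147483647.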

Resolution

This issue is resolved in VMware NSX-T Data Center version 3.1.2.1.

Additional Information

To work around this issue, contact Broadcom Support and note this Article ID (367597) in the problem description.