NSX-T UI inaccessible and corfu cluster is down

Article ID: 317760

Products

VMware NSX

Issue/Introduction

  • Corfu becomes unresponsive.
  • The Corfu compactor process repeatedly encounters out-of-memory errors, as seen in /var/log/corfu/corfu-compactor-audit.log:
2021-04-29 16:06:57.480368: Runner: Failed to run compactor tool: Command 'nice -n -10 java -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/var/log/corfu/compactor-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+UseStringDeduplication -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/image/core -XX:+CrashOnOutOfMemoryError -Xms1931m -Xmx1931m -Djdk.nio.maxCachedBufferSize=1048576 -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.configurationFile=/opt/vmware/corfu-tools/corfu-compactor-log4j2.xml -cp "/opt/vmware/corfu-tools/corfu-compactor-1.0-jar-with-dependencies.jar:/opt/vmware/policy-tomcat/webapps/policy/WEB-INF/lib/*" com.vmware.nsx.management.tools.corfu.CorfuCompactorMain -hostname <IP address> -port 9000 -namespace nsx-policy-manager -useDistributedLock' returned non-zero exit status 134
  • Additional out-of-memory errors in /var/log/corfu/corfu-compactor-audit.log resemble:
2021-06-10T15:04:12.277Z ERROR main FrameworkCorfuCompactor - - [nsx@6876 comp="nsx-manager" errorCode="MP1" level="ERROR" subcomp="corfu-compactor"] Checkpoint failed for framework data with namespace nsx-policy-manager
java.lang.OutOfMemoryError: Java heap space
    at java.util.TreeMap.put(TreeMap.java:577) ~[?:1.8.0_212]
    at java.util.TreeSet.add(TreeSet.java:255) ~[?:1.8.0_212]
    at org.corfudb.runtime.view.stream.AddressMapStreamView$$Lambda$303/498104228.accept(Unknown Source) ~[?:?]
    at org.roaringbitmap.longlong.Roaring64NavigableMap$2.accept(Roaring64NavigableMap.java:456) ~[RoaringBitmap-0.7.36.jar:?]
    at org.roaringbitmap.RunContainer.forEach(RunContainer.java:2510) ~[RoaringBitmap-0.7.36.jar:?]
    at org.roaringbitmap.RoaringBitmap.forEach(RoaringBitmap.java:1609) ~[RoaringBitmap-0.7.36.jar:?]
    at org.roaringbitmap.longlong.Roaring64NavigableMap.forEach(Roaring64NavigableMap.java:452) ~[RoaringBitmap-0.7.36.jar:?]
  • The /image/ partition may be 100% full, and a large number of *.hprof files can be seen in /image/core/.
  • Log entries like the following can be found in /var/log/corfu/corfu.9000.*.log just before the Corfu database becomes unresponsive and the compactor process crashes:
2021-04-29T15:47:50.366Z | DEBUG | LogUnit-16 | o.c.i.LogUnitServer | log write: type: DATA, address: Token(epoch=242, sequence=2147483646), streams: {5d8c74bd-####-####-####-d73f9121ee62=2147483645}
2021-04-29T15:47:50.366Z | DEBUG | LogUnit-16 | o.c.i.LogUnitServer | log write: type: DATA, address: Token(epoch=242, sequence=2147483647), streams: {5d8c74bd-####-####-####-d73f9121ee62=2147483646}
2021-04-29T15:47:50.396Z | DEBUG | LogUnit-9 | o.c.i.LogUnitServer | log write: type: DATA, address: Token(epoch=242, sequence=2147483648), streams: {5d8c74bd-####-####-####-d73f9121ee62=2147483647}
 
Note: In the log entries above, the sequence number increases from 2147483646 in the first entry to 2147483648 in the last entry, passing Integer.MAX_VALUE (2147483647).
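
The wrap-around behind this symptom can be illustrated with a short Java sketch. This is only an illustration of 32-bit integer overflow, assuming the value is narrowed to an int somewhere in the affected code path; the class and variable names below are hypothetical and this is not the actual Corfu code.

public class SequenceOverflowIllustration {
    public static void main(String[] args) {
        // Corfu tokens carry a 64-bit sequence, so 2147483648 fits in a long...
        long sequence = 2_147_483_648L; // the value seen in the last log entry above

        // ...but wherever the value is treated as a 32-bit int, it wraps past
        // Integer.MAX_VALUE (2147483647) to a large negative number.
        int wrapped = (int) sequence;

        System.out.println("sequence as long  : " + sequence);          // 2147483648
        System.out.println("sequence as int   : " + wrapped);           // -2147483648
        System.out.println("Integer.MAX_VALUE : " + Integer.MAX_VALUE); // 2147483647
    }
}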

Environment

VMware NSX-T Data Center 3.x

Cause

  • The /image/core/*.hprof files are created because the compactor process repeatedly runs out of memory; each time it does, it writes a heap dump file (*.hprof) to the /image/core/ directory.
  • A software issue results in an integer overflow, which causes a huge bitmap to be returned; when this huge bitmap is read back, the out-of-memory error occurs (see the sketch after this list).
  • This can happen when the sequence number is written to the same address space three consecutive times before hitting Integer.MAX_VALUE (2147483647).
  • For example, in the log entries below, the sequence numbers 2147483644, 2147483645 and 2147483646 increase on the same table 4251f216-####-####-####-9fe0df2243ee before hitting sequence 2147483647 (Integer.MAX_VALUE):
INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 4251f216-####-####-####-9fe0df2243ee, entries(1), cpSize(1164) bytes at snapshot Token(epoch=101, sequence=2147483644) in 76 ms
INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 4251f216-####-####-####-9fe0df2243ee, entries(1), cpSize(1164) bytes at snapshot Token(epoch=101, sequence=2147483645) in 76 ms
INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 4251f216-####-####-####-9fe0df2243ee, entries(1), cpSize(1164) bytes at snapshot Token(epoch=101, sequence=2147483646) in 76 ms
INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for 4251f216-####-####-####-9fe0df2243ee, entries(1), cpSize(1164) bytes at snapshot Token(epoch=101, sequence=2147483647) in 76 ms
 
Note: The writes do not have to be to the same table; what matters is that the same address space is written to in consecutive order.
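
The following minimal Java sketch is an illustration only, not Corfu code, and its class and variable names are hypothetical. It shows why reading back such a huge bitmap exhausts the heap: materializing billions of resolved addresses into an in-memory set, as the AddressMapStreamView stack trace above does via TreeSet.add, consumes far more memory than the compactor's configured heap (-Xmx1931m).

import java.util.TreeSet;

public class HugeBitmapReadIllustration {
    public static void main(String[] args) {
        // Hypothetical illustration: after the overflow, the stream's address map
        // reports an enormous range of addresses. Copying them into an in-memory
        // set (compare TreeSet.add in the stack trace above) exhausts the heap.
        TreeSet<Long> resolvedAddresses = new TreeSet<>();
        long start = 0L;
        long end = 2_147_483_648L; // ~2.1 billion addresses, far beyond a 1931 MB heap

        for (long address = start; address < end; address++) {
            resolvedAddresses.add(address); // eventually: java.lang.OutOfMemoryError: Java heap space
        }
    }
}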

Resolution

This issue is resolved in NSX-T versions 3.0.3.1 and 3.1.2.1.

Note: NSX-T 3.0.3.2 does not contain this fix.

Workaround:
To work around this issue, contact Broadcom Support and note this Article ID (317760) in the problem description.