# grep -i "completed checkpoint for ########-####-####-####-##########e9" /var/log/corfu/corfu-compactor-audit.log
2023-06-01T20:34:21.337Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(1213644), cpSize(1180152103) bytes at snapshot Token(epoch=1192, sequence=5313828428) in 333111 ms
2023-06-01T20:54:05.308Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(1213681), cpSize(1180184464) bytes at snapshot Token(epoch=1192, sequence=5313956408) in 323643 ms
2023-06-01T21:09:31.314Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(1213681), cpSize(1180184464) bytes at snapshot Token(epoch=1192, sequence=5314047599) in 338404 ms
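The checkpoint lines above are easier to compare once they are reduced to a few columns. The following is a minimal sketch, assuming the log line layout shown above; `checkpoint_summary` is a hypothetical helper name, not an NSX or Corfu tool. It reads matching lines on stdin and prints the timestamp, entry count, checkpoint size in MiB and duration in seconds.

```shell
# Hypothetical helper (not shipped with NSX): summarize CheckpointWriter
# "completed checkpoint" lines read from stdin.
checkpoint_summary() {
  LC_ALL=C awk '
    /completed checkpoint for/ {
      ts = ""; entries = 0; size = 0; ms = 0
      # First ISO-8601 UTC timestamp on the line (grep may prefix a file name).
      if (match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T[0-9:.]+Z/))
        ts = substr($0, RSTART, RLENGTH)
      # entries(N): skip the 8-character "entries(" prefix and the closing ")".
      if (match($0, /entries\([0-9]+\)/))
        entries = substr($0, RSTART + 8, RLENGTH - 9)
      # cpSize(N): skip the 7-character "cpSize(" prefix and the closing ")".
      if (match($0, /cpSize\([0-9]+\)/))
        size = substr($0, RSTART + 7, RLENGTH - 8)
      # " in N ms": skip the leading " in " and the trailing " ms".
      if (match($0, / in [0-9]+ ms/))
        ms = substr($0, RSTART + 4, RLENGTH - 7)
      printf "%s entries=%d size_mib=%.1f duration_s=%.1f\n", ts, entries, size / 1048576, ms / 1000
    }'
}
```

It can be fed from the grep shown above, for example `grep -i "completed checkpoint for" /var/log/corfu/corfu-compactor-audit.log | checkpoint_summary`. An entry count and size that stay large from one checkpoint to the next point to a table that compaction is not shrinking.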
# corfu_tool_runner.py -o showTable -n nsx -t GenericPolicyRealizedResource > gprr.txt
# grep stringId gprr.txt | awk '{print $2}' | cut -d "/" -f 1-7 | sort | uniq -c | sort -nr | head
642733 "/infra/realized-state/enforcement-points/default/security/port-security-profile-binding-maps
322725 "/infra/realized-state/enforcement-points/default/discovery/mac-discovery-profiles
7634 "/infra/realized-state/enforcement-points/default/services/nsservices
1790 "/infra/realized-state/enforcement-points/default/groups/nsgroups
1043 "/infra/realized-state/enforcement-points/default/firewalls/firewall-sections
423 "/infra/realized-state/enforcement-points/default/dhcp-servers/dhcp-server-<UUID>
244 "/infra/realized-state/enforcement-points/default/dhcp-servers/dhcp-server-<UUID>
73 "/infra/realized-state/enforcement-points/default/ops/ipfix-dfw-profiles
49 "/infra/realized-state/enforcement-points/default/dhcp-servers/dhcp-server-<UUID>
44 "/infra/realized-state/enforcement-points/default/dhcp-servers/dhcp-server-<UUID>
/config usage will consistently grow if the GPRR table becomes too large and Corfu compaction is failing. Beyond 10%, alarms are thrown in the NSX UI and the UI can become inaccessible:
# df -h
Filesystem Size Used Avail Use% Mounted on
udev 24G 0 24G 0% /dev
tmpfs 4.8G 7.4M 4.8G 1% /run
/dev/sda2 11G 7.1G 2.7G 74% /
tmpfs 24G 616K 24G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 24G 0 24G 0% /sys/fs/cgroup
/dev/sda1 930M 8.3M 857M 1% /boot
/dev/mapper/nsx-repository 31G 7.0G 22G 25% /repository
/dev/mapper/nsx-var+dump 9.2G 296M 8.4G 4% /var/dump
/dev/mapper/nsx-tmp 3.7G 9.9M 3.5G 1% /tmp
/dev/mapper/nsx-config 29G 13G 15G 46% /config
/dev/mapper/nsx-image 42G 6.0G 34G 16% /image
/dev/mapper/nsx-secondary 98G 3.8G 90G 5% /nonconfig
/dev/mapper/nsx-var+log 27G 15G 11G 59% /var/log
tmpfs 4.8G 0 4.8G 0% /run/user/1007
tmpfs 4.8G 0 4.8G 0% /run/user/0
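To watch a single mount point rather than read the whole table, the `df` output can be filtered. This is a minimal sketch: `check_mount_usage` is a hypothetical helper name, and the threshold is supplied by the caller as an assumption rather than a value defined by NSX. It reads `df` output on stdin and matches the mount point exactly, so `/config_bak` is not mistaken for `/config`.

```shell
# Hypothetical helper: report the Use% of one mount point from df output on stdin.
#   $1 = mount point (exact match on the last column)
#   $2 = threshold in percent (caller's choice, an assumption)
# Exit status is 2 when the mount point does not appear in the input.
check_mount_usage() {
  LC_ALL=C awk -v mp="$1" -v limit="$2" '
    $NF == mp {
      pct = $(NF - 1)          # the Use% column, e.g. "46%"
      sub(/%$/, "", pct)       # strip the trailing percent sign
      status = (pct + 0 > limit + 0) ? "ABOVE" : "OK"
      printf "%s %d%% %s\n", mp, pct, status
      found = 1
    }
    END { if (!found) exit 2 }'
}
```

A typical call would be `df -P | check_mount_usage /config 40`, where `-P` keeps each filesystem on one line.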
/var/log/corfu/corfu-compactor-audit.log shows Corfu database compaction failing with OutOfMemoryError:
2023-04-07T16:12:43.913Z INFO metrics-logger-reporter-1-thread-1 metricsdata - type=TIMER, name=com.vmware.nsx.platform.clustering.persistence.corfu.CorfuDbDataStoreUfo.create, count=1, min=1448.6027609999999, max=1448.6027609999999, mean=1448.6027609999999, stddev=0.0, median=1448.6027609999999, p75=1448.6027609999999, p95=1448.6027609999999, p98=1448.6027609999999, p99=1448.6027609999999, p999=1448.6027609999999, mean_rate=0.0017346739786195718, m1=1.4970365977540202E-5, m5=0.029913723844527035, m15=0.10616389011240257, rate_unit=events/second, duration_unit=milliseconds
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /image/core/compactor_oom.hprof ...
Heap dump file created [2332530344 bytes in 8.114 secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="gzip -f /image/core/compactor_oom.hprof"
# Executing /bin/sh -c "gzip -f /image/core/compactor_oom.hprof"...
Aborting due to java.lang.OutOfMemoryError: Java heap space
#
# A fatal error has been detected by the Java Runtime Environment:
#
# INVALID (0xe0000000) at pc=0x0000000000000000, pid=20661, tid=0x000079013c94f700
# fatal error: OutOfMemory encountered: Java heap space
# ls -ltr /image/core
-rw------- 1 nsx-cbm nsx-cbm 46385184 Apr 5 20:28 cbm_oom.hprof.gz
-rw------- 1 uproton uproton 37 Apr 5 20:58 proton_oom.hprof.gz
-rw------- 1 root root 331866016 Apr 6 17:46 compactor_oom.hprof.gz
Copy the logical-migration.jar file to the /opt/vmware/upgrade-coordinator-tomcat/temp/ directory on one of the NSX Manager nodes in the cluster.
service proton stop
java -Xms5g -Xmx10g -Dcorfu-property-file-path=/opt/vmware/upgrade-coordinator-tomcat/conf/ufo-factory.properties -Djava.io.tmpdir=/opt/vmware/upgrade-coordinator-tomcat/temp -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Dlog4j.configurationFile=/opt/vmware/upgrade-coordinator-tomcat/conf/log4j2.xml -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/opt/vmware/upgrade-coordinator-tomcat/conf/logging.properties -Dnsx-service-type=nsx-manager -DStaleSegmentPortBindingMapsRectifier.dryRun=false -DStaleSegmentPortBindingMapsRectifier.batchSize=10 -DStaleSegmentPortBindingMapsRectifier.maxThreads=1 -DStaleSegmentPortBindingMapsRectifier.maxTimeoutMinutes=30 -cp /opt/vmware/upgrade-coordinator-tomcat/temp/logical-migration.jar com.vmware.nsx.management.migration.impl.StaleSegmentPortBindingMapsRectifier
Monitor upgrade-coordinator.log:
tail -F /var/log/upgrade-coordinator/upgrade-coordinator.log
When the task completes, upgrade-coordinator.log will show "Migration task finished."
service proton start
# grep -i "completed checkpoint for ########-####-####-####-##########e9" /var/log/corfu/corfu-compactor-audit.log
/var/log/corfu/corfu-compactor-audit.log:2023-06-01T21:13:15.670Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(1131104), cpSize(1089161920) bytes at snapshot Token(epoch=1192, sequence=5314123624) in 308297 ms
/var/log/corfu/corfu-compactor-audit.log:2023-06-01T21:24:54.489Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(30775), cpSize(31698435) bytes at snapshot Token(epoch=1192, sequence=5314265183) in 67668 ms
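The drop between two checkpoints can be expressed as a percentage to make the before and after comparison explicit. This is a small sketch of that arithmetic only; `pct_reduction` is a hypothetical helper name, and the two arguments are values read off the log lines by hand.

```shell
# Hypothetical helper: percentage reduction from a "before" value to an "after" value.
#   $1 = value before cleanup (for example, entries from an earlier checkpoint)
#   $2 = value after cleanup
# Returns a non-zero status when the "before" value is not positive.
pct_reduction() {
  LC_ALL=C awk -v before="$1" -v after="$2" 'BEGIN {
    if (before + 0 <= 0) exit 1
    printf "%.1f\n", (before - after) * 100 / before
  }'
}
```

For example, `pct_reduction 1213681 30775` compares the entry count from the earlier checkpoints with the count in the last line above, and the same call works for the `cpSize` values.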
/config usage has come down as well:
# df -h
Filesystem Size Used Avail Use% Mounted on
udev 24G 0 24G 0% /dev
tmpfs 4.8G 7.5M 4.8G 1% /run
/dev/sda2 11G 6.4G 3.4G 66% /
tmpfs 24G 4.7M 24G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 24G 0 24G 0% /sys/fs/cgroup
/dev/sda3 11G 41M 9.7G 1% /os_bak
/dev/sda1 944M 9.4M 870M 2% /boot
/dev/mapper/nsx-var+dump 9.4G 37M 8.8G 1% /var/dump
/dev/mapper/nsx-config__bak 29G 45M 28G 1% /config_bak
/dev/mapper/nsx-repository 31G 16G 14G 53% /repository
/dev/mapper/nsx-var+log 27G 9.3G 17G 37% /var/log
/dev/mapper/nsx-tmp 3.7G 97M 3.4G 3% /tmp
/dev/mapper/nsx-config 29G 213M 28G 1% /config
/dev/mapper/nsx-image 42G 19G 22G 46% /image
/dev/mapper/nsx-secondary 98G 2.7G 91G 3% /nonconfig
tmpfs 4.8G 0 4.8G 0% /run/user/1007
tmpfs 4.8G 0 4.8G 0% /run/user/0