NSX UI becomes inaccessible when old mappings of Segment Profiles to individual ports are not cleaned up during Corfu compaction

Products

VMware NSX

Issue/Introduction

Segment Profiles are being configured on individual Segment ports rather than inherited from the parent Segment. Per-port profiles are visible in the NSX UI at Networking > Segments > click the blue number under Ports / Interfaces > expand SEGMENT PORT PROFILES. The issue surfaces when this is done at large scale, likely through some form of automation.
Corfu checkpoints for the GenericPolicyRealizedResource (GPRR) table with UUID ########-####-####-####-##########e9 are large, and the table may hold close to a million entries before the Manager cluster is impacted. Because of its size, each checkpoint of this table takes several minutes or more:

# grep -i "completed checkpoint for ########-####-####-####-##########e9" /var/log/corfu/corfu-compactor-audit.log
2023-06-01T20:34:21.337Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(1213644), cpSize(1180152103) bytes at snapshot Token(epoch=1192, sequence=5313828428) in 333111 ms
2023-06-01T20:54:05.308Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(1213681), cpSize(1180184464) bytes at snapshot Token(epoch=1192, sequence=5313956408) in 323643 ms
2023-06-01T21:09:31.314Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(1213681), cpSize(1180184464) bytes at snapshot Token(epoch=1192, sequence=5314047599) in 338404 ms
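The entry count and checkpoint duration are buried in long log lines, so a small filter can make the trend easier to read. The following is a minimal sketch, assuming the log line format shown above; `summarize_checkpoints` is a hypothetical helper name, not an NSX command.

```shell
# Hypothetical helper: reduce "completed checkpoint" lines read from stdin
# to "<timestamp> entries=<count> seconds=<duration>".
# Assumes the line format shown in the sample output above.
summarize_checkpoints() {
    # Capture the first field (timestamp), the entries(...) count and the
    # trailing "in <N> ms" duration; lines that do not match are dropped.
    sed -n 's/^\([^ ]*\) .*entries(\([0-9]*\)).* in \([0-9]*\) ms$/\1 \2 \3/p' |
        awk '{ printf "%s entries=%d seconds=%d\n", $1, $2, $3 / 1000 }'
}
```

It can be fed from the same grep used above, for example `grep -i "completed checkpoint for ########-####-####-####-##########e9" /var/log/corfu/corfu-compactor-audit.log | summarize_checkpoints`.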
When this issue is hit, a dump of the GPRR table and subsequent analysis show that the vast majority of entries in the table come from objects such as port-security-profile-binding-maps and mac-discovery-profiles:

# corfu_tool_runner.py -o showTable -n nsx -t GenericPolicyRealizedResource > gprr.txt

# grep stringId gprr.txt | awk '{print $2}' | cut -d "/" -f 1-7 | sort | uniq -c | sort -nr | head
642733 "/infra/realized-state/enforcement-points/default/security/port-security-profile-binding-maps
322725 "/infra/realized-state/enforcement-points/default/discovery/mac-discovery-profiles
   7634 "/infra/realized-state/enforcement-points/default/services/nsservices
   1790 "/infra/realized-state/enforcement-points/default/groups/nsgroups
   1043 "/infra/realized-state/enforcement-points/default/firewalls/firewall-sections
    423 "/infra/realized-state/enforcement-points/default/dhcp-servers/dhcp-server-<UUID>
    244 "/infra/realized-state/enforcement-points/default/dhcp-servers/dhcp-server-<UUID>
     73 "/infra/realized-state/enforcement-points/default/ops/ipfix-dfw-profiles
     49 "/infra/realized-state/enforcement-points/default/dhcp-servers/dhcp-server-<UUID>
     44 "/infra/realized-state/enforcement-points/default/dhcp-servers/dhcp-server-<UUID>
/config usage grows steadily if the GPRR table becomes too large and Corfu compaction is failing. Beyond 10% usage, alarms are raised in the NSX UI and the UI can become inaccessible:

# df -h
Filesystem                  Size Used Avail Use% Mounted on
udev                         24G     0   24G   0% /dev
tmpfs                       4.8G 7.4M 4.8G   1% /run
/dev/sda2                    11G 7.1G 2.7G 74% /
tmpfs                        24G 616K   24G   1% /dev/shm
tmpfs                       5.0M     0 5.0M   0% /run/lock
tmpfs                        24G     0   24G   0% /sys/fs/cgroup
/dev/sda1                   930M 8.3M 857M   1% /boot
/dev/mapper/nsx-repository   31G 7.0G   22G 25% /repository
/dev/mapper/nsx-var+dump    9.2G 296M 8.4G   4% /var/dump
/dev/mapper/nsx-tmp         3.7G 9.9M 3.5G   1% /tmp
/dev/mapper/nsx-config       29G   13G   15G 46% /config
/dev/mapper/nsx-image        42G 6.0G   34G 16% /image
/dev/mapper/nsx-secondary    98G 3.8G   90G   5% /nonconfig
/dev/mapper/nsx-var+log      27G   15G   11G 59% /var/log
tmpfs                       4.8G     0 4.8G   0% /run/user/1007
tmpfs                       4.8G     0 4.8G   0% /run/user/0
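To track a single mount point rather than reading the whole table, the usage column can be extracted with awk. This is a minimal sketch; `mount_use_pct` is a hypothetical helper name, and it assumes POSIX-format `df -P` output in which the mount point is the last column and the usage percentage is the one before it.

```shell
# Hypothetical helper: print the usage percentage (digits only) of the
# mount point given as $1, reading `df -P` output from stdin.
mount_use_pct() {
    # Skip the header row, match the last column against the mount point,
    # and strip the trailing "%" from the usage column.
    awk -v mp="$1" 'NR > 1 && $NF == mp { sub(/%/, "", $(NF-1)); print $(NF-1) }'
}
```

For example, `df -P | mount_use_pct /config` prints only the number, which can then be compared against whatever alarm threshold applies in the environment.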
/var/log/corfu/corfu-compactor-audit.log shows Corfu database compaction failing with OutOfMemoryError:

2023-04-07T16:12:43.913Z INFO metrics-logger-reporter-1-thread-1 metricsdata - type=TIMER, name=com.vmware.nsx.platform.clustering.persistence.corfu.CorfuDbDataStoreUfo.create, count=1, min=1448.6027609999999, max=1448.6027609999999, mean=1448.6027609999999, stddev=0.0, median=1448.6027609999999, p75=1448.6027609999999, p95=1448.6027609999999, p98=1448.6027609999999, p99=1448.6027609999999, p999=1448.6027609999999, mean_rate=0.0017346739786195718, m1=1.4970365977540202E-5, m5=0.029913723844527035, m15=0.10616389011240257, rate_unit=events/second, duration_unit=milliseconds
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /image/core/compactor_oom.hprof ...
Heap dump file created [2332530344 bytes in 8.114 secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="gzip -f /image/core/compactor_oom.hprof"
# Executing /bin/sh -c "gzip -f /image/core/compactor_oom.hprof"...
Aborting due to java.lang.OutOfMemoryError: Java heap space
#
# A fatal error has been detected by the Java Runtime Environment:
#
# INVALID (0xe0000000) at pc=0x0000000000000000, pid=20661, tid=0x000079013c94f700
# fatal error: OutOfMemory encountered: Java heap space
There may be core dumps from services and processes crashing with OOM errors:

# ls -ltr /image/core
-rw------- 1 nsx-cbm nsx-cbm 46385184 Apr 5 20:28 cbm_oom.hprof.gz
-rw------- 1 uproton uproton 37 Apr 5 20:58 proton_oom.hprof.gz
-rw------- 1 root root 331866016 Apr 6 17:46 compactor_oom.hprof.gz

Cause

When Segment Profiles are configured for individual ports instead of Segments, Corfu compaction does not properly clean up old segment profile-to-port mappings.

Resolution

The Segment Security and SpoofGuard mapping cleanup issue is resolved in VMware NSX-T Data Center 3.2.3.

The MAC Discovery and IP Discovery profile mapping cleanup issue is resolved in VMware NSX 4.1.1.

See Workaround steps to remove stale entries already present in the GPRR table to stabilize the cluster before upgrading. The workaround cleans up all types of stale profile-to-port mappings.

Workaround:
Steps to remove stale Segment Profile-to-port mappings from the GPRR table in the Corfu database:

Confirm backups of the NSX Manager cluster are being taken regularly and take a new backup before executing workaround steps.
Copy the attached logical-migration.jar file to the /opt/vmware/upgrade-coordinator-tomcat/temp/ directory on one of the NSX Manager nodes in the cluster.
Stop proton on all three Manager nodes from the root shell:

service proton stop
Start the cleanup procedure on the Manager node where the jar file was copied:

java -Xms5g -Xmx10g -Dcorfu-property-file-path=/opt/vmware/upgrade-coordinator-tomcat/conf/ufo-factory.properties -Djava.io.tmpdir=/opt/vmware/upgrade-coordinator-tomcat/temp -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Dlog4j.configurationFile=/opt/vmware/upgrade-coordinator-tomcat/conf/log4j2.xml -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/opt/vmware/upgrade-coordinator-tomcat/conf/logging.properties -Dnsx-service-type=nsx-manager -DStaleSegmentPortBindingMapsRectifier.dryRun=false -DStaleSegmentPortBindingMapsRectifier.batchSize=10 -DStaleSegmentPortBindingMapsRectifier.maxThreads=1 -DStaleSegmentPortBindingMapsRectifier.maxTimeoutMinutes=30 -cp /opt/vmware/upgrade-coordinator-tomcat/temp/logical-migration.jar com.vmware.nsx.management.migration.impl.StaleSegmentPortBindingMapsRectifier
Open a second session to the same Manager node where the tool is running and monitor its progress in upgrade-coordinator.log:

tail -F /var/log/upgrade-coordinator/upgrade-coordinator.log
Wait for tool execution to complete. This can take around 15 minutes. Once finished, the prompt will return in the window where the script was run and upgrade-coordinator.log will show "Migration task finished."
Start proton on all three Manager nodes:

service proton start
Wait for at least three compaction cycles to complete (compaction runs every 15 minutes) and verify that the entry count in the GPRR table has come down:

# grep -i "completed checkpoint for ########-####-####-####-##########e9" /var/log/corfu/corfu-compactor-audit.log
/var/log/corfu/corfu-compactor-audit.log:2023-06-01T21:13:15.670Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(1131104), cpSize(1089161920) bytes at snapshot Token(epoch=1192, sequence=5314123624) in 308297 ms
/var/log/corfu/corfu-compactor-audit.log:2023-06-01T21:24:54.489Z INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(30775), cpSize(31698435) bytes at snapshot Token(epoch=1192, sequence=5314265183) in 67668 ms
Verify that /config usage has come down as well:

# df -h
Filesystem                   Size Used Avail Use% Mounted on
udev                          24G     0   24G   0% /dev
tmpfs                        4.8G 7.5M 4.8G   1% /run
/dev/sda2                     11G 6.4G 3.4G 66% /
tmpfs                         24G 4.7M   24G   1% /dev/shm
tmpfs                        5.0M     0 5.0M   0% /run/lock
tmpfs                         24G     0   24G   0% /sys/fs/cgroup
/dev/sda3                     11G   41M 9.7G   1% /os_bak
/dev/sda1                    944M 9.4M 870M   2% /boot
/dev/mapper/nsx-var+dump     9.4G   37M 8.8G   1% /var/dump
/dev/mapper/nsx-config__bak   29G   45M   28G   1% /config_bak
/dev/mapper/nsx-repository    31G   16G   14G 53% /repository
/dev/mapper/nsx-var+log       27G 9.3G   17G 37% /var/log
/dev/mapper/nsx-tmp          3.7G   97M 3.4G   3% /tmp
/dev/mapper/nsx-config        29G 213M   28G   1% /config
/dev/mapper/nsx-image         42G   19G   22G 46% /image
/dev/mapper/nsx-secondary     98G 2.7G   91G   3% /nonconfig
tmpfs                        4.8G     0 4.8G   0% /run/user/1007
tmpfs                        4.8G     0 4.8G   0% /run/user/0
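
The entry-count verification in the steps above can also be scripted rather than read by eye. The following is a minimal sketch, assuming the checkpoint log line format shown earlier; `latest_entries_below` is a hypothetical helper name, and the limit passed to it is chosen by the operator, not a value defined by NSX.

```shell
# Hypothetical helper: succeed (exit 0) if the most recent checkpoint line
# on stdin reports fewer entries than the limit given as $1.
latest_entries_below() {
    limit="$1"
    # Pull the entries(...) count from every matching line, keep the last one.
    latest=$(sed -n 's/.*entries(\([0-9]*\)).*/\1/p' | tail -n 1)
    # Fail if no checkpoint line was found or the count is still too high.
    [ -n "$latest" ] && [ "$latest" -lt "$limit" ]
}
```

For example, piping the grep from the verification step into `latest_entries_below 100000` returns success for the sample output above, where the last checkpoint reports 30775 entries.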

Additional Information

Impact/Risks:
Impact varies depending on the rate at which Segment ports are being created and deleted, and how many Segment Profiles are being assigned to individual Segment ports instead of Segments.
NSX UI may be sluggish or unavailable when this issue is present, depending on the success of recent Corfu compaction cycles.

Attachments

logical-migration