NSX UI becomes inaccessible when old mappings of Segment Profiles to individual ports are not cleaned up during Corfu compaction


Article ID: 319027


Products

VMware NSX

Issue/Introduction

  1. Segment Port Profiles (viewed in the NSX UI at Networking > Segments > click on blue # of Ports / Interfaces > expand SEGMENT PORT PROFILES) are configured per port rather than inherited from their Segment. The issue surfaces when this is done at large scale, likely through some form of automation.

  2. Corfu checkpoints for the GPRR (GenericPolicyRealizedResource) table with UUID ########-####-####-####-##########e9 are large, and the table may hold close to a million entries before the Manager cluster is impacted. Because of its size, each checkpoint of this table takes several minutes or more:

    # grep -i "completed checkpoint for ########-####-####-####-##########e9" /var/log/corfu/corfu-compactor-audit.log
    2023-06-01T20:34:21.337Z  INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(1213644), cpSize(1180152103) bytes at snapshot Token(epoch=1192, sequence=5313828428) in 333111 ms
    2023-06-01T20:54:05.308Z  INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(1213681), cpSize(1180184464) bytes at snapshot Token(epoch=1192, sequence=5313956408) in 323643 ms
    2023-06-01T21:09:31.314Z  INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(1213681), cpSize(1180184464) bytes at snapshot Token(epoch=1192, sequence=5314047599) in 338404 ms
     
  3. When this issue is hit, a dump of the GPRR table and subsequent analysis show that the vast majority of entries in the table come from objects such as port-security-profile-binding-maps and mac-discovery-profiles:

    # corfu_tool_runner.py -o showTable -n nsx -t GenericPolicyRealizedResource > gprr.txt
     
    # grep stringId gprr.txt | awk '{print $2}' | cut -d "/" -f 1-7 | sort | uniq -c | sort -nr | head
     642733 "/infra/realized-state/enforcement-points/default/security/port-security-profile-binding-maps
     322725 "/infra/realized-state/enforcement-points/default/discovery/mac-discovery-profiles
       7634 "/infra/realized-state/enforcement-points/default/services/nsservices
       1790 "/infra/realized-state/enforcement-points/default/groups/nsgroups
       1043 "/infra/realized-state/enforcement-points/default/firewalls/firewall-sections
        423 "/infra/realized-state/enforcement-points/default/dhcp-servers/dhcp-server-<UUID>
        244 "/infra/realized-state/enforcement-points/default/dhcp-servers/dhcp-server-<UUID>
         73 "/infra/realized-state/enforcement-points/default/ops/ipfix-dfw-profiles
         49 "/infra/realized-state/enforcement-points/default/dhcp-servers/dhcp-server-<UUID>
         44 "/infra/realized-state/enforcement-points/default/dhcp-servers/dhcp-server-<UUID>
     
  4. /config usage grows steadily if the GPRR table becomes too large and Corfu compaction is failing. Beyond 10%, alarms are raised in the NSX UI and the UI can become inaccessible:

    # df -h
    Filesystem                  Size  Used Avail Use% Mounted on
    udev                         24G     0   24G   0% /dev
    tmpfs                       4.8G  7.4M  4.8G   1% /run
    /dev/sda2                    11G  7.1G  2.7G  74% /
    tmpfs                        24G  616K   24G   1% /dev/shm
    tmpfs                       5.0M     0  5.0M   0% /run/lock
    tmpfs                        24G     0   24G   0% /sys/fs/cgroup
    /dev/sda1                   930M  8.3M  857M   1% /boot
    /dev/mapper/nsx-repository   31G  7.0G   22G  25% /repository
    /dev/mapper/nsx-var+dump    9.2G  296M  8.4G   4% /var/dump
    /dev/mapper/nsx-tmp         3.7G  9.9M  3.5G   1% /tmp
    /dev/mapper/nsx-config       29G   13G   15G  46% /config
    /dev/mapper/nsx-image        42G  6.0G   34G  16% /image
    /dev/mapper/nsx-secondary    98G  3.8G   90G   5% /nonconfig
    /dev/mapper/nsx-var+log      27G   15G   11G  59% /var/log
    tmpfs                       4.8G     0  4.8G   0% /run/user/1007
    tmpfs                       4.8G     0  4.8G   0% /run/user/0

  5. /var/log/corfu/corfu-compactor-audit.log shows Corfu database compaction failing with OutOfMemoryError:

    2023-04-07T16:12:43.913Z  INFO metrics-logger-reporter-1-thread-1 metricsdata - type=TIMER, name=com.vmware.nsx.platform.clustering.persistence.corfu.CorfuDbDataStoreUfo.create, count=1, min=1448.6027609999999, max=1448.6027609999999, mean=1448.6027609999999, stddev=0.0, median=1448.6027609999999, p75=1448.6027609999999, p95=1448.6027609999999, p98=1448.6027609999999, p99=1448.6027609999999, p999=1448.6027609999999, mean_rate=0.0017346739786195718, m1=1.4970365977540202E-5, m5=0.029913723844527035, m15=0.10616389011240257, rate_unit=events/second, duration_unit=milliseconds
    java.lang.OutOfMemoryError: Java heap space
    Dumping heap to /image/core/compactor_oom.hprof ...
    Heap dump file created [2332530344 bytes in 8.114 secs]
    #
    # java.lang.OutOfMemoryError: Java heap space
    # -XX:OnOutOfMemoryError="gzip -f /image/core/compactor_oom.hprof"
    #   Executing /bin/sh -c "gzip -f /image/core/compactor_oom.hprof"...
    Aborting due to java.lang.OutOfMemoryError: Java heap space
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  INVALID (0xe0000000) at pc=0x0000000000000000, pid=20661, tid=0x000079013c94f700
    #  fatal error: OutOfMemory encountered: Java heap space

  6. There may be heap dumps in /image/core from services and processes that crashed with OOM errors:

    # ls -ltr /image/core
    -rw------- 1 nsx-cbm nsx-cbm  46385184 Apr  5 20:28 cbm_oom.hprof.gz
    -rw------- 1 uproton uproton        37 Apr  5 20:58 proton_oom.hprof.gz
    -rw------- 1 root    root    331866016 Apr  6 17:46 compactor_oom.hprof.gz
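
To compare checkpoint size and duration across compaction cycles, the log lines shown in item 2 can be condensed to one short record each. The following is a minimal sketch and not part of the NSX tooling: the function name summarize_checkpoints and the output field labels are illustrative, and the table UUID must be taken from your own log.

```shell
#!/bin/sh
# Sketch: condense "completed checkpoint" lines into one short record each.
# summarize_checkpoints is a hypothetical helper name, not an NSX command.
#   $1  table identifier exactly as it appears in the log
#   $2  path to corfu-compactor-audit.log
summarize_checkpoints() {
    # -F treats the search text as a fixed string, -i ignores case.
    grep -iF "completed checkpoint for $1" "$2" |
        sed -E 's/.*([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z).*entries\(([0-9]+)\), cpSize\(([0-9]+)\).* in ([0-9]+) ms.*/\1 entries=\2 cpSize_bytes=\3 duration_ms=\4/'
}

# Example (set TABLE_UUID to the table UUID seen in your own log):
#   summarize_checkpoints "$TABLE_UUID" /var/log/corfu/corfu-compactor-audit.log
```

An entry count and duration that stay high from one cycle to the next, as in the sample output in item 2, suggest that stale mappings are not being cleaned up.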

Cause

When Segment Profiles are configured for individual ports instead of Segments, Corfu compaction does not properly clean up old segment profile-to-port mappings.

Resolution

The Segment Security and SpoofGuard mapping cleanup issue is resolved in VMware NSX-T Data Center 3.2.3.

The MAC Discovery and IP Discovery profile mapping cleanup issue is resolved in VMware NSX 4.1.1.

To stabilize the cluster before upgrading, follow the Workaround steps to remove stale entries already present in the GPRR table. The workaround cleans up all types of stale profile-to-port mappings.

Workaround:
Steps to remove stale Segment Profile-to-port mappings from the GPRR table in the Corfu database:
 
  1. Confirm backups of the NSX Manager cluster are being taken regularly and take a new backup before executing workaround steps.
  2. Copy the attached logical-migration.jar file to the /opt/vmware/upgrade-coordinator-tomcat/temp/ directory on one of the NSX Manager nodes in the cluster.
  3. Stop proton on all three Manager nodes from the root shell:

    service proton stop

  4. Start the cleanup procedure from the root shell of the Manager node where the jar file was copied:

    java -Xms5g -Xmx10g \
        -Dcorfu-property-file-path=/opt/vmware/upgrade-coordinator-tomcat/conf/ufo-factory.properties \
        -Djava.io.tmpdir=/opt/vmware/upgrade-coordinator-tomcat/temp \
        -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector \
        -Dlog4j.configurationFile=/opt/vmware/upgrade-coordinator-tomcat/conf/log4j2.xml \
        -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager \
        -Djava.util.logging.config.file=/opt/vmware/upgrade-coordinator-tomcat/conf/logging.properties \
        -Dnsx-service-type=nsx-manager \
        -DStaleSegmentPortBindingMapsRectifier.dryRun=false \
        -DStaleSegmentPortBindingMapsRectifier.batchSize=10 \
        -DStaleSegmentPortBindingMapsRectifier.maxThreads=1 \
        -DStaleSegmentPortBindingMapsRectifier.maxTimeoutMinutes=30 \
        -cp /opt/vmware/upgrade-coordinator-tomcat/temp/logical-migration.jar \
        com.vmware.nsx.management.migration.impl.StaleSegmentPortBindingMapsRectifier
     
  5. In a second session on the same Manager node where the tool is running, monitor its progress in upgrade-coordinator.log:

    tail -F /var/log/upgrade-coordinator/upgrade-coordinator.log
     
  6. Wait for the tool to finish, which can take around 15 minutes. When it completes, the prompt returns in the session where the tool was started and upgrade-coordinator.log shows "Migration task finished."
  7. Start proton on all three Manager nodes:

    service proton start
     
  8. Wait for at least three compaction cycles to complete (compaction runs every 15 minutes) and verify that the entry count in the GPRR table has come down:

    # grep -i "completed checkpoint for ########-####-####-####-##########e9" /var/log/corfu/corfu-compactor-audit.log
    /var/log/corfu/corfu-compactor-audit.log:2023-06-01T21:13:15.670Z  INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(1131104), cpSize(1089161920) bytes at snapshot Token(epoch=1192, sequence=5314123624) in 308297 ms
    /var/log/corfu/corfu-compactor-audit.log:2023-06-01T21:24:54.489Z  INFO main CheckpointWriter - appendCheckpoint: completed checkpoint for ########-####-####-####-##########e9, entries(30775), cpSize(31698435) bytes at snapshot Token(epoch=1192, sequence=5314265183) in 67668 ms

  9. Verify that /config usage has come down as well:

    # df -h
    Filesystem                   Size  Used Avail Use% Mounted on
    udev                          24G     0   24G   0% /dev
    tmpfs                        4.8G  7.5M  4.8G   1% /run
    /dev/sda2                     11G  6.4G  3.4G  66% /
    tmpfs                         24G  4.7M   24G   1% /dev/shm
    tmpfs                        5.0M     0  5.0M   0% /run/lock
    tmpfs                         24G     0   24G   0% /sys/fs/cgroup
    /dev/sda3                     11G   41M  9.7G   1% /os_bak
    /dev/sda1                    944M  9.4M  870M   2% /boot
    /dev/mapper/nsx-var+dump     9.4G   37M  8.8G   1% /var/dump
    /dev/mapper/nsx-config__bak   29G   45M   28G   1% /config_bak
    /dev/mapper/nsx-repository    31G   16G   14G  53% /repository
    /dev/mapper/nsx-var+log       27G  9.3G   17G  37% /var/log
    /dev/mapper/nsx-tmp          3.7G   97M  3.4G   3% /tmp
    /dev/mapper/nsx-config        29G  213M   28G   1% /config
    /dev/mapper/nsx-image         42G   19G   22G  46% /image
    /dev/mapper/nsx-secondary     98G  2.7G   91G   3% /nonconfig
    tmpfs                        4.8G     0  4.8G   0% /run/user/1007
    tmpfs                        4.8G     0  4.8G   0% /run/user/0
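
The checks in steps 8 and 9 can be reduced to two numbers that are easy to compare before and after the cleanup. The following is a minimal sketch under stated assumptions: the helper names latest_entries and mount_use_pct are hypothetical, the table UUID must be taken from your own log, and the df parsing assumes the standard `df -P` column layout.

```shell
#!/bin/sh
# Sketch of post-cleanup checks. Helper names are hypothetical.

# Print the entry count from the most recent checkpoint of a table.
#   $1  table identifier exactly as it appears in the log
#   $2  path to corfu-compactor-audit.log
latest_entries() {
    grep -iF "completed checkpoint for $1" "$2" |
        tail -n 1 |
        sed -E 's/.*entries\(([0-9]+)\).*/\1/'
}

# Print the Use% of a mount point as a bare integer.
# Relies on the POSIX "df -P" layout, where Use% is the fifth column.
mount_use_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example (set TABLE_UUID to the table UUID seen in your own log):
#   latest_entries "$TABLE_UUID" /var/log/corfu/corfu-compactor-audit.log
#   mount_use_pct /config
```

In the sample output above, the entry count falls from 1131104 to 30775 and /config usage falls from 46% to 1%. Values that remain high after several compaction cycles would suggest that the cleanup did not take effect.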


Additional Information

Impact/Risks:
Impact varies depending on the rate at which Segment ports are being created and deleted, and how many Segment Profiles are being assigned to individual Segment ports instead of Segments.
NSX UI may be sluggish or unavailable when this issue is present, depending on the success of recent Corfu compaction cycles.

Attachments

logical-migration