TrimmedException leads to missing configuration information in NSX 4.1.0

Products

VMware NSX

Issue/Introduction

The NSX version is 4.1.0.
Creation of some NSX configuration objects fails, for example: Groups, Logical Switches, DFW Rules, etc.
Existing configuration is missing from the UI and the API.
The following three log signatures are seen when this issue occurs:

1. "TrimmedException" messages are seen in NSX Manager proton logs.
Example in /var/log/proton/nsxapi.log:

2023-04-24T12:44:30.970Z WARN http-nio-127.0.0.1-7440-exec-25 AbstractQueuedStreamView 4840 Fill_Read_Queue[1a2@-1] Trim encountered.
org.corfudb.runtime.exceptions.TrimmedException: Trimmed address: 661313                                                         <<<<
        at org.corfudb.runtime.view.AddressSpaceView.isLogDataValid(AddressSpaceView.java:789) ~[?:?]
        at org.corfudb.runtime.view.AddressSpaceView.checkLogDataThrowException(AddressSpaceView.java:816) ~[?:?]
        at org.corfudb.runtime.view.AddressSpaceView.fetch(AddressSpaceView.java:810) ~[?:?]
        at org.corfudb.runtime.view.AddressSpaceView.lambda$read$9(AddressSpaceView.java:367) ~[?:?]
        at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:57) ~[?:?]
        at org.corfudb.common.metrics.micrometer.MicroMeterUtils.lambda$time$6(MicroMeterUtils.java:121) ~[?:?]
        at java.util.Optional.map(Optional.java:215) ~[?:1.8.0_362]
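
To check a log for this first signature without reading it by eye, a small script can pull out each trimmed address together with the nearest preceding timestamp. This is a minimal sketch that assumes the line layout shown in the example above; the function name `find_trimmed_exceptions` is hypothetical and not part of any NSX tooling.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches the exception line shown in the example above.
TRIMMED_RE = re.compile(r"TrimmedException: Trimmed address: (\d+)")
# Matches a UTC timestamp at the start of a log line.
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z)")


def find_trimmed_exceptions(
    lines: Iterable[str],
) -> List[Tuple[Optional[str], int]]:
    """Return (timestamp, trimmed_address) pairs found in the log lines.

    The exception line carries no timestamp of its own, so the most
    recent timestamped line seen before it is reported instead.
    """
    hits: List[Tuple[Optional[str], int]] = []
    last_timestamp: Optional[str] = None
    for line in lines:
        ts_match = TIMESTAMP_RE.match(line)
        if ts_match:
            last_timestamp = ts_match.group(1)
        trimmed = TRIMMED_RE.search(line)
        if trimmed:
            hits.append((last_timestamp, int(trimmed.group(1))))
    return hits
```

The function accepts any iterable of lines, so it can be fed an open file handle for /var/log/proton/nsxapi.log. The timestamps it returns can then be compared with the compactor log described in signature 3.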

2. "UnreachableClusterException" or "UnrecoverableCorfuInterruptedError" or "UnrecoverableCorfuError" errors are observed in the Corfu logs. These exceptions are preceded by a corfu-server start/restart.
Example in /var/log/corfu/corfu.9000.log:

2023-04-24T12:01:43.022Z | ERROR |       Cmpt-9000-chkpter | o.c.r.o.MVOCorfuCompileProxy | abortTransaction[ImmutableCorfuTable[f2a]] Abort Transaction with Exception {}
org.corfudb.runtime.exceptions.UnreachableClusterException: Runtime stalled. Invoking systemDownHandler after 60 unsuccessful tries.
    at org.corfudb.infrastructure.ManagementServer.lambda$new$0(ManagementServer.java:99)
    at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:176)
    at org.corfudb.runtime.view.AbstractView.layoutHelper(AbstractView.java:61)
    at org.corfudb.runtime.view.AddressSpaceView.fetchAll(AddressSpaceView.java:744)
    at org.corfudb.runtime.view.AddressSpaceView.lambda$read$13(AddressSpaceView.java:489)

3. The Corfu compactor leader log on one of the Managers shows a reduction in the number of tables that were checkpointed, and the timestamp of the reduction is close to the TrimmedException timestamps.

Example: less corfu-compactor-leader.1.log.gz | grep "Total time taken for the compaction cycle"

2023-04-24T10:50:27.002Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 488499ms for 997 tables with status COMPLETED
.
.
2023-04-24T12:24:14.552Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 561543ms for 949 tables with status COMPLETED       <----- Notice that the # of tables got reduced
2023-04-24T12:37:15.632Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 458044ms for 949 tables with status COMPLETED
2023-04-24T12:52:58.115Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 498881ms for 949 tables with status COMPLETED
2023-04-24T13:07:54.492Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 492395ms for 949 tables with status COMPLETED
2023-04-24T13:21:37.181Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 410507ms for 949 tables with status COMPLETED
2023-04-24T13:36:39.089Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 409029ms for 949 tables with status COMPLETED
2023-04-24T13:51:27.566Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 395965ms for 949 tables with status COMPLETED
2023-04-24T14:32:10.576Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 567154ms for 997 tables with status COMPLETED
2023-04-24T14:49:27.768Z | INFO |              Cmpt-9000-chkpter |               compactor-leader | Total time taken for the compaction cycle: 510723ms for 997 tables with status COMPLETED
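
The drop in the table count can also be found programmatically. The sketch below parses the "Total time taken for the compaction cycle" lines and reports every cycle that checkpointed fewer tables than the one before it. It assumes the exact line format shown above; the names `Cycle`, `parse_cycles` and `find_table_drops` are hypothetical helpers, not part of NSX or Corfu.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple

# Matches the summary line the compactor leader writes at the end of a cycle.
CYCLE_RE = re.compile(
    r"^(?P<ts>\S+)\s*\|.*Total time taken for the compaction cycle: "
    r"(?P<ms>\d+)ms for (?P<tables>\d+) tables with status (?P<status>\w+)"
)


class Cycle(NamedTuple):
    timestamp: str
    duration_ms: int
    tables: int
    status: str


def parse_cycles(lines: Iterable[str]) -> List[Cycle]:
    """Extract one Cycle per matching log line, in file order."""
    cycles = []
    for line in lines:
        match = CYCLE_RE.match(line)
        if match:
            cycles.append(
                Cycle(
                    timestamp=match.group("ts"),
                    duration_ms=int(match.group("ms")),
                    tables=int(match.group("tables")),
                    status=match.group("status"),
                )
            )
    return cycles


def find_table_drops(cycles: List[Cycle]) -> List[Tuple[Cycle, Cycle]]:
    """Return (previous, current) pairs where the table count went down."""
    return [
        (prev, cur)
        for prev, cur in zip(cycles, cycles[1:])
        if cur.tables < prev.tables
    ]
```

For a rotated log such as corfu-compactor-leader.1.log.gz, the lines can be read with Python's `gzip.open(path, "rt")` and passed to `parse_cycles`. Any pair returned by `find_table_drops` marks a cycle whose timestamp should be compared with the TrimmedException timestamps.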

Cause

NSX initializes many processes when the Corfu server starts or restarts, one of which is the compactor process. While the compactor is initializing, opening the first table accesses the registry table, which syncs the contents of the registry table into the Corfu object layer. If an UnreachableClusterException is encountered during this sync, the compactor does not handle it and the exception is ignored.
This leaves the registry table in the object layer inconsistent with the database.

For example, the registry table on disk lists N tables, but the Corfu object layer holds only N-48 tables because the sync was cut short by an UnreachableClusterException. When the compactor then reads the list of tables to compact, it sees only N-48 tables because the object layer has inconsistent data, resulting in data loss for the other 48 tables. From this point on, every compactor cycle processes only N-48 tables until the next restart of the Corfu server, when the sync succeeds and all N tables are visible again. Checkpointing then works correctly for all N tables, but those 48 tables no longer have any data in them.
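
The failure mode can be illustrated with a toy model. This is not Corfu code: every name in it is hypothetical, and it assumes that the trim step discards old log entries for all tables, whether or not they were checkpointed. The article does not spell that out, although the TrimmedException messages suggest it. The sketch only shows how a swallowed exception during the registry sync leaves a partial view, and how compacting from that view loses data.

```python
class UnreachableClusterError(Exception):
    """Stand-in for Corfu's UnreachableClusterException (toy model only)."""


def sync_registry(on_disk_names, fail_after=None):
    """Copy table names from 'disk' into an in-memory view.

    If fail_after is set, the sync is interrupted after that many names
    and the error is swallowed, mimicking the ignored exception.
    """
    view = []
    try:
        for index, name in enumerate(on_disk_names):
            if fail_after is not None and index == fail_after:
                raise UnreachableClusterError("runtime stalled")
            view.append(name)
    except UnreachableClusterError:
        pass  # the flaw being illustrated: the partial view is kept
    return view


def compaction_cycle(log, view):
    """Checkpoint only the tables in `view`, then trim the log.

    `log` maps table name -> list of entries. Assumption: the trim
    empties the log for every table, so a table that is missing from
    `view` ends up with neither a checkpoint nor log entries.
    """
    checkpoint = {name: list(log[name]) for name in view}
    for name in log:
        log[name] = []
    return checkpoint


# N = 997 tables on disk; the sync is interrupted after 949 of them.
names = [f"table_{i}" for i in range(997)]
log = {name: ["row"] for name in names}
view = sync_registry(names, fail_after=949)
checkpoint = compaction_cycle(log, view)
lost = [n for n in names if n not in checkpoint and not log[n]]
print(len(view), len(lost))
```

In this model, a later successful sync restores the full list of table names, but the entries of the tables in `lost` are already gone, which matches the behaviour described above.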

Resolution

This issue is resolved in NSX 4.1.0.2 and higher versions.

Please be advised that the VMware NSX team has decided to withdraw the NSX 4.1.0 release from the download page in favor of NSX 4.1.0.2.
Customers who have downloaded and deployed NSX 4.1.0 remain supported but are strongly advised to upgrade to NSX 4.1.0.2 or higher at their earliest convenience.

Workaround:

Recovery
If this issue has been experienced, it is necessary to restore the NSX Manager from backup.

Prevention
To prevent this issue from occurring, VMware recommends an upgrade to NSX 4.1.0.2.

If for some reason an upgrade is not possible, please open a support request and refer to this KB article.

Additional Information

Impact/Risks:
Some NSX configuration may be deleted.