Status of newly added Standby cluster in NSX Federation remains UNAVAILABLE



Article ID: 422507


Updated On:

Products

VMware NSX

Issue/Introduction

  • This KB applies to Federated NSX environments only.
  • A Standby Global Manager cluster was previously connected to the Active cluster and has since been removed.
  • A new Standby Global Manager cluster has been added to the Location Manager on the Active cluster.
  • Once added, the Sync Status of the standby cluster remains "Unavailable".
  • The number of "remaining_entries_to_send" remains extremely high and decreases very slowly.
  • GET https://<active-gm>/api/v1/sites/status returns output similar to the sample below (reduced to show only the relevant sections):

    {
        "remote_connections": [
            {
                -- snip --
            }
        ],
        "active_standby_sync_statuses": [
            {
                "standby_site": "<standby-site-name>",
                "is_data_consistent": false,
                "description": "Config consumption complete",
                "status": "UNAVAILABLE",
                "percentage_completed": 0,
                "remaining_entries_to_send": -1957158642,
                "sync_type": "DELTA_SYNC",
                "full_sync_status": {
                    -- snip --
                }
            }
        ]
    }
  • Logging on the Standby Global Manager node indicates an exception thrown while two tables are being processed (table UUIDs 307661b9-31fe-3518-895b-e5df563eed57 and 2688dd3a-4c3b-3241-818a-1d64554d406d); the process then shuts down and restarts in a constant loop:

    /var/log/corfu-log-replication/corfu.9010.log

    2025-12-12T11:12:52.678Z | ERROR |             discovery-service |      o.c.p.wireprotocol.LogData | Exception caught at address 16552427, [307661b9-31fe-3518-895b-e5df563eed57, 2688dd3a-4c3b-3241-818a-1d64554d406d], DATA
    2025-12-12T11:12:52.678Z | DEBUG |                       netty-1 |      o.c.r.c.NettyClientRouter | addReconnectionOnCloseFuture[<standby_node_2_IP>:9000]: disconnected
    2025-12-12T11:12:52.681Z | DEBUG |                       netty-2 |      o.c.r.c.NettyClientRouter | addReconnectionOnCloseFuture[<standby_node_3_IP>:9000]: disconnected
    2025-12-12T11:12:52.681Z | DEBUG |                       netty-0 |      o.c.r.c.NettyClientRouter | addReconnectionOnCloseFuture[<standby_node_1_IP>:9000]: disconnected
    2025-12-12T11:12:52.696Z | DEBUG |             discovery-service | t.AbstractTransactionalContext | TXAbort[TX[bb09]]
    2025-12-12T11:12:52.698Z | INFO  |             discovery-service |     org.corfudb.util.FileWatcher | Closed FileWatcher.
    2025-12-12T11:12:52.698Z | INFO  |               FileWatcher-0 |     org.corfudb.util.FileWatcher | FileWatcher failed to poll file /config/cluster-manager/corfu/private/keystore.jks, Exception: java.nio.file.ClosedWatchServiceException., isStopped: true
    2025-12-12T11:12:52.698Z | DEBUG |             discovery-service |      o.c.r.c.NettyClientRouter | stop: Shutting down router for <standby_node_3_IP>:9000
    2025-12-12T11:12:52.698Z | INFO  |               FileWatcher-0 |     org.corfudb.util.FileWatcher | Watch service is stopped. Skip reloading new watch service.
    2025-12-12T11:12:52.698Z | DEBUG |             discovery-service |      o.c.r.c.NettyClientRouter | stop: Shutting down router for <standby_node_1_IP>:9000
    2025-12-12T11:12:52.698Z | DEBUG |             discovery-service |      o.c.r.c.NettyClientRouter | stop: Shutting down router for <standby_node_1_IP>:9000
    2025-12-12T11:12:52.914Z | INFO  |              ShutdownThread | uInterClusterReplicationServer | CleanShutdown: Starting Cleanup.
    ...
    2025-12-12T11:12:59.266Z | INFO  |          WrapperSimpleAppMain | uInterClusterReplicationServer | Initializing LOG REPLICATION SERVER
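The stuck-sync symptom above can be checked from the command line. The sketch below is a hypothetical helper (not part of this KB) that extracts the two relevant fields from a saved copy of the sites/status API response; it assumes the response has first been written to status.json, for example with curl against the Active GM. The sample JSON written here is illustrative.

```shell
# Hypothetical helper: extract the sync status fields this KB discusses
# from a saved GET /api/v1/sites/status response. On a real system you
# would first save the response, e.g.:
#   curl -sk -u admin https://<active-gm>/api/v1/sites/status -o status.json
# Sample response body (illustrative values from this KB):
cat > status.json <<'EOF'
{"active_standby_sync_statuses":[{"standby_site":"site-b","status":"UNAVAILABLE","remaining_entries_to_send":-1957158642}]}
EOF

# Pull out the status string and the remaining-entries counter.
status=$(grep -o '"status": *"[A-Z_]*"' status.json | sed 's/.*"\([A-Z_]*\)"$/\1/')
remaining=$(grep -o '"remaining_entries_to_send": *-\{0,1\}[0-9]*' status.json | grep -o -- '-\{0,1\}[0-9]*$')
echo "status=$status remaining=$remaining"
```

A negative or extremely large "remaining" value that barely moves between polls matches the symptom described in this article.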

 

Environment

VMware NSX (Federated deployment)
VMware NSX-T Data Center (Federated deployment)

Cause

The Standby cluster is experiencing Out-Of-Memory (OOM) errors driven by the Log Replication (LR) cache mechanism: the FullSyncMarker table processed by the LR service has grown significantly (typically holding >300,000 keys and occupying >150 MB).

Because writes to this table occur in parallel with updates to the LR metadata table, accessing the metadata also loads the FullSyncMarker table into the application cache.
This caching behavior consumes excessive heap memory, leading to resource exhaustion.

Resolution

This issue is resolved in VMware NSX 4.2.2, available at Broadcom downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:

  1. Increase memory allocation for log replicator service:
    1. SSH to the first Standby GM as root.
    2. Create a copy of the config file:
      cp /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf.bak
    3. Find the following parameters:
      wrapper.java.initmemory.percent=2
      wrapper.java.maxmemory.percent=2
    4. Using a text editor (vi), change the values in corfu-log-replication-server-wrapper.conf from 2 to 4. Alternatively, use the commands below to do so:
      sed -i 's/wrapper.java.initmemory.percent=2/wrapper.java.initmemory.percent=4/g' /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf
      sed -i 's/wrapper.java.maxmemory.percent=2/wrapper.java.maxmemory.percent=4/g' /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf
    5. Restart the service for the new parameters to take effect:
      systemctl restart corfu-log-replication-server
    6. Repeat on the remaining managers until the change has been introduced on all standby nodes.

  2. Once the count of remaining_entries_to_send drops to 0, change the values back to default:
    1. SSH to the first Standby GM as root.
    2. Restore the backup over the current .conf file. This sets everything back to the defaults:
      mv /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf.bak /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf
      Note: The mv command is used instead of cp (copy) because the .conf.bak file is no longer needed. Using cp is also acceptable.
    3. Restart the service for the new parameters to take effect:
      systemctl restart corfu-log-replication-server
    4. Repeat on the remaining managers until the change has been introduced on all standby nodes.
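The edit-and-revert cycle in steps 1 and 2 above can be sketched as a script. This is a demonstration on a local copy of the config, not the live file; the real path from this KB is /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf, and the service restart is shown only as a comment.

```shell
# Sketch of workaround steps 1 and 2, run against a local copy of the
# wrapper config so it can be tried safely. Real path:
#   /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf
CONF=./corfu-log-replication-server-wrapper.conf
cat > "$CONF" <<'EOF'
wrapper.java.initmemory.percent=2
wrapper.java.maxmemory.percent=2
EOF

cp "$CONF" "$CONF.bak"   # step 1.2: keep a backup for the later revert

# Step 1.4: raise both memory settings from 2% to 4%.
sed -i 's/wrapper.java.initmemory.percent=2/wrapper.java.initmemory.percent=4/g' "$CONF"
sed -i 's/wrapper.java.maxmemory.percent=2/wrapper.java.maxmemory.percent=4/g' "$CONF"
grep percent "$CONF"     # now shows the =4 values

# On a real node, restart the service here and wait until
# remaining_entries_to_send reaches 0:
#   systemctl restart corfu-log-replication-server

# Step 2.2: restore the backup, reverting to the defaults.
mv "$CONF.bak" "$CONF"
grep percent "$CONF"     # back to the =2 values
```

As in the manual procedure, the change must be repeated on every standby node, with a service restart after each edit.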

If the sync still fails and logging similar to the sample below is observed on the Standby node:

/var/log/corfu-log-replication/tanuki.log
INFO   | jvm 404  | 2026/01/12 13:58:34 | java.lang.OutOfMemoryError: Java heap space
STATUS | wrapper  | 2026/01/12 13:58:34 | The JVM has run out of memory.  Requesting thread dump.
STATUS | wrapper  | 2026/01/12 13:58:34 | Dumping JVM state.
STATUS | wrapper  | 2026/01/12 13:58:34 | The JVM has run out of memory.  Restarting JVM.
INFO   | jvm 404  | 2026/01/12 13:58:34 | Dumping heap to /image/core/corfu-log-replication_oom.hprof ...
..
INFO   | jvm 404  | 2026/01/12 13:58:43 | Exception in thread "discovery-service" java.lang.OutOfMemoryError: Java heap space
STATUS | wrapper  | 2026/01/12 13:58:43 | The JVM has run out of memory.  Requesting thread dump.
STATUS | wrapper  | 2026/01/12 13:58:43 | Dumping JVM state.
STATUS | wrapper  | 2026/01/12 13:58:43 | The JVM has run out of memory.  Restart JVM (Ignoring, already restarting).

You may need to disable the Log Replication cache, which is being filled up by a table larger than the memory allocated to the service, causing the service to run out of memory.

  1. SSH to the first NSX Manager where the workaround will be applied as root.
  2. Create a backup of the file that is to be modified:
    cp /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf.bak
  3. Using vi, open file /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf:
    vi /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf
  4. Locate the line that begins with "# Application parameters. Add parameters as needed starting from 1". At the end of that block, add the following parameter:
    wrapper.app.parameter.25=--lrCacheSize=0
    Important note: the parameter number must be one higher than the last existing parameter in the block. In the sample above, the last existing parameter is number 24, so the new parameter is number 25.
  5. Save and close the file. In vi, this is done by entering the following: :wq! followed by enter.
  6. Restart the corfu-log-replication-server service:
    service corfu-log-replication-server stop
    service corfu-log-replication-server start
    • Please note these steps (1-6) must be applied on all Global Manager nodes (both the Active and Standby clusters).
    • The workaround can be introduced node by node (it does not have to be applied to all nodes at the same time).
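The "increment the parameter number by 1" rule in step 4 above can be automated so the index is computed rather than typed by hand. The sketch below is a hypothetical helper run against a local copy of the config; the existing parameter values (--foo, --bar) are placeholders, and the real file is /usr/tanuki/conf/corfu-log-replication-server-wrapper.conf.

```shell
# Sketch: append --lrCacheSize=0 as the next wrapper.app.parameter.N,
# computing N from the highest existing index in the file. Demonstrated
# on a local copy with placeholder parameter values.
CONF=./corfu-log-replication-server-wrapper.conf
cat > "$CONF" <<'EOF'
# Application parameters.  Add parameters as needed starting from 1
wrapper.app.parameter.23=--foo
wrapper.app.parameter.24=--bar
EOF

# Find the highest existing parameter index, then add 1.
last=$(grep -o 'wrapper\.app\.parameter\.[0-9]*' "$CONF" | grep -o '[0-9]*$' | sort -n | tail -1)
next=$((last + 1))

echo "wrapper.app.parameter.$next=--lrCacheSize=0" >> "$CONF"
tail -1 "$CONF"
# On a real node, restart the service afterwards:
#   service corfu-log-replication-server stop
#   service corfu-log-replication-server start
```

With the last existing parameter numbered 24, as in the KB's example, the appended line is wrapper.app.parameter.25=--lrCacheSize=0.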