NSX Global Manager Alarm "Queue Occupancy Threshold Exceeded"
Article ID: 414606


Updated On:

Products

VMware NSX

Issue/Introduction

  • In an NSX Federation environment, the alarm "Queue Occupancy Threshold Exceeded" (LM_2_GM_NOTIFICATION) is triggered in the Global Manager (GM) UI.
  • In the GM log /var/log/async-replicator/ar.log we can see the alarm triggered:
    WARN EventReportProcessor-1-3 EventReportSyslogSender 77561 MONITORING [nsx@6876 comp="global-manager" entId="ID" eventFeatureName="federation" eventSev="warning" eventState="On" eventType="queue_occupancy_threshold_exceeded" level="WARNING" subcomp="async-replicator"] Queue 
  • In the GM UI, under System > Location Manager > "Delta Sync", the Local Manager (LM) queue count is high.
  • In the GM log /var/log/gmanager/gmanager.log we can see the following entries:

Caused by: java.lang.NullPointerException: Cannot invoke "java.util.Map.containsKey(Object)" because the return value of "java.lang.ThreadLocal.get()" is null
        at com.vmware.nsx.management.policy.policyframework.service.ops.traceflow.GmTraceflowListener.changeToOldTraceflowPath(GmTraceflowListener.java:60) ~[libgm-framework-api.jar:?]
        at com.vmware.nsx.management.policy.policyframework.service.span.SpanCalculationResultUtils.populateSpanCalculationResultForDeletedResource(SpanCalculationResultUtils.java:104) ~[libgm-common-framework.jar:?]

  • If we grep to find how often this error occurs, we find it repeated many times in the logs:

grep "java.lang.ThreadLocal.get()" gmanager.* | wc -l

603989
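The same count can be produced with a short script, which also makes it easy to break the total down per rotated log file. This is a minimal sketch (not part of the product), assuming the gmanager.* log files are in the current directory:

```python
import glob

# Signature of the recurring NullPointerException in gmanager logs.
SIGNATURE = 'java.lang.ThreadLocal.get()'

def count_signature(paths):
    """Count lines containing the error signature across the given log files."""
    total = 0
    for path in paths:
        with open(path, errors='replace') as fh:
            total += sum(SIGNATURE in line for line in fh)
    return total

if __name__ == '__main__':
    # Matches the behavior of: grep "java.lang.ThreadLocal.get()" gmanager.* | wc -l
    print(count_signature(glob.glob('gmanager.*')))
```

A count in the hundreds of thousands, as shown above, indicates the issue described in this article.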

Environment

VMware NSX 4.2.x

Cause

When a Traceflow is started from the UI, the system creates a Traceflow Observation and a Traceflow Status, along with the Traceflow Config defined by the user.

  • Traceflow Config holds information about the source and destination MAC and IP.
  • Traceflow Observation holds information about each hop.
  • Traceflow Status holds information about the status of the current trace.

Every two hours, the system initiates a cleanup of the Traceflow Config. This process also removes the corresponding Traceflow Observations and Traceflow Status entries. During this cleanup, the system populates the cache on the GM to ensure that Traceflow Observations are sent only to the appropriate LMs. The expected outcome is that all observations are successfully cleaned up.

However, when the queue size continues to increase, we observe that Traceflow Observations persist in the system even after the associated Traceflow Config and Traceflow Status have been deleted. This behavior causes inconsistencies in the GM cache, as the DELETE notifications from the LMs are not handled correctly by the GM.

Resolution

Workaround:

Make sure you have an up-to-date backup in place and that you know the backup passphrase.

The following workaround clears the TraceflowObservation table of all entries, allowing the queue to process messages again.

  1.  SSH as root user to a single NSX manager in the active GM cluster.
  2.  Run this command:
          corfu_tool_runner.py -n nsx -t TraceflowObservation -o clearTable
  3. Refresh the NSX UI and check the “Delta Sync” view — the message queue count should be decreasing.
  4. After a few minutes, the “Queue Occupancy Threshold Exceeded” alarm should clear automatically.
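To confirm the alarm has cleared without waiting on the UI, the GM's alarm API can be polled (GET /api/v1/alarms). The sketch below is illustrative only: it filters a decoded response body for still-open occurrences of this alarm. The exact field names (`event_type`, `status`) follow the NSX alarm object format; verify them against your NSX version's API reference.

```python
# Minimal sketch: given the decoded JSON body from the GM's
# GET /api/v1/alarms endpoint, return the queue-occupancy alarms
# that are still open. Fetch the body separately, e.g. with curl
# or the 'requests' library, authenticated against your GM.
ALARM_EVENT = 'queue_occupancy_threshold_exceeded'

def open_queue_alarms(alarms_body):
    """Return still-open queue-occupancy alarms from an alarm-list body."""
    return [a for a in alarms_body.get('results', [])
            if a.get('event_type') == ALARM_EVENT
            and a.get('status') == 'OPEN']
```

An empty result means the alarm has resolved; repeated non-empty results after the table has been cleared may indicate a recurrence worth raising with support.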

Additional Information

Alarm for GM to LM data synchronization queue occupancy threshold exceeded