WARN EventReportProcessor-1-3 EventReportSyslogSender 77561 MONITORING [nsx@6876 comp="global-manager" entId="ID" eventFeatureName="federation" eventSev="warning" eventState="On" eventType="queue_occupancy_threshold_exceeded" level="WARNING" subcomp="async-replicator"] Queue
Caused by: java.lang.NullPointerException: Cannot invoke "java.util.Map.containsKey(Object)" because the return value of "java.lang.ThreadLocal.get()" is null
    at com.vmware.nsx.management.policy.policyframework.service.ops.traceflow.GmTraceflowListener.changeToOldTraceflowPath(GmTraceflowListener.java:60) ~[libgm-framework-api.jar:?]
    at com.vmware.nsx.management.policy.policyframework.service.span.SpanCalculationResultUtils.populateSpanCalculationResultForDeletedResource(SpanCalculationResultUtils.java:104) ~[libgm-common-framework.jar:?]
grep "java.lang.ThreadLocal.get()" gmanager.* | wc -l
603989
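Where grep is not convenient (for example, when reviewing an extracted support bundle on a workstation), the same count can be approximated with a short script. This is a minimal sketch, not part of the product tooling: the function name count_matching_lines and its parameters are illustrative assumptions, and it reads only uncompressed files matching gmanager.* in the given directory.

```python
import glob
import os

# Literal text to look for; taken from the NullPointerException message above.
NEEDLE = "java.lang.ThreadLocal.get()"


def count_matching_lines(log_dir, pattern="gmanager.*", needle=NEEDLE):
    """Count lines containing `needle` across all files matching `pattern`.

    Hypothetical helper: roughly mirrors `grep "<needle>" gmanager.* | wc -l`,
    except that the needle is matched as a literal string, not a regex.
    """
    total = 0
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        if not os.path.isfile(path):
            continue  # skip directories that happen to match the pattern
        # errors="replace" keeps the scan going if a log has undecodable bytes
        with open(path, "r", errors="replace") as handle:
            total += sum(1 for line in handle if needle in line)
    return total
```

Calling count_matching_lines("/path/to/logs") returns a single integer; a very large value, such as the one shown above, indicates that the exception is being logged repeatedly.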
VMware NSX 4.2.x
When a Traceflow is started from the UI, the system creates a Traceflow Observation and a Traceflow Status, along with the Traceflow Config that the user defined.
Traceflow Config holds information about the source and destination MAC and IP.
Traceflow Observation holds information about each hop.
Traceflow Status holds information about the status of the current trace.
Every two hours, the system initiates a cleanup of the Traceflow Config. This process also removes the corresponding Traceflow Observations and Traceflow Status entries. During this cleanup, the system populates the cache on the GM to ensure that Traceflow Observations are sent only to the appropriate LMs. The expected outcome is that all observations are successfully cleaned up.
However, when the queue size continues to increase, we observe that Traceflow Observations persist in the system even after the associated Traceflow Config and Traceflow Status have been deleted. This behavior causes inconsistencies in the GM cache, as the DELETE notifications from the LMs are not handled correctly by the GM.
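The inconsistency described above can be illustrated with a small model. This is a conceptual sketch only, not the actual GM or LM implementation: the Observation class, its fields, and the cleanup and find_orphans helpers are hypothetical names introduced for illustration. It contrasts the expected cascade (the config, status, and observations are removed together) with the faulty state, in which observations outlive their config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical stand-in for one Traceflow Observation (one hop)."""
    observation_id: str
    config_id: str  # assumed reference to the owning Traceflow Config


def cleanup(config_id, configs, statuses, observations):
    """Expected behavior: delete the config, its status, and its observations.

    `configs` and `statuses` are sets of config ids and are modified in place;
    the surviving observations are returned as a new list.
    """
    configs.discard(config_id)
    statuses.discard(config_id)
    return [o for o in observations if o.config_id != config_id]


def find_orphans(configs, observations):
    """Return observations whose owning config no longer exists."""
    return [o for o in observations if o.config_id not in configs]


if __name__ == "__main__":
    configs = {"tf-1", "tf-2"}
    statuses = {"tf-1", "tf-2"}
    observations = [
        Observation("obs-1", "tf-1"),
        Observation("obs-2", "tf-1"),
        Observation("obs-3", "tf-2"),
    ]

    # Faulty case: config and status are removed, observations are left behind.
    configs.discard("tf-1")
    statuses.discard("tf-1")
    print([o.observation_id for o in find_orphans(configs, observations)])

    # Expected case: the cascade removes the observations as well.
    observations = cleanup("tf-1", configs, statuses, observations)
    print(find_orphans(configs, observations))
```

In this model, the leftover observations in the faulty case correspond to the Traceflow Observations that persist after their Traceflow Config and Traceflow Status have been deleted.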
Workaround:
Make sure you have an up-to-date backup in place and know the passphrase for the backup.
The following workaround clears all entries from the TraceflowObservation table, allowing the queue to process messages again.
corfu_tool_runner.py -n nsx -t TraceflowObservation -o clearTable