The NSX edge node(s) show high CPU usage for the datapathd process when running top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ TGID COMMAND
##### root 20 0 33.6g 312240 56028 R 383.3 0.5 103000:42 4961 /opt/vmware/nsx-edge/sbin/datapathd --no-chdir --unixctl=/var/run/vmware/edge/dpd.ctl --pidfile=/va+
#####
nsx-ops+ 20 0 2549096 1.1g 24028 S 127.8 1.7 507:17.47 3318 /usr/bin/opsAgent
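For a quick, non-interactive snapshot of the same information, you can run top in batch mode from the edge node root shell and filter for the processes of interest (a minimal sketch; the exact column layout may differ slightly between top versions):
top -b -n 1 | grep -E 'datapathd|opsAgent'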
The NSX manager service appl-proxy is consuming a large amount of CPU when running top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ TGID COMMAND
#### appl-proxy 20 0 18.4g 17.2g 5744 S 99.3 36.6 ##### #### appl-proxy
The NSX manager log shows a very large RPC message queue:
NSX 1214779 - [nsx@6876 comp="nsx-manager" subcomp="appl-proxy" s2comp="nsx-rpc" tid="1214790" level="INFO"] RpcConnection[3 Connected on unix:///var/run/vmware/appl-proxy/aph.sock 0] NsxRpc txQueue size reached 3412350. Last enqueued message {version: 1, flags: 0000, total size: 390, stream_id: ########-####-####-####-############} {type: Stream control(2), size: 0} {type: Frame(3), size: 60} {type: Payload(1), size: 197} {type: Trace(4), size: 36} {type: Trace(4), size: 57} with priority 64
NSX 1214779 - [nsx@6876 comp="nsx-manager" subcomp="appl-proxy" s2comp="nsx-rpc" tid="1214790" level="INFO"] RpcConnection[3 Connected on unix:///var/run/vmware/appl-proxy/aph.sock 0] Rpc messages dropped 18987
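To gauge how frequently the queue warnings and message drops are being logged, you can count the matching entries on the manager (a sketch assuming these appl-proxy messages are written to the manager's /var/log/syslog; adjust the path if they are routed to a different log file):
grep -c "NsxRpc txQueue size reached" /var/log/syslog
grep -c "Rpc messages dropped" /var/log/syslog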
In the NSX-T manager log /var/log/proton/nsxapi.log, we see the following:
INFO TraceflowRpcDispatch2 PolicyServiceImpl 4206 POLICY [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Creating entity: /global-infra/traceflows/xxxxxxxx-0f60-8d07-cdda-xxxxxxxxxxxx/traceflow-observa
Checking /var/log/syslog on the NSX edge node, we see a large number of traceflow entries:
grep -i traceflow syslog | wc -l
464213
The onset of high CPU on the NSX edge node correlates with the time traceflow entries started appearing in the edge node's /var/log/syslog.
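To confirm the correlation, you can bucket the traceflow entries by hour and compare against the CPU history (a sketch assuming the edge syslog uses an ISO 8601 timestamp in the first field; adjust the cut width if your timestamp format differs):
grep -i traceflow /var/log/syslog | awk '{print $1}' | cut -c1-13 | sort | uniq -c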
In the NSX manager UI, the NSX edge node(s) show as failed and the NSX manager cluster shows as degraded.
Traceflow is a manually driven operation, initiated from the NSX UI or by API call, and you may have initiated only a small number of these operations.
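For reference, you can list the traceflow sessions that actually exist on the manager and confirm that only a few were created (a sketch assuming the manager API endpoint /api/v1/traceflows and placeholder credentials; verify the endpoint against the NSX API guide for your version):
curl -k -u 'admin:<password>' https://<nsx-manager>/api/v1/traceflows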
The Tier-0 logical router that the traceflow traverses uses an overlay segment as its uplink and is configured for ECMP.
There is an issue with flow cache on the NSX edge node when the Tier-0 logical router uplink uses an overlay segment and ECMP: the flow cache incorrectly sets the traceflow bit on a packet even when the incoming packet does not have the traceflow bit set. This leads to a flood of traceflow logging (observations), which drives up CPU on the NSX edge node and on the NSX manager processing the flood of observations.
This issue is resolved in VMware NSX 4.2.1.4, available at Broadcom downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.
If this KB did not help resolve your issue, you can review the following KB for further troubleshooting steps: Troubleshooting NSX Traceflow