Edge nodes are in a failed state in the NSX UI due to Traceflow logging

Article ID: 382859

Updated On:

Products

VMware NSX

Issue/Introduction

  • The NSX edge node(s) show high CPU usage for the datapathd service when running top:

    PID   USER     PR NI VIRT    RES    SHR   S %CPU  %MEM TIME+     TGID COMMAND
    ##### root     20 0  33.6g   312240 56028 R 383.3 0.5  103000:42 4961 /opt/vmware/nsx-edge/sbin/datapathd --no-chdir --unixctl=/var/run/vmware/edge/dpd.ctl --pidfile=/va+
    ##### nsx-ops+ 20 0  2549096 1.1g   24028 S 127.8 1.7  507:17.47 3318 /usr/bin/opsAgent

  • The NSX manager service appl-proxy consumes a large amount of CPU when running top:

    PID   USER        PR  NI  VIRT    RES       SHR     S  %CPU     %MEM  TIME+   TGID COMMAND
    ####  appl-proxy  20  0   18.4g   17.2g     5744    S  99.3     36.6  #####   #### appl-proxy

  • The NSX manager log shows a very large RPC transmit queue (txQueue) and dropped RPC messages:

    NSX 1214779 - [nsx@6876 comp="nsx-manager" subcomp="appl-proxy" s2comp="nsx-rpc" tid="1214790" level="INFO"] RpcConnection[3 Connected on unix:///var/run/vmware/appl-proxy/aph.sock 0] NsxRpc txQueue size reached 3412350. Last enqueued message {version: 1, flags: 0000, total size: 390, stream_id: ########-####-####-####-############} {type: Stream control(2), size: 0} {type: Frame(3), size: 60} {type: Payload(1), size: 197} {type: Trace(4), size: 36} {type: Trace(4), size: 57} with priority 64

    NSX 1214779 - [nsx@6876 comp="nsx-manager" subcomp="appl-proxy" s2comp="nsx-rpc" tid="1214790" level="INFO"] RpcConnection[3 Connected on unix:///var/run/vmware/appl-proxy/aph.sock 0] Rpc messages dropped 18987

  • The NSX manager log /var/log/proton/nsxapi.log shows entries such as the following:

    INFO TraceflowRpcDispatch2 PolicyServiceImpl 4206 POLICY [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Creating entity: /global-infra/traceflows/xxxxxxxx-0f60-8d07-cdda-xxxxxxxxxxxx/traceflow-observa

  • The NSX edge node /var/log/syslog contains a large number of traceflow entries:

    grep -i traceflow syslog | wc -l

    464213

  • The high CPU usage on the NSX edge node correlates with the time traceflow entries started appearing in the NSX edge node /var/log/syslog.

  • In the NSX manager UI, the NSX edge node(s) show as failed and the NSX manager cluster shows as degraded.

  • Traceflow is a manually initiated operation, run from the NSX UI or by API call, and you may have initiated only a small number of these operations.

  • The logical router (T0), which the traceflow traverses, uses an overlay segment as the uplink and is set to use ECMP.
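
To see when the flood of traceflow entries began, and so compare it with the onset of high CPU on the edge node, the grep count above can be broken down per hour. The following is a minimal sketch rather than a supported tool. The function name traceflow_per_hour is hypothetical, and it assumes that every syslog line starts with an ISO 8601 timestamp (for example 2024-11-05T10:23:45.123Z), so that the first 13 characters identify the date and hour.

```shell
# Hypothetical helper: count traceflow log lines per hour in a syslog file.
# Assumption: each line begins with an ISO 8601 timestamp, so characters
# 1-13 (e.g. "2024-11-05T10") identify the date and hour.
traceflow_per_hour() {
    grep -i traceflow "$1" |
        awk '{ count[substr($0, 1, 13)]++ }
             END { for (h in count) print h, count[h] }' |
        sort
}

# Example invocation on an edge node (path taken from the article):
#   traceflow_per_hour /var/log/syslog
```

Each output line is an hour prefix followed by the number of matching entries in that hour. A sharp jump from one hour to the next indicates roughly when the flood started.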

Cause

A flowcache issue on the NSX edge node affects T0 logical routers whose uplink uses an overlay segment with ECMP. In this configuration, flowcache incorrectly sets the traceflow bit on a packet even if the incoming packet does not have the traceflow bit set. This leads to a flood of traceflow logging (observations), which increases CPU usage on the NSX edge node and on the NSX manager that processes the observations.

Resolution

This issue is resolved in VMware NSX 4.2.1.4, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.
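
When reviewing several environments, it can help to compare an installed version string against the fixed release named above. The sketch below is illustrative only. The function name version_at_least is hypothetical, the version string is supplied by hand, and the comparison relies on GNU sort's -V (version sort) option. The article states only that 4.2.1.4 contains the fix. Treating every higher version number as fixed is an assumption and should be confirmed against the release notes for that version.

```shell
# Hypothetical helper: succeed when version $1 is at or above version $2.
# Relies on GNU sort -V (version sort); not available in every sort variant.
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example: the installed version is typed in manually here.
if version_at_least "4.2.1.3" "4.2.1.4"; then
    echo "at or above 4.2.1.4"
else
    echo "below 4.2.1.4"
fi
```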

Additional Information

If this KB did not help resolve your issue, you can review the following KB for further troubleshooting steps: Troubleshooting NSX Traceflow