High CPU is seen on the NSX-T edge node due to Traceflow logging

Article ID: 382859

Updated On:

Products

VMware NSX

Issue/Introduction

  • The edge node(s) show high CPU for the datapathd service when running top:

       PID   USER PR NI VIRT RES    SHR   S %CPU  %MEM TIME+ TGID COMMAND

       <PID> root 20 0 33.6g 312240 56028 R 383.3 0.5 103000:42 4961 /opt/vmware/nsx-edge/sbin/datapathd --no-chdir --unixctl=/var/run/vmware/edge/dpd.ctl --pidfile=/va+

       <PID> nsx-ops+ 20 0 2549096 1.1g 24028 S 127.8 1.7 507:17.47 3318 /usr/bin/opsAgent

  • In the NSX-T Manager log /var/log/proton/nsxapi.log, entries like the following appear:

    INFO TraceflowRpcDispatch2 PolicyServiceImpl 4206 POLICY [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Creating entity: /global-infra/traceflows/xxxxxxxx-0f60-8d07-cdda-xxxxxxxxxxxx/traceflow-observa

  • Checking the edge node /var/log/syslog, we see a large number of traceflow entries:

    grep -i traceflow syslog |wc -l

    464213

  • The high CPU on the edge node correlates with the time traceflow entries started appearing in the edge node /var/log/syslog.

  • Traceflow is a manually driven operation, started from the NSX-T UI or by an API call, and you may have initiated only a small number of these operations.
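
To check the correlation between the traceflow entries and the CPU spike, it can help to count the entries per hour instead of in total. The following is a minimal sketch, not part of the original procedure. The function name traceflow_per_hour is hypothetical, and the sketch assumes each syslog line begins with an ISO 8601 timestamp (for example 2024-01-01T10:00:01.000Z), so that the first 13 characters identify the date and hour.

```shell
# Count traceflow syslog entries per hour.
# Assumption: every line starts with an ISO 8601 timestamp, so the first
# 13 characters ("YYYY-MM-DDTHH") identify the hour the entry was logged.
traceflow_per_hour() {
    logfile="${1:-/var/log/syslog}"
    grep -i traceflow "$logfile" | cut -c1-13 | sort | uniq -c
}
```

Run it on the edge node as traceflow_per_hour /var/log/syslog. Each output line shows a count followed by the hour, which can then be compared with the time the CPU usage began to rise.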

Cause

Edge CPU usage can be high when the edge is busy processing an extremely large number of traceflow observations. This can impact the services /usr/bin/opsAgent and /opt/vmware/nsx-edge/sbin/datapathd.

Resolution

Traceflow is a manual operation. The number of observations seen in the edge syslog should match the number of links the traceflow packet traverses inside the edge node.
If the traceflows were not generated in this environment, packet captures are needed to identify where they are coming from.
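
Before capturing packets, one way to judge whether the traceflows were generated in this environment is to count how many distinct traceflow IDs the manager has logged. The following is a rough sketch only. The function name count_traceflow_ids is hypothetical, and it assumes the manager log entries carry the traceflow ID in a path of the form /global-infra/traceflows/<id>/, as in the nsxapi.log sample shown in the Issue/Introduction section.

```shell
# Count distinct traceflow IDs referenced in the manager log.
# Assumption: the ID appears as /global-infra/traceflows/<id>/... in each entry,
# matching the sample nsxapi.log line earlier in this article.
count_traceflow_ids() {
    logfile="${1:-/var/log/proton/nsxapi.log}"
    grep -o '/global-infra/traceflows/[^/ ]*' "$logfile" | sort -u | wc -l
}
```

Run it on the NSX-T manager as count_traceflow_ids /var/log/proton/nsxapi.log. A small number of IDs on the manager alongside hundreds of thousands of traceflow entries on the edge would suggest the traceflow traffic originates elsewhere.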

To capture traceflow traffic

  • At the host level, identify which host the edge node resides on.

    • If the host is prepared for NSX, run the following command:

      nsxdp-cli vswitch instance list

      • This returns a list of VMs with the switchports and uplinks they use:

        Sample result:

        nsxedge02.eth2 <switchport> xxxxxxxx-8952-4492-abc3-xxxxxxxxxxxx xx:xx:xx:xx:xx:xx vmnicX

        nsxedge02.eth1 <switchport> xxxxxxxx-ab3d-4a04-b2b0-xxxxxxxxxxxx xx:xx:xx:xx:xx:xx vmnicX

      • Then identify which vmnic the edge node uses for its uplink. In the example above, it uses vmnicX.

    • If the host where the edge resides is not prepared for NSX, use the following to identify the vmnic used for the uplinks of the edge node:

      esxtop

      • Then press 'n' for the networking view. This lists the switchports and vmnics used by the edge node.

        Sample result:

        xxxxxxxx xxxxxxx:nsxedge02.eth2 vmnicX DvsPortset-0

        xxxxxxxx xxxxxxx:nsxedge02.eth1 vmnicX DvsPortset-0

  • Then use the following command to capture only traceflow traffic on the uplink:

    pktcap-uw --uplink vmnicX --capture UplinkSndKernel,UplinkRcvKernel --rcf 'udp and port 6081 and ether[43:1] & 0x80 != 0x80 and ether[54:1] & 0x20 == 0x20' -o -|tcpdump-uw -enr -

  • That command outputs the results to the console. To save them to a file instead:

    pktcap-uw --uplink vmnicX --capture UplinkSndKernel,UplinkRcvKernel --rcf 'udp and port 6081 and ether[43:1] & 0x80 != 0x80 and ether[54:1] & 0x20 == 0x20' -o /<path-to-save-to>/<packet-cap-file-name>

  • You should then see only traceflow traffic in the results, such as this:

    <date/time> <outer-source-mac> > <outer-destination-mac>, ethertype IPv4 (0x0800), length 186: 192.168.1.67.57454 > 192.168.1.5.6081: Geneve, Flags [C], vni 0xxxxx, proto TEB (0x6558), options [8 bytes]: <inner-source-mac> > <inner-destination-mac>, ethertype IPv4 (0x0800), length 128: 192.168.131.0 > 172.16.10.100: ICMP echo request, id 0, seq 0, length 94
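
The byte offsets in the capture filter used by the pktcap-uw commands above can be derived from the header layout. The sketch below rebuilds the same filter string from header lengths. It assumes an untagged outer Ethernet frame and an outer IPv4 header without IP options, since a VLAN tag or IP options would shift the offsets. The variable names are illustrative. Reading the 0x80 bit as the Geneve O (control packet) flag follows the Geneve header layout, while reading the 0x20 bit in the first option data byte as the traceflow marker is inferred from the filter in this article and is not documented here.

```shell
# Header lengths in bytes. Assumptions: no 802.1Q tag, IPv4 without options.
ETH_LEN=14          # outer Ethernet header
IPV4_LEN=20         # outer IPv4 header
UDP_LEN=8           # outer UDP header (destination port 6081 is Geneve)
GENEVE_BASE_LEN=8   # fixed part of the Geneve header
OPT_HDR_LEN=4       # Geneve option header: class (2), type (1), length (1)

GENEVE_START=$((ETH_LEN + IPV4_LEN + UDP_LEN))                    # 42
FLAGS_BYTE=$((GENEVE_START + 1))                                  # 43
OPT_DATA_BYTE=$((GENEVE_START + GENEVE_BASE_LEN + OPT_HDR_LEN))   # 54

# Byte 43, bit 0x80: Geneve O flag, which the filter requires to be clear.
# Byte 54, bit 0x20: first byte of option data; assumed here to be the bit
# that marks a traceflow packet, based on the filter used in this article.
TRACEFLOW_FILTER="udp and port 6081 and ether[${FLAGS_BYTE}:1] & 0x80 != 0x80 and ether[${OPT_DATA_BYTE}:1] & 0x20 == 0x20"
echo "$TRACEFLOW_FILTER"
```

The resulting string can be passed to the --rcf option in place of the literal filter. If the outer frame carries a VLAN tag, the offsets would need to be adjusted accordingly.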