The edge node(s) show high CPU for the datapathd service when running top:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ TGID COMMAND
<PID> root 20 0 33.6g 312240 56028 R 383.3 0.5 103000:42 4961 /opt/vmware/nsx-edge/sbin/datapathd --no-chdir --unixctl=/var/run/vmware/edge/dpd.ctl --pidfile=/va+
<PID> nsx-ops+ 20 0 2549096 1.1g 24028 S 127.8 1.7 507:17.47 3318 /usr/bin/opsAgent
In the NSX-T Manager log /var/log/proton/nsxapi.log, entries like the following appear:
INFO TraceflowRpcDispatch2 PolicyServiceImpl 4206 POLICY [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Creating entity: /global-infra/traceflows/xxxxxxxx-0f60-8d07-cdda-xxxxxxxxxxxx/traceflow-observa
Checking the edge node /var/log/syslog, we see a large number of traceflow entries:
grep -i traceflow syslog |wc -l
464213
The high CPU on the edge node correlates with the time traceflow entries started appearing in the edge node /var/log/syslog.
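To check that correlation, the traceflow entries can be counted per hour and compared with the time the CPU usage rose. The following is a minimal sketch, assuming each syslog line begins with an ISO 8601 timestamp such as 2024-01-15T10:23:45.123Z (an assumption; adjust the substring length if the lines in your log differ). The function name traceflow_per_hour is illustrative.

```shell
# Count traceflow syslog entries per hour.
# Assumes each line starts with an ISO 8601 timestamp, so the first
# 13 characters ("YYYY-MM-DDTHH") identify the hour.
traceflow_per_hour() {
  awk 'tolower($0) ~ /traceflow/ { n[substr($0, 1, 13)]++ }
       END { for (h in n) print h, n[h] }' "$1" | sort
}

# Usage: traceflow_per_hour /var/log/syslog
```

Each output line is an hour prefix followed by the number of traceflow entries logged in that hour.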
Traceflow is a manually driven operation, started from the NSX-T UI or by an API call, and you may have initiated only a small number of these operations.
Edge CPU usage can be high when the edge is busy processing an extremely large number of traceflow observations. This can impact the services /usr/bin/opsAgent and /opt/vmware/nsx-edge/sbin/datapathd.
Because traceflow is a manual operation, the number of observations seen in the edge syslog should match the number of links the traceflow packet traverses inside the edge node.
If the traceflows were not generated in this environment, packet captures are needed to identify where they are coming from.
To capture traceflow traffic
At the host level, identify which ESXi host the edge node resides on.
If the host is prepared for NSX, run the following command:
nsxdp-cli vswitch instance list
This returns a list of VMs with the switchports and uplinks they use:
Sample result:
nsxedge02.eth2 <switchport> xxxxxxxx-8952-4492-abc3-xxxxxxxxxxxx xx:xx:xx:xx:xx:xx vmnicX
nsxedge02.eth1 <switchport> xxxxxxxx-ab3d-4a04-b2b0-xxxxxxxxxxxx xx:xx:xx:xx:xx:xx vmnicX
Then identify which vmnic the edge node uses as its uplink. In the example above, it is vmnicX.
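On a host with many VMs, this lookup can be scripted. The following is a minimal sketch, assuming the output columns appear as in the sample above, with the port name first and the vmnic last (real output may include header lines or a different layout). The function name edge_uplinks is illustrative.

```shell
# Print the unique vmnics used by the ports of one edge VM.
# Reads "nsxdp-cli vswitch instance list" output on stdin and assumes
# the port name is the first column and the vmnic the last.
edge_uplinks() {
  awk -v vm="$1" 'index($1, vm ".") == 1 { print $NF }' | sort -u
}

# Usage: nsxdp-cli vswitch instance list | edge_uplinks nsxedge02
```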
If the host where the edge resides is not prepared for NSX, use the following to identify the vmnic used for the uplinks of the edge node:
esxtop
Then press 'n' for the networking view, which lists the switchports and vmnics used by the edge node.
Sample result:
xxxxxxxx xxxxxxx:nsxedge02.eth2 vmnicX DvsPortset-0
xxxxxxxx xxxxxxx:nsxedge02.eth1 vmnicX DvsPortset-0
Then use the following command to capture only traceflow traffic traversing the uplink in either direction:
pktcap-uw --uplink vmnicX --capture UplinkSndKernel,UplinkRcvKernel --rcf 'udp and port 6081 and ether[43:1] & 0x80 != 0x80 and ether[54:1] & 0x20 == 0x20' -o -|tcpdump-uw -enr -
That command prints the results to the console. To save them to a file instead:
pktcap-uw --uplink vmnicX --capture UplinkSndKernel,UplinkRcvKernel --rcf 'udp and port 6081 and ether[43:1] & 0x80 != 0x80 and ether[54:1] & 0x20 == 0x20' -o /<path-to-save-to>/<packet-cap-file-name>
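The byte offsets in the --rcf filter can be read against the Geneve header layout (RFC 8926). Assuming an untagged outer Ethernet header (14 bytes), an outer IPv4 header without options (20 bytes) and a UDP header (8 bytes), the Geneve header starts at frame offset 42. Byte 43 then carries the O (control packet) and C flags, and byte 54 is the first data byte of the first Geneve option. That the 0x20 bit in that option byte marks traceflow packets is inferred from the filter itself, not from a published specification. The sketch below applies the same two tests to a frame supplied as a hex string; the function name matches_traceflow_filter and the sample frame are illustrative.

```shell
# Return success if a frame (hex string, two characters per byte) would
# pass the two byte tests used in the --rcf filter.
matches_traceflow_filter() {
  hex=$1
  # Byte N occupies characters 2N+1 and 2N+2 of the hex string.
  b43=$(printf '%s' "$hex" | cut -c87-88)    # Geneve flags byte (O, C bits)
  b54=$(printf '%s' "$hex" | cut -c109-110)  # first data byte of first option
  [ -n "$b43" ] && [ -n "$b54" ] || return 1
  # The AND is evaluated inside $(( )) and compared afterwards, mirroring
  # how the capture filter groups "ether[43:1] & 0x80 != 0x80".
  [ $(( 0x$b43 & 0x80 )) -ne 128 ] && [ $(( 0x$b54 & 0x20 )) -eq 32 ]
}

# Example: 42 zeroed outer-header bytes, then a Geneve header with only
# the C flag set (0x40) and a first option whose data starts with 0x20.
outer=$(printf '%084d' 0)
frame="${outer}02406558000001000000800120000000"
matches_traceflow_filter "$frame" && echo "frame matches"
```

If the outer frame carries a VLAN tag or IPv4 options, the Geneve header moves and these offsets no longer line up.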
You should then see only traceflow traffic in the results, such as this:
<date/time> <outer-source-mac> > <outer-destination-mac>, ethertype IPv4 (0x0800), length 186: 192.168.1.67.57454 > 192.168.1.5.6081: Geneve, Flags [C], vni 0xxxxx, proto TEB (0x6558), options [8 bytes]: <inner-source-mac> > <inner-destination-mac>, ethertype IPv4 (0x0800), length 128: 192.168.131.0 > 172.16.10.100: ICMP echo request, id 0, seq 0, length 94
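To narrow down where the traceflows are coming from, the outer source addresses in the capture (the sending tunnel endpoints) can be tallied. The following is a minimal sketch, assuming tcpdump text output in the format shown above; the function name count_traceflow_sources is illustrative.

```shell
# Tally Geneve packets by outer source IP.
# Reads tcpdump text output on stdin and expects the outer addresses in
# the form "<src-ip>.<src-port> > <dst-ip>.6081:".
count_traceflow_sources() {
  grep -oE '([0-9]{1,3}\.){4}[0-9]+ > ([0-9]{1,3}\.){4}6081:' |
    awk '{ sub(/\.[0-9]+$/, "", $1); n[$1]++ }
         END { for (ip in n) print n[ip], ip }' |
    sort -rn
}

# Usage:
#   tcpdump-uw -enr /<path-to-save-to>/<packet-cap-file-name> | count_traceflow_sources
```

Each output line is a packet count followed by an outer source IP, with the busiest source first.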