NSX-T Edge Transport Nodes - the Edge CPU has reached XX % which is at or above the high threshold value of 60%

Article ID: 324391

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • Edge Transport Nodes report high CPU usage.
  • Alarms are raised in the Alarms section of the NSX-T UI.
  • In the NSX-T UI, navigate to System -> Fabric -> Nodes -> Edge Transport Nodes. Select the impacted Edge TN and go to Monitor. Services CPU is reported as high (between 50% and 70%; alarms are triggered from 60%).
  • On the same page, confirm that Datapath CPU is fine (under 50%).
  • Confirm that the high CPU usage is due to the QoS process:
    1. Access the Edge as root and identify the datapathd PID: ps -aux | grep "datapathd".
    2. Run the command "top -H -p <Datapathd PID>".
    Output similar to the following sample is displayed:
    top - 12:02:23 up 18 days, 22:55,  1 user,  load average: 3.09, 3.52, 3.05
    Threads:  34 total,   3 running,  31 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 14.3 us,  7.5 sy,  0.0 ni, 78.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem :  7962892 total,   107536 free,  3884320 used,  3971036 buff/cache
    KiB Swap:        0 total,        0 free,        0 used.  3961740 avail Mem
    
      PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
     2936 root      20   0 32.724g  51932  20116 R 91.2  0.7   4:45.58 qos14
     2807 root      20   0 32.724g  51932  20116 S  5.8  0.7   3132:44 dp-fp:0
     2905 root      20   0 32.724g  51932  20116 S  5.8  0.7   3251:53 dp-fp:1
     2916 root      20   0 32.724g  51932  20116 R  3.9  0.7 166:07.93 dp-bfd-mon4
    
  • QoS was enabled on the Tier-1 router. This can be confirmed in /var/log/syslog on the impacted Edge:
<182>1 2020-10-27T11:46:28.318950+00:00 edge02.corp.local NSX 2814 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd.dpc_pb(dp-ipc15)" level="INFO"] QoS enabled on lrouter a21f20eb-bf07-4ced-bc8b-5dfd7b0d8f35, dir: 1, committed_bw: 1, burst_size: 1
<182>1 2020-10-27T11:49:48.024808+00:00 edge02.corp.local NSX 2814 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd.dpc_pb(dp-ipc15)" level="INFO"] QoS enabled on lrouter 54f67152-2131-4e22-bf34-5e9773b58c3a, dir: 1, committed_bw: 1, burst_size: 1
  • QoS was then disabled on the same routers (same file, /var/log/syslog on the impacted Edge):
<182>1 2020-10-27T11:56:22.440196+00:00 edge01.corp.local NSX 2807 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd.dpc_pb(dp-ipc12)" level="INFO"] QoS disabled on lrouter a21f20eb-bf07-4ced-bc8b-5dfd7b0d8f35, dir: 1
<182>1 2020-10-27T11:56:45.423255+00:00 edge01.corp.local NSX 2807 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd.dpc_pb(dp-ipc12)" level="INFO"] QoS disabled on lrouter 54f67152-2131-4e22-bf34-5e9773b58c3a, dir: 1
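
The two checks above (CPU used by the qos thread of datapathd, and QoS enabled/disabled entries in /var/log/syslog) can be scripted. The following is a minimal sketch, not NSX tooling: the function names qos_thread_cpu and qos_enabled_then_disabled are hypothetical, and the parsing assumes the top and syslog formats match the samples shown in this article.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and the parsing assumes the
# output formats shown in the samples above.

# Read `top -H -b -n 1 -p <Datapathd PID>` output on stdin and print
# "<thread name> <%CPU>" for the busiest thread whose name starts with "qos".
qos_thread_cpu() {
    awk '
        BEGIN { max = -1 }
        $1 == "PID" { in_table = 1; next }   # thread rows follow the header row
        in_table && $NF ~ /^qos/ {
            cpu = $9 + 0                     # column 9 is %CPU in the sample
            if (cpu > max) { max = cpu; name = $NF }
        }
        END { if (name != "") print name, max }
    '
}

# Read syslog lines on stdin and print the UUID of every lrouter on which
# QoS was enabled and later disabled, one per line, sorted.
qos_enabled_then_disabled() {
    awk '
        /QoS enabled on lrouter/ {
            id = $0
            sub(/.*QoS enabled on lrouter /, "", id)
            sub(/,.*/, "", id)
            state[id] = "enabled"
        }
        /QoS disabled on lrouter/ {
            id = $0
            sub(/.*QoS disabled on lrouter /, "", id)
            sub(/,.*/, "", id)
            if (state[id] == "enabled") state[id] = "toggled"
        }
        END { for (id in state) if (state[id] == "toggled") print id }
    ' | sort
}

# Example use on the impacted Edge, as root (PID holds the datapathd PID):
#   top -H -b -n 1 -p "$PID" | qos_thread_cpu
#   qos_enabled_then_disabled < /var/log/syslog
```

A qos thread with high %CPU, together with one or more router UUIDs from the second function, matches the symptoms described in this article.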



Environment

VMware NSX-T Data Center

Resolution

Currently there is no resolution.

Workaround:
There are two possible workarounds:
If you intend to use T1 router Ingress QoS, enable it on the T1 and the issue will disappear.

If you don't intend to use T1 router Ingress QoS, restart the dataplane service on the impacted Edge TNs once it is disabled:
  1. Access the Edge over SSH as admin.
  2. Run the CLI command to restart the dataplane service: "restart service dataplane".
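
When several Edge TNs are impacted, the restart steps above can be repeated one Edge at a time. The following is a minimal sketch under stated assumptions: restart_dataplane and SSH_CMD are hypothetical names, and it assumes the Edge accepts an NSX CLI command passed as an SSH remote command. If it does not in your environment, run the command interactively as described above. Restarting the dataplane service interrupts traffic forwarding on that Edge while the service restarts, so the loop handles the Edges sequentially and stops at the first failure.

```shell
#!/bin/sh
# Sketch only. restart_dataplane is a hypothetical helper name.
# Assumption: the Edge accepts an NSX CLI command as an SSH remote command.
# Set SSH_CMD=echo to preview the commands without connecting.
restart_dataplane() {
    for edge in "$@"; do
        echo "Restarting dataplane on $edge" >&2
        ${SSH_CMD:-ssh} "admin@$edge" "restart service dataplane" || return 1
    done
}

# Example (host names are placeholders):
#   restart_dataplane edge01.corp.local edge02.corp.local
```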
To confirm whether QoS is enabled:
  1. Navigate to Networking -> Tier-1 Gateways.
  2. Expand the T1 router configuration and expand Additional Settings.
The Ingress QoS Profile field shows whether a profile is set. In the example captured for this article, the T1 router Ingress QoS Profile is not set.
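
The same setting can also be read without the UI. The following is a minimal sketch, assuming the Tier-1 gateway object returned by the NSX Policy API exposes the ingress profile as qos_profile.ingress_qos_profile_path. Verify the path and field name against the API guide for your NSX-T version. The function name ingress_qos_profile is hypothetical, and the text matching is a simplification that expects the JSON of a single Tier-1 gateway; a JSON-aware tool is more robust.

```shell
#!/bin/sh
# Sketch only. Reads the JSON of ONE Tier-1 gateway on stdin and prints the
# ingress QoS profile path, or "not set" when the field is absent.
# Assumption: the field is named "ingress_qos_profile_path".
ingress_qos_profile() {
    path=$(tr -d '\n' |
        sed -n 's/.*"ingress_qos_profile_path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
    if [ -n "$path" ]; then
        echo "$path"
    else
        echo "not set"
    fi
}

# Example (manager address and Tier-1 ID are placeholders):
#   curl -sk -u admin "https://<nsx-manager>/policy/api/v1/infra/tier-1s/<tier-1-id>" \
#       | ingress_qos_profile
```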