vCenter reports NSX Edge CPU usage close to 100% but Edge vCPUs are not fully saturated
Article ID: 336538
Products
VMware NSX
Issue/Introduction
This article explains how vCenter reports NSX Edge CPU statistics compared with the Edge's own reporting, and how the Edge uses its individual vCPUs.
vCenter reports NSX Edge CPU utilization of 100% or higher, but the Edge vCPUs are not actually fully utilized.
To confirm this issue, verify that the Edge vCPUs show usage below the alerting threshold.
Step 1
Open an SSH session to the ESXi host where the active Edge resides and run the following command:
esxtop
While esxtop is running, you can narrow what content to display. Use "V" to show only virtual machine worlds. You can select specific rows in esxtop by using "2" to scroll down and "8" to scroll up. You can remove highlighted rows from the view by pressing "4". This may be necessary to cull extraneous rows and view relevant information.
Find the Edge in question in the NAME column and make a note of it. The %Used column for the Edge may show a very high percentage, up to 500%. This is the number that vCenter reports as the CPU usage of the Edge.
In the same row, make a note of the GID (Group ID) of the Edge. In esxtop, you can expand the statistics for that specific group to show details of all worlds associated with that GID. To do this, press "e" and then enter the GID number.
Here, the %Used column shows the CPU percentage for each individual vCPU on the Edge; the top row of that GID shows the aggregate for the VM's vCPUs. %Sys is the time consumed by ESXi kernel threads on behalf of this VM.
vCenter's total utilization for this Edge is the total vCPU sum plus the kernel threads (%Used + %Sys).
The Edge's report of its own CPU usage is the total vCPU sum.
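The gap between the two figures can be illustrated with a small calculation. This is only a sketch and not part of any VMware tooling: the function names and the sample percentages are hypothetical, and the inputs are assumed to be read by hand from the expanded esxtop view.

```python
def edge_reported_cpu(vcpu_used):
    """Total the Edge itself reports: the sum of per-vCPU %Used."""
    return sum(vcpu_used)


def vcenter_reported_cpu(vcpu_used, sys_pct):
    """Total vCenter shows: per-vCPU %Used plus ESXi kernel-thread time (%Sys)."""
    return sum(vcpu_used) + sys_pct


# Hypothetical readings for a 4-vCPU Edge (not taken from a real host).
vcpu_used = [40.0, 35.0, 30.0, 25.0]
sys_pct = 150.0

print("Edge view:   ", edge_reported_cpu(vcpu_used))
print("vCenter view:", vcenter_reported_cpu(vcpu_used, sys_pct))
```

With these sample numbers the vCenter figure is more than double the Edge figure, even though no single vCPU is above 40%.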
In this view, you'll see the individual vCPUs for the Edge. The number of vCPUs on an NSX Edge depends on the form factor:
Compact - 1 vCPU
Large - 2 vCPU
Quad Large - 4 vCPU
X-Large - 6 vCPU
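Because the group row sums usage across vCPUs, a figure well above 100% does not by itself mean that any one vCPU is saturated. The following sketch divides a group-level %Used value by the vCPU count for the form factor to get a rough per-vCPU average. The helper name and the sample value are hypothetical, and the average is only a first approximation; the per-vCPU rows in esxtop remain the figures to rely on.

```python
# vCPU counts per form factor, as listed above.
FORM_FACTOR_VCPUS = {
    "Compact": 1,
    "Large": 2,
    "Quad Large": 4,
    "X-Large": 6,
}


def average_per_vcpu(group_used_pct, form_factor):
    """Spread a group-level %Used figure evenly across the Edge's vCPUs."""
    return group_used_pct / FORM_FACTOR_VCPUS[form_factor]


# Hypothetical example: 300% group %Used on an X-Large Edge.
print(average_per_vcpu(300.0, "X-Large"))
```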
Step 2
Open an SSH session to the Edge VM, log in with admin credentials, and run the following command:
show process monitor
In the output, make sure that the "USAGE" of each core is less than 80%.
If it is above 80%, check for packet drops by running the following command on the ESXi host where the Edge VM resides:
esxtop
Press "n" to view the Edge VM's network usage, and look for problems by evaluating PKTTX/s, PKTRX/s, %DRPTX, and %DRPRX.
A high packets-per-second (PPS) rate is the usual reason for high CPU utilization in networking components. Reduce the traffic flowing through the Edge VM and check whether CPU utilization goes down.
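The decision logic in Step 2 can be sketched as follows. The function names, core labels, and sample readings are hypothetical; the sketch assumes the values are copied by hand from the Edge CLI output and from the esxtop network view, and it does not parse either tool's output.

```python
USAGE_THRESHOLD_PCT = 80.0  # per-core threshold named in Step 2


def cores_over_threshold(core_usage, threshold=USAGE_THRESHOLD_PCT):
    """Return the names of cores whose usage exceeds the threshold."""
    return [name for name, pct in core_usage.items() if pct > threshold]


def has_packet_drops(drp_tx_pct, drp_rx_pct):
    """True if either esxtop drop counter (%DRPTX, %DRPRX) is non-zero."""
    return drp_tx_pct > 0 or drp_rx_pct > 0


# Hypothetical readings (not taken from a real Edge).
core_usage = {"core0": 45.0, "core1": 91.5, "core2": 62.0, "core3": 83.0}
busy = cores_over_threshold(core_usage)
if busy:
    print("Cores above threshold:", busy)
    if has_packet_drops(drp_tx_pct=0.0, drp_rx_pct=1.2):
        print("Packet drops observed; consider reducing traffic through the Edge.")
```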
Environment
VMware NSX for vSphere 6.4.x
Cause
vCenter reports CPU usage in the GUI for any VM using the output of esxtop system time.
esxtop system time is an aggregate value: the total CPU usage of the VM's vCPUs plus the system time consumed by ESXi kernel threads on that VM's behalf (%Used + %Sys).
The system time consumed by kernel threads is very low for most VMs, but an Edge has network threads that can consume a lot of CPU while handling traffic.
Resolution
Open an SSH session to the Edge VM, log in with admin credentials, and run the following command:
get dataplane cpu stats
If usage on the dataplane is above 80%, consider scaling the Edge node infrastructure horizontally or vertically.
Additional Information
On an X-Large Edge, the last two vCPUs are reserved for encryption, load balancing, and management functions, so their %Used may be very low compared with the other four vCPUs. On Quad Large and smaller form factors, these tasks do not have reserved vCPUs.
This is technically a Linux accounting bug that affects any Linux virtual machine with heavy I/O. Edges are simply more prone to the issue because they carry such heavy network traffic.