vCenter and vRNI Report 100% NSX-T Edge CPU Utilization but Edge vCPUs are not fully utilized



Article ID: 329045



Products

VMware NSX
VMware NSX-T Data Center

Issue/Introduction

This article explains how vCenter and vRNI report NSX-T Edge CPU statistics compared with the Edge's own reporting, and how the Edge utilizes its individual vCPUs.

vCenter and vRNI report NSX-T Edge CPU utilization of 100% or higher, but the Edge vCPUs are not actually fully utilized.

To confirm this issue, verify that the Edge vCPUs show usage below the alerting threshold.
 

Step 1
Open an SSH session to the ESXi host where the active Edge resides. From the ESXi SSH session, run the following:

esxtop
 

While esxtop is running, you can narrow what content to display. Use "V" to show only virtual machine worlds. You can select specific rows in esxtop by using "2" to scroll down and "8" to scroll up. You can remove highlighted rows from the view by pressing "4". This may be necessary to cull extraneous rows and view relevant information.

In the NAME column, find the Edge in question. The %Used column may show a very high percentage for the Edge, up to 500%. This is the number that vCenter reports as the CPU usage of the Edge.

In that same row, note the GID (Group ID) of the Edge. In esxtop, you can expand the statistics for that specific group to show details of all worlds associated with that GID. To do this, press "e" and then enter the GID number.

Here, the %Used column shows the CPU % for each individual vCPU on the Edge. The top row of that GID shows the aggregate of the VM's vCPUs. The %Sys column is the time consumed by ESX kernel threads on behalf of this VM.

vCenter's total utilization for this Edge is the total vCPU sum plus the kernel threads (%Used + %Sys).

The Edge's report of its own CPU usage is the total vCPU sum.
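The difference between the two views can be worked through with a short Python sketch. The per-vCPU and %Sys numbers below are made-up illustrative readings, not values from a real host, and the function names are hypothetical; the only thing taken from this article is the arithmetic (vCPU sum versus vCPU sum plus %Sys).

```python
# Hypothetical esxtop readings for one Edge group (illustrative numbers only).
vcpu_used = [35.0, 40.0, 30.0, 45.0]  # %Used for each vCPU world
sys_pct = 260.0                       # %Sys: ESX kernel threads working for this VM


def edge_view(vcpu_used):
    """CPU total as the Edge reports it: the sum of its vCPUs only."""
    return sum(vcpu_used)


def vcenter_view(vcpu_used, sys_pct):
    """CPU total as vCenter reports it, per this article: vCPU sum plus %Sys."""
    return sum(vcpu_used) + sys_pct


edge_total = edge_view(vcpu_used)
vcenter_total = vcenter_view(vcpu_used, sys_pct)

print(f"Edge view of CPU usage:    {edge_total}%")
print(f"vCenter view of CPU usage: {vcenter_total}%")
print(f"Gap attributable to %Sys:  {vcenter_total - edge_total}%")
```

With these sample numbers the Edge sees 150% across four vCPUs while vCenter sees 410%, which is how an alert can fire even though no vCPU is close to saturation.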

In this view, you'll see individual vCPUs for the Edge. The number of vCPUs on an NSX-T Edge depends upon the form factor.

The vCPU count for each form factor is:
Small Edge - 2 vCPU
Medium Edge - 4 vCPU
Large Edge - 8 vCPU
XL Edge - 16 vCPU
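Because the aggregate %Used is a sum across vCPUs, it helps to relate it to the form factor before judging whether the Edge is busy. The following Python sketch is only an illustration: the dictionary mirrors the list above, while the helper names and the sample reading of 150% are assumptions made for the example.

```python
# vCPU count per NSX-T Edge form factor, as listed in this article.
EDGE_VCPUS = {"small": 2, "medium": 4, "large": 8, "xl": 16}


def vcpu_ceiling_pct(form_factor):
    """Highest possible vCPU sum for a form factor (100% per vCPU)."""
    return EDGE_VCPUS[form_factor.lower()] * 100.0


def average_vcpu_pct(form_factor, vcpu_sum_pct):
    """Average load per vCPU, given the aggregate vCPU %Used from esxtop."""
    return vcpu_sum_pct / EDGE_VCPUS[form_factor.lower()]


# Hypothetical reading: a Medium Edge whose vCPUs sum to 150 %Used.
print(vcpu_ceiling_pct("medium"))
print(average_vcpu_pct("medium", 150.0))
```

An average hides a single hot core, so this is a quick sanity check only; the per-core check in Step 2 remains the deciding one.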

Step 2

Open an SSH session to the Edge VM, log in with admin credentials, and run the following command:
get dataplane cpu stats

In the output, make sure that the "USAGE" of each core is less than 80%.

If it is above 80%, check for packet drops by running the following on the ESXi host where the Edge VM resides:

esxtop - Press "n" to view Edge VM network usage, and look for problems by evaluating PKTTX/s, PKTRX/s, %DRPTX, and %DRPRX.

The packets-per-second (PPS) rate drives high CPU utilization in networking components. Reduce the traffic flow through the Edge VM and check whether CPU utilization goes down.
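The decision logic in Step 2 can be sketched in Python as follows. This is a minimal illustration, not a parser: it assumes the per-core usage values and the %DRPTX/%DRPRX readings have been copied by hand from the command output, and every name and sample number in it is hypothetical. Only the 80% threshold comes from this article.

```python
USAGE_THRESHOLD = 80.0  # per-core limit named in this article


def cores_over_threshold(core_usage, threshold=USAGE_THRESHOLD):
    """Return the cores whose usage is not below the threshold.

    core_usage maps a core number to its usage percentage, copied manually
    from the Edge CLI output (assumed input shape).
    """
    return {core: pct for core, pct in core_usage.items() if pct >= threshold}


def has_packet_drops(drptx_pct, drprx_pct):
    """True if esxtop shows any transmit or receive drops for the Edge VM."""
    return drptx_pct > 0.0 or drprx_pct > 0.0


# Hypothetical readings for a four-core dataplane.
usage = {0: 42.0, 1: 85.5, 2: 38.0, 3: 12.0}
busy = cores_over_threshold(usage)

if busy:
    print(f"Cores at or above {USAGE_THRESHOLD}%: {sorted(busy)}")
    print("Check %DRPTX and %DRPRX in esxtop (press 'n').")
else:
    print("All dataplane cores are below the threshold.")
```

A core at exactly 80% is treated as over the limit here, because the text asks for usage below 80%; adjust the comparison if a different reading is preferred.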

Environment

VMware NSX-T Data Center
VMware NSX

Cause

vCenter reports CPU usage in the GUI for any VM using the esxtop system time for that VM.

The esxtop system time is an aggregate value: the total CPU usage of the VM's vCPUs plus the system time consumed by ESX kernel threads on that VM's behalf (%Used + %Sys).

System time consumed by kernel threads is very low for most VMs, but an Edge has network threads that can consume a lot of CPU when handling traffic.

Resolution

Open an SSH session to the Edge VM, log in with admin credentials, and run the following command:
get dataplane cpu stats

If usage on the dataplane is above 80%, consider expanding the Edge node infrastructure horizontally (more Edge nodes) or vertically (a larger form factor).

Additional Information

This is a Linux accounting bug that affects any Linux virtual machine with heavy I/O.

Edges are simply more prone to the issue because they carry such heavy network traffic.

The following resource helps interpret esxtop output:
Interpreting esxtop Statistics

Impact/Risks:
vCenter and vRNI raise alerts for high or 100% CPU utilization on the Edge VM.