Network Performance Monitoring or troubleshooting with VMware Cloud Foundation Operations (VCFOPS)

Article ID: 419008

Updated On:

Products

VMware vSphere ESXi

VCF Operations

Issue/Introduction

  • You want to use the metrics in VCF Operations to troubleshoot or monitor a VM for suspected packet drops, low throughput, or high latency.
  • This KB article details how to diagnose or monitor for performance bottlenecks, focusing on how CPU exhaustion leads to network performance degradation and packet loss.
  • It covers key metrics such as CPU utilization, scheduling delays (Ready Time), and packet drops. Note that the metrics covered are available starting with VCF Operations 9.1; a sketch of retrieving these metrics programmatically follows this list.
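
The following Python sketch shows one way to pull the latest values of such metrics for a single VM through the VCF Operations Suite API. It is illustrative only: the host name, credentials, and statKey strings are placeholders (the real metric keys must be looked up in your environment), and the Authorization header name varies by release, so verify the details against the Suite API documentation for VCF Operations 9.1.

# Illustrative sketch: fetch the latest values of selected metrics for one VM
# from the VCF Operations Suite API. Host, credentials, and statKey values are
# placeholders and must be replaced with values from your environment.
import requests

OPS_HOST = "vcfops.example.com"              # placeholder FQDN
VM_NAME = "app-vm-01"                        # placeholder VM name
STAT_KEYS = [                                # placeholder metric keys
    "cpu|maxVcpuUsagePct",
    "cpu|vcpusOver90Pct",
    "net|rxDroppedPpsPct",
]

def acquire_token(session):
    # POST /suite-api/api/auth/token/acquire returns an API token.
    resp = session.post(
        f"https://{OPS_HOST}/suite-api/api/auth/token/acquire",
        json={"username": "admin", "password": "changeme"},
        verify=False,   # lab-only; use proper certificate validation in production
    )
    resp.raise_for_status()
    return resp.json()["token"]

def main():
    session = requests.Session()
    session.headers["Accept"] = "application/json"
    token = acquire_token(session)
    # The Authorization scheme name differs by release (e.g. "OpsToken" or
    # "vRealizeOpsToken"); check the Suite API docs for your version.
    session.headers["Authorization"] = f"OpsToken {token}"

    # Look up the VM resource by name.
    resources = session.get(
        f"https://{OPS_HOST}/suite-api/api/resources",
        params={"name": VM_NAME},
        verify=False,
    )
    resources.raise_for_status()
    resource_id = resources.json()["resourceList"][0]["identifier"]

    # Fetch the most recent sample for each selected stat key.
    stats = session.get(
        f"https://{OPS_HOST}/suite-api/api/resources/{resource_id}/stats/latest",
        params={"statKey": STAT_KEYS},
        verify=False,
    )
    stats.raise_for_status()
    for stat in stats.json()["values"][0]["stat-list"]["stat"]:
        print(stat["statKey"]["key"], stat["data"])

if __name__ == "__main__":
    main()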

Environment

VMware vSphere ESX

VCF Operations 9.1

Cause

  • Network performance issues, such as packet drops, low throughput, or high latency, typically indicate underlying system limitations. In a virtualized Software-Defined Networking (SDN) environment, the CPU is the most critical shared resource; high-speed Network Interface Cards (NICs) ensure that network bandwidth is rarely the primary bottleneck.
  • Troubleshooting network performance issues is a complex undertaking. The following provides general guidelines and key metrics for addressing common network performance issues, which are often caused by high VCPU/datapath CPU utilization or significant physical CPU contention.

vCPU and Datapath CPU Utilization:

  • The packet processing workflow involves a complex pipeline of threads, queues, and buffers. Overload in any thread of this pipeline—from vCPUs to kernel network datapath threads—can lead to queue and buffer saturation, which directly results in packet drops and a reduction in overall throughput.
  • Consequently, monitoring both vCPU and kernel network datapath thread CPU usage is essential, in addition to standard network counters.
  • A method to assess thread overload, without needing to check every thread's utilization, is to track the maximum utilization of the relevant threads and count the threads that exceed 90% utilization; a short sketch of this calculation follows the list below.
  • Key CPU Utilization Metrics for Bottleneck Identification:
    • vCPUs: Focus on "Maximum utilization for a single vCPU" and the count of "vCPUs over 90% utilization."
    • Datapath CPUs (Kernel Network Threads): Track the "Maximum utilization for a single datapath CPU" and the count of "Datapath CPUs over 90% utilization."
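
The following minimal sketch, using hypothetical per-thread utilization samples, shows the assessment described above: report the maximum utilization across the relevant threads and count how many exceed the 90% threshold. The thread names and values are illustrative only.

# Minimal sketch: summarize one interval of per-thread CPU utilization (percent)
# by reporting the maximum value and the number of threads over 90%.
OVERLOAD_THRESHOLD_PCT = 90.0

def summarize_thread_load(utilization_pct_by_thread):
    max_util = max(utilization_pct_by_thread.values())
    overloaded = [name for name, util in utilization_pct_by_thread.items()
                  if util > OVERLOAD_THRESHOLD_PCT]
    return max_util, len(overloaded)

# Hypothetical samples: the vCPUs of one VM and two kernel datapath threads.
vcpu_util = {"vcpu0": 97.5, "vcpu1": 42.0, "vcpu2": 91.2, "vcpu3": 15.8}
datapath_util = {"datapath-0": 88.1, "datapath-1": 95.4}

for label, samples in (("vCPU", vcpu_util), ("Datapath CPU", datapath_util)):
    max_util, count = summarize_thread_load(samples)
    print(f"{label}: max utilization {max_util:.1f}%, "
          f"{count} thread(s) over {OVERLOAD_THRESHOLD_PCT:.0f}%")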

Ready Time and Host CPU Utilization:

  • Relying only on the reported utilization of individual vCPUs or network threads can be misleading, particularly on an overcommitted host. In such a scenario, the underlying competition for physical CPU time can delay the scheduling of these individual threads.
  • As a result, vCPUs and network threads may not receive adequate CPU time, leading to a low reported CPU usage that doesn't reflect the underlying performance problem.
  • Ready time quantifies how long a vCPU or thread must wait before it can be executed on a physical CPU core. Elevated ready time signifies physical CPU contention, a condition typically linked to high host CPU utilization and the resulting scarcity of host CPU resources. This CPU starvation subsequently leads to network buffer saturation and, ultimately, packet drops; a worked sketch of these ready-time calculations follows this list.
  • Key Metrics for Host CPU Contention Identification:
    • Host CPU Utilization: This metric measures the host CPU's average utilization.
    • Maximum Ready Time Percent: Metrics such as "Maximum ready time percent for vCPU" and "Maximum ready time percent for Datapath CPU" indicate the highest reported contention for these respective components.
    • Ready Time Over 1ms: This metric monitors the number of vCPUs and threads whose ready time exceeded the 1ms threshold. The specific metrics are "vCPUs with ready time over 1ms" and "Datapath CPUs with ready time over 1ms."
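
The following sketch relates raw ready-time counters to the metrics above. It assumes the counter is a millisecond summation over a 20-second collection interval (as with the vSphere "ready" counter) and applies the 1ms threshold per interval; the interval length and the exact way VCF Operations evaluates the threshold may differ, so treat this as an illustration only.

# Sketch: derive "maximum ready time percent" and "vCPUs with ready time over
# 1ms" from hypothetical per-vCPU ready-time summations for one interval.
SAMPLE_INTERVAL_MS = 20_000   # assumed 20-second collection interval
READY_THRESHOLD_MS = 1.0      # the 1 ms threshold referenced above

def ready_percent(ready_ms):
    # Share of the interval the vCPU/thread spent waiting for a physical core.
    return ready_ms / SAMPLE_INTERVAL_MS * 100.0

# Hypothetical ready-time summations (ms) for the vCPUs of one VM.
ready_ms_by_vcpu = {"vcpu0": 850.0, "vcpu1": 12.0, "vcpu2": 0.4, "vcpu3": 2100.0}

max_ready_pct = max(ready_percent(ms) for ms in ready_ms_by_vcpu.values())
over_threshold = sum(1 for ms in ready_ms_by_vcpu.values() if ms > READY_THRESHOLD_MS)

print(f"Maximum ready time percent for vCPU: {max_ready_pct:.2f}%")
print(f"vCPUs with ready time over 1ms: {over_threshold}")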

Drop Counters Due to Performance Overloads:

  • Drops stemming from performance overloads occur when a receiving component is too busy to process incoming data, causing its Receive (RX) buffers to fill up; a sketch for flagging such drops follows this list.
    • Busy vCPUs (VNIC RX): Packets are dropped at the Virtual NIC (VNIC) when the assigned vCPUs are excessively busy. This is indicated by a non-zero "Rx Dropped PPS %" in the Aggregated VNIC Statistics.
    • Busy Kernel Network Threads (VMKNIC RX): Drops occur at the VMKernel NIC (VMKNIC) when the kernel threads responsible for handling its RX buffers are busy. This is reflected by a non-zero "Rx Dropped PPS %" in the VMKernel NIC Statistics.
    • Busy Kernel Network Threads (PNIC RX): Packets are dropped at the Physical NIC (PNIC) when the kernel threads managing its RX buffers are busy. This is shown in "Rx Dropped PPS %" in the PNIC Statistics.
    • Busy Kernel Network Threads (PNIC TX): Packets coming from VMKNICs or VNICs are dropped when the kernel threads managing the PNIC's Transmit (TX) queues are busy. This is accounted for in "Tx Dropped PPS %" in the PNIC Statistics.
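
The following sketch shows one plausible way to flag overload-related drops across the interfaces listed above, assuming per-interval counters for dropped and successfully passed packets per second are available. The interface names and the percentage derivation are illustrative; where VCF Operations exposes the "Rx/Tx Dropped PPS %" metrics directly, use those values as-is.

# Sketch: compute a dropped-packets percentage per interface/direction and flag
# any non-zero value as a sign that the threads feeding that queue are overloaded.
def dropped_pps_percent(dropped_pps, passed_pps):
    total = dropped_pps + passed_pps
    return 0.0 if total == 0 else dropped_pps / total * 100.0

# Hypothetical samples: (dropped pps, passed pps) keyed by interface and direction.
samples = {
    ("VNIC app-vm-01/eth0", "RX"): (150.0, 48_000.0),
    ("VMKNIC vmk0", "RX"): (0.0, 9_500.0),
    ("PNIC vmnic2", "RX"): (20.0, 310_000.0),
    ("PNIC vmnic2", "TX"): (0.0, 280_000.0),
}

for (interface, direction), (dropped, passed) in samples.items():
    pct = dropped_pps_percent(dropped, passed)
    if pct > 0.0:
        print(f"{interface} {direction}: {pct:.3f}% dropped - "
              f"check the CPU load of the threads serving this queue")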

Resolution

  • When high vCPU/datapath CPU utilization is the core problem, the most straightforward approach is to increase the number of VMs or vCPUs, as well as the number of datapath threads (kernel network threads). Note that increasing the number of VMs and vCPUs may require careful planning and potential modifications at the application level.
  • For instructions on how to increase the number of datapath threads, please refer to the following resource: KB312057
  • High CPU contention that persists even with DRS enabled frequently points to inadequate physical compute capacity within the cluster. The definitive solution is to add hosts to the cluster, which resolves the resource deficiency and allows DRS to re-establish optimal resource distribution, eliminating the persistent high CPU contention.
  • VCF Networking offers multiple "host switch modes." Enhanced Datapath (EDP) generally provides superior performance. EDP has two modes. EDP Standard is recommended for typical enterprise workloads, while EDP Dedicated can be used for applications with more stringent performance requirements. EDP Dedicated requires a thorough initial analysis of performance needs, as well as ongoing analysis and adjustments to the configuration as workloads evolve.
  • Optimal EDP Dedicated performance relies on careful resource planning. This is crucial because a portion of the performance improvement is achieved by dedicating a fixed amount of CPU resources exclusively to network processing. Improper resource provisioning can adversely affect the overall system performance. Under-provisioning might cause a performance drop, while over-provisioning can deprive application VMs of necessary CPU cycles.
  • Refer to KB402229 for details on how to enable EDP Standard mode.