- Network performance issues, such as packet drops, low throughput, or high latency, typically indicate underlying system limitations. In a virtualized Software-Defined Networking (SDN) environment, the CPU is the most critical shared resource; high-speed Network Interface Cards (NICs) ensure that network bandwidth is rarely the primary bottleneck.
- Troubleshooting network performance issues is complex. The following sections give general guidelines and key metrics for common problems, which are often caused by high vCPU or datapath CPU utilization or by significant physical CPU contention.
vCPU and Datapath CPU Utilization:
- The packet processing workflow involves a complex pipeline of threads, queues, and buffers. Overload in any thread of this pipeline—from vCPUs to kernel network datapath threads—can lead to queue and buffer saturation, which directly results in packet drops and a reduction in overall throughput.
- Consequently, monitoring the CPU usage of both vCPUs and kernel network datapath threads is essential, in addition to standard network counters.
- One way to assess thread overload without checking every thread's utilization is to track the maximum utilization among the relevant threads and count the threads that exceed 90% utilization.
Key CPU Utilization Metrics for Bottleneck Identification:
- vCPUs: Focus on "Maximum utilization for a single vCPU" and the count of "vCPUs over 90% utilization."
- Datapath CPUs (Kernel Network Threads): Track the "Maximum utilization for a single datapath CPU" and the count of "Datapath CPUs over 90% utilization."
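The two summary metrics above can be sketched in a few lines. This is a minimal illustration, assuming per-thread utilization samples are already available as percentages; the function name, the dictionary keys, and the sample values are hypothetical and do not come from any product API.

```python
from typing import Dict, Sequence

OVERLOAD_THRESHOLD_PCT = 90.0  # the 90% threshold named in the text


def summarize_utilization(samples_pct: Sequence[float],
                          threshold_pct: float = OVERLOAD_THRESHOLD_PCT) -> Dict[str, float]:
    """Return the maximum utilization and the count of threads above the threshold."""
    if not samples_pct:
        return {"max_pct": 0.0, "over_threshold": 0}
    return {
        "max_pct": max(samples_pct),
        # "Exceed" is read as strictly greater than the threshold.
        "over_threshold": sum(1 for u in samples_pct if u > threshold_pct),
    }


# Hypothetical per-thread utilization samples (percent), for illustration only.
vcpu_util = [35.0, 97.5, 41.0, 92.1]
datapath_util = [60.0, 88.0]

vcpu_summary = summarize_utilization(vcpu_util)
datapath_summary = summarize_utilization(datapath_util)
```

With these made-up samples, two vCPUs are above the threshold while no datapath CPU is, which would point toward the guest side of the pipeline rather than the kernel datapath.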
Ready Time and Host CPU Utilization:
- Relying only on the reported utilization of individual vCPUs or network threads can be misleading, particularly on an overcommitted host. In such a scenario, the underlying competition for physical CPU time can delay the scheduling of these individual threads.
- As a result, vCPUs and network threads may not receive adequate CPU time, leading to a low reported CPU usage that doesn't reflect the underlying performance problem.
- Ready time quantifies the duration a vCPU or thread must wait before it can be executed on a physical CPU core. Elevated ready time signifies physical CPU contention, a condition typically linked to high Host CPU utilization due to the limited availability of host CPU resources. This CPU starvation subsequently leads to network buffer saturation and, ultimately, packet drops.
Key Metrics for Host CPU Contention Identification:
- Host CPU Utilization: This metric measures the host CPU's average utilization.
- Maximum Ready Time Percent: Metrics such as "Maximum ready time percent for vCPU" and "Maximum ready time percent for Datapath CPU" indicate the highest reported contention for these respective components.
- Ready Time Over 1ms: This metric counts the vCPUs and threads whose ready time exceeded the 1ms threshold. The specific metrics are "vCPUs with ready time over 1ms" and "Datapath CPUs with ready time over 1ms."
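The ready-time metrics above can be sketched in the same style. This is only an illustration under stated assumptions: ready time percent is taken to be ready time divided by the sampling interval (the text does not define it), and the function name, dictionary keys, and sample values are hypothetical.

```python
from typing import Dict, Sequence

READY_THRESHOLD_MS = 1.0  # the 1ms threshold named in the text


def summarize_ready_time(ready_ms: Sequence[float], interval_ms: float) -> Dict[str, float]:
    """Summarize ready time for a group of vCPUs or datapath threads.

    ready_ms    -- ready time each vCPU/thread accumulated in one sampling interval
    interval_ms -- length of that interval; assumed basis for the percentage
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    if not ready_ms:
        return {"max_ready_pct": 0.0, "over_1ms": 0}
    return {
        # Assumed definition: share of the interval spent waiting for a physical core.
        "max_ready_pct": max(ready_ms) / interval_ms * 100.0,
        "over_1ms": sum(1 for r in ready_ms if r > READY_THRESHOLD_MS),
    }


# Hypothetical ready times (ms) over a hypothetical 20 ms interval.
ready_summary = summarize_ready_time([0.25, 5.0, 1.5], interval_ms=20.0)
```

Reading this summary next to the utilization summary is the point of the section: a vCPU that reports low utilization but appears in the over-1ms count is more likely starved by host contention than idle.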
Drop Counters Due to Performance Overloads:
- Drops stemming from performance overloads occur when a receiving component is too busy to process incoming data, causing its Receive (RX) buffers to fill up.
- Busy vCPUs (VNIC RX): Packets are dropped at the Virtual NIC (VNIC) when the assigned vCPUs are excessively busy. This is indicated by a non-zero "Rx Dropped PPS %" in the Aggregated VNIC Statistics.
- Busy Kernel Network Threads (VMKNIC RX): Drops occur at the VMKernel NIC (VMKNIC) when the kernel threads responsible for handling its RX buffers are busy. This is reflected by a non-zero "Rx Dropped PPS %" in the VMKernel NIC Statistics.
- Busy Kernel Network Threads (PNIC RX): Packets are dropped at the Physical NIC (PNIC) when the kernel threads managing its RX buffers are busy. This is indicated by a non-zero "Rx Dropped PPS %" in the PNIC Statistics.
- Busy Kernel Network Threads (PNIC TX): Packets coming from VMKNICs or VNICs are dropped when the kernel threads managing the PNIC's Transmit (TX) queues are busy. This is indicated by a non-zero "Tx Dropped PPS %" in the PNIC Statistics.
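The mapping from drop counters to likely overloaded components can be expressed as a small lookup. This is a hedged sketch, not a product API: the `(group, direction)` keys, the function name, and the sample percentages are hypothetical stand-ins for the "Rx Dropped PPS %" and "Tx Dropped PPS %" values read from the statistics listed above.

```python
from typing import Dict, List, Tuple

# (statistics group, direction) -> likely overloaded component, per the list above.
DROP_CAUSES: Dict[Tuple[str, str], str] = {
    ("vnic", "rx"): "busy vCPUs (VNIC RX)",
    ("vmknic", "rx"): "busy kernel network threads (VMKNIC RX)",
    ("pnic", "rx"): "busy kernel network threads (PNIC RX)",
    ("pnic", "tx"): "busy kernel network threads (PNIC TX)",
}


def triage_drops(dropped_pct: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the likely causes for every drop counter that is non-zero."""
    return [cause for key, cause in DROP_CAUSES.items()
            if dropped_pct.get(key, 0.0) > 0.0]


# Hypothetical dropped-PPS percentages, for illustration only.
observed = {("vnic", "rx"): 0.0, ("pnic", "rx"): 0.4, ("pnic", "tx"): 0.1}
likely_causes = triage_drops(observed)
```

A non-empty result only narrows down where to look; the CPU utilization and ready-time metrics for the named component are still needed to confirm that it is overloaded.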