Virtual Machines Experience Increased Storage Latency and Slow IO Due to vSAN Network Congestion During Backup Activity



Article ID: 415001



Products

VMware vSAN

Issue/Introduction

Symptoms:

  • Virtual machines residing on vSAN experience increased latency during a recurring weekly time interval when backend backup jobs are running simultaneously.
  • Increased DOM Client latency is observed.

  • DOM Client Outstanding IO (OIO) increases during the affected timeframe, indicating elevated IO activity.

  • No alerts are reported in vSAN Skyline Health.

Environment

  • VMware vSAN 8.x
  • VMware vSAN 9.x

Cause

  • Based on the available vSAN performance metrics, I/O Trip Analyzer data, RDT statistics, and ESXi TCP stack statistics, the issue is caused by temporary network congestion affecting vSAN communication between ESXi hosts during periods of high virtual machine I/O activity.
  • During the affected timeframe, a sudden increase in virtual machine I/O and backup activity causes high network throughput, approaching the maximum capacity of the uplink. As the network becomes saturated, packet drops, retransmissions, and TCP communication delays are observed.
  • In a vSAN environment, storage I/O requires communication between multiple ESXi hosts. When the network experiences congestion or retransmissions, remote vSAN I/O operations become delayed. These delays force the vSAN layer into repeated retry operations, which increases DOM Client latency and Outstanding IO (OIO).
  • As a result, virtual machines can experience slow storage performance, increased latency, slow reads or writes, and delayed transaction processing.
  • The following observations support this behavior during the issue window:
    • Increased Rx and Tx network throughput is observed on the ESXi hosts.

    • High RDT latency is reported, indicating the underlying network is unable to handle the traffic load.

    • TCP congestion indicators, retransmissions, and packet drops are observed from the ESXi TCP stack statistics.
    • tcpSndZeroWin events are observed, indicating temporary receiver-side buffer exhaustion.

    • vSAN I/O Trip Analyzer reports latency at the network layer, indicating network congestion or network hardware limitations. To use the I/O Trip Analyzer, refer to "Use vSAN I/O Trip Analyzer".
    • For example, in an environment with underlying network layer issues, the affected areas are reported in red by the I/O Trip Analyzer.

    • When the red icon is selected, additional details identify the affected layer. In this example, the issue is identified at the networking layer.

    • The following TCP stack statistics indicate network communication stress during the affected timeframe:

# vsish -e cat /net/tcpip/instances/defaultTcpipStack/stats/tcp
drops: 204639 --> TCP connections dropped at the network stack level
conndrops: 193 --> Half-open TCP connections dropped
timeoutdrop: 8 --> TCP retransmission timeout drops

These observations indicate that temporary network saturation and retransmission behavior are impacting vSAN I/O performance and virtual machine latency.
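
Because counters of this kind are normally cumulative, a single reading does not show whether drops occurred during the issue window. The following is a minimal Python sketch, not a Broadcom tool, that compares two saved snapshots of the command output and reports how much each counter grew. It assumes the output contains `name: value` lines like those shown above; the function names `parse_tcp_stats` and `counter_deltas` and the "before" values are invented for illustration.

```python
import re

# Matches "name: value" or "name:value"; anything after the number
# (such as a "--> ..." annotation) is ignored.
COUNTER_RE = re.compile(r"^\s*(\w+)\s*:\s*(\d+)")


def parse_tcp_stats(text):
    """Return {counter_name: int} for every 'name: value' line in text."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_deltas(before, after, names=("drops", "conndrops", "timeoutdrop")):
    """Return how much each named counter grew between two snapshots."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in names}


# Illustrative snapshots: the "after" values are the ones shown above,
# the "before" values are invented for this example.
before_text = """
drops: 204000
conndrops: 190
timeoutdrop: 8
"""
after_text = """
drops: 204639 --> TCP connections dropped at the network stack level
conndrops: 193 --> Half-open TCP connections dropped
timeoutdrop: 8 --> TCP retransmission timeout drops
"""

deltas = counter_deltas(parse_tcp_stats(before_text), parse_tcp_stats(after_text))
print(deltas)
```

Counters that grow only between snapshots taken at the start and end of the backup window, and stay flat outside it, are consistent with the congestion described above.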

Resolution

Identify the guest operating system, application, or backup workload generating the high I/O activity during the affected timeframe, as elevated virtual machine I/O can contribute to vSAN network saturation and increased storage latency.
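
One way to narrow down the workload is to rank virtual machines by their average IOPS inside the affected window. The sketch below assumes per-VM IOPS samples have already been exported as `(timestamp, VM name, IOPS)` tuples; that data layout, the VM names, and the function `top_io_vms` are hypothetical and do not correspond to any specific vSAN export format.

```python
from collections import defaultdict
from datetime import datetime


def top_io_vms(samples, window_start, window_end, limit=3):
    """Rank VMs by mean IOPS over samples inside [window_start, window_end)."""
    totals = defaultdict(lambda: [0.0, 0])  # vm -> [sum of IOPS, sample count]
    for timestamp, vm_name, iops in samples:
        if window_start <= timestamp < window_end:
            totals[vm_name][0] += iops
            totals[vm_name][1] += 1
    means = {vm: total / count for vm, (total, count) in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:limit]


# Invented samples: (timestamp, VM name, IOPS)
samples = [
    (datetime(2025, 1, 5, 1, 0), "backup-proxy-01", 9000),
    (datetime(2025, 1, 5, 1, 5), "backup-proxy-01", 11000),
    (datetime(2025, 1, 5, 1, 0), "db-01", 2500),
    (datetime(2025, 1, 5, 1, 5), "db-01", 3500),
    (datetime(2025, 1, 5, 1, 0), "web-01", 300),
    (datetime(2025, 1, 5, 9, 0), "web-01", 8000),  # outside the window, ignored
]

ranking = top_io_vms(samples, datetime(2025, 1, 5, 0, 0), datetime(2025, 1, 5, 4, 0))
for vm_name, mean_iops in ranking:
    print(f"{vm_name}: {mean_iops:.0f} IOPS")
```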

Engage the Network team to investigate the observed network congestion, packet drops, retransmissions, and increased throughput on the vSAN network. Validate that no network bottlenecks are present during peak workload or backup activity windows.
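
When reviewing throughput with the network team, it can help to express the observed traffic as a fraction of the uplink's nominal speed. The following sketch computes average utilization for one direction of a link from two byte-counter readings; the function name `uplink_utilization` and the sample numbers are invented for illustration, and the counter source is left to whatever monitoring tool is in use.

```python
def uplink_utilization(bytes_start, bytes_end, interval_seconds, link_speed_gbps):
    """Return average utilization (0.0-1.0) of one direction of a link."""
    if interval_seconds <= 0 or link_speed_gbps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_transferred = (bytes_end - bytes_start) * 8
    capacity_bits = link_speed_gbps * 1_000_000_000 * interval_seconds
    return bits_transferred / capacity_bits


# Invented numbers: 67.5 GB received in 60 s on a 10 Gb/s uplink.
utilization = uplink_utilization(0, 67_500_000_000, 60, 10)
print(f"{utilization:.0%}")
```

An average over a long interval can hide short bursts that saturate the link, so shorter sampling intervals give a more reliable picture during the backup window.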

If the issue persists, collect the vSAN performance data by following the steps documented in Collecting vSAN Performance Service data for vSAN performance issues and contact Broadcom Support for further