Experiencing Storage Latency



Article ID: 393266


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Storage latency can cause VM performance issues and disk slowness.

When you are experiencing storage latency, you may encounter the following errors in vCenter:

  • Path redundancy to storage device naa.################################ degraded. Path vmhba#:C#:T#:L# is down. Affected datastores: Datastore1.

  • Lost access to volume ########-########-####-############# (Datastore1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
  • WARNING: ScsiDeviceIO: 1513: Device naa.################################ performance has deteriorated. I/O latency increased from average value of #### microseconds to ##### microseconds.

When you check the host logs in /var/run/log/vmkernel.log you may see the following errors: 

  • Cmd 0x# (0x#####, #####) to dev "naa.################################" on path "vmhba#:C#:T#:L#" Failed:
     H:0x5 D:0x0 P:0x0 Aborted at driver layer.
    • This status is returned if the driver has to abort commands in-flight to the target. This can occur due to a command timeout or a parity error in the frame.
  • Cmd (0x#) 0x#, cmdId.initiator=0x######## CmdSN 0x#### from world ####### to dev "naa.################################" failed H:0x8 D:0x0 P:0x0 Cancelled from driver layer
    • This status is returned when the HBA driver has aborted the I/O. It can also occur if the HBA does a reset of the target.
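To gauge how often these aborts are occurring, you can count the H:0x5 (aborted at driver layer) and H:0x8 (cancelled from driver layer) host-status codes in vmkernel.log. The sketch below runs against a hypothetical sample file so it is self-contained; on a real host you would point the same grep commands at /var/run/log/vmkernel.log (device names, world IDs, and timestamps here are placeholders, not real output):

```shell
# Hypothetical sample lines modeled on the messages above.
# On a real ESXi host, grep /var/run/log/vmkernel.log instead.
cat > /tmp/vmkernel.sample.log <<'EOF'
2024-01-01T00:00:01Z cpu0: Cmd 0x2a (0x45b8, 12345) to dev "naa.600000000000000000000000000000aa" on path "vmhba1:C0:T0:L0" Failed: H:0x5 D:0x0 P:0x0 Aborted at driver layer.
2024-01-01T00:00:02Z cpu1: ScsiDeviceIO: command completed normally H:0x0 D:0x0 P:0x0
2024-01-01T00:00:03Z cpu0: Cmd (0x2a) 0x12, cmdId.initiator=0x43049e CmdSN 0x9f from world 65821 to dev "naa.600000000000000000000000000000aa" failed H:0x8 D:0x0 P:0x0 Cancelled from driver layer
EOF

# Count in-flight aborts (H:0x5) and driver cancellations (H:0x8)
grep -c 'H:0x5' /tmp/vmkernel.sample.log
grep -c 'H:0x8' /tmp/vmkernel.sample.log
```

A sudden spike in either count, clustered in time, is a stronger signal than occasional isolated entries.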
       

When you check the host logs in /var/run/log/vobd.log you may see the following error: 

  • Uplink: vmnic# is down. Affected dvPort: ##/50 24 e2 d9 41 e2 48 58-## ## ## ## ## ## ## ##. 3 uplinks up. Failed criteria: ###
    • This message indicates a network failure on that NIC. Because software iSCSI (SWISCSI) passes iSCSI traffic over an IP network, and Converged Network Adapters (CNAs) combine network and Fibre Channel functionality on the same card, network errors/events may be pertinent to storage errors/events as well.
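To confirm which uplinks are down, you can check NIC link status on the host (the real command is `esxcli network nic list`). The sketch below parses a hypothetical, simplified extract of that output so it is self-contained; the vmnic names and link speeds are placeholders:

```shell
# Hypothetical extract modeled on `esxcli network nic list` output:
# columns are NIC name, link state, link speed (Mbps).
cat > /tmp/nics.sample.txt <<'EOF'
vmnic0 Up   10000
vmnic1 Down 0
vmnic2 Up   10000
EOF

# Report any NIC whose link state is Down
awk '$2 == "Down" { print $1 " link is down" }' /tmp/nics.sample.txt
```

If the down uplink carries iSCSI or NFS traffic, the network failure and the storage latency are likely related.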

 

When you test performance with esxtop, you find that the DAVG is high

  • If you run esxtop to test storage performance and can catch the latency, you find that the Device Average (DAVG) jumps above 20 ms. In esxtop, DAVG is the time it takes for an I/O to be sent from the host, through the SAN, to the array, and acknowledged back.
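One way to catch intermittent latency is to run esxtop in batch mode on the host (for example, `esxtop -b -d 2 -n 30 > /tmp/esxtop.csv` captures 30 samples at 2-second intervals) and then scan the capture for DAVG values above 20 ms. The sketch below works on a hypothetical, simplified two-column extract (timestamp, DAVG in ms) rather than the full esxtop CSV, whose column layout varies by host:

```shell
# Hypothetical simplified extract of DAVG samples (timestamp, DAVG in ms).
cat > /tmp/davg.sample.csv <<'EOF'
ts,davg_ms
00:00:02,3.1
00:00:04,27.8
00:00:06,41.2
00:00:08,5.0
EOF

# Count samples where DAVG exceeds the 20 ms threshold
awk -F, 'NR > 1 && $2 > 20 { count++ } END { print count " samples above 20 ms" }' /tmp/davg.sample.csv
```

Sustained runs of samples above the threshold point at the device/array side, which matches the DAVG definition above.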

 

Environment

  • VMware vSphere ESXi 8.0.x
  • VMware vSphere ESXi 7.0.x
  • VMware vSphere ESXi 6.7.x

Cause

Since the DAVG is high and we are seeing in-flight aborts and driver aborts, we know that the latency is coming from the NIC/HBA driver or firmware, the fabric/network, or the array itself.

Resolution

To find the exact source, you may need to reach out to the storage and network vendors. However, to narrow down the search, first check:

  • Are the NIC and HBA drivers compatible, and are you running the correct firmware for those drivers?
  • Are the SFP modules and cables having hardware issues?
  • Are the switches dropping any packets or seeing the aborts?
  • Are we seeing any hardware failures on the array?
  • If you are receiving the Path Redundancy to Storage Device Degraded message on the ESXi host then you may be able to narrow down your network investigation by using the pathing information to determine if one side of the fabric/network is failing. 
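For the last point, grouping dead paths by adapter can show whether failures are confined to one side of the fabric (the real command to list paths is `esxcli storage core path list`). The sketch below parses a hypothetical, simplified extract of path names and states; the vmhba/target numbers are placeholders:

```shell
# Hypothetical extract modeled on `esxcli storage core path list` output:
# path name (vmhbaX:C#:T#:L#) and path state.
cat > /tmp/paths.sample.txt <<'EOF'
vmhba1:C0:T0:L0 active
vmhba1:C0:T1:L0 dead
vmhba2:C0:T0:L0 active
vmhba2:C0:T1:L0 active
EOF

# Count dead paths per adapter: if all failures land on one vmhba,
# suspect that adapter's side of the fabric/network.
awk '$2 == "dead" { dead[substr($1,1,6)]++ } END { for (a in dead) print a, dead[a] }' /tmp/paths.sample.txt
```

Here all dead paths sit behind vmhba1, so the investigation would focus on that adapter, its cabling, and the switch ports it connects to.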

Check these items, and if you cannot find the source of the latency on your own, contact your network and storage vendors to have them investigate.

Additional Information