ESXi device latency with "performance has deteriorated" messages in ESXi host logs

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms

Devices report high latency, and virtual machines may become unresponsive during I/O operations.
Virtual machines may appear to "freeze" or experience brief pauses. In severe cases virtual disks may disconnect or guest OS filesystems may be marked as read-only.
Log messages show increased I/O latency.
Storage latency alerts have been triggered at the vCenter level.
High I/O wait is observed at the application level inside the virtual machines.

The /var/run/log/vmkernel.log shows "performance has deteriorated" or "I/O latency increased" messages:

[YYYY-MM-DDTHH:MM:SS] cpu51:2098041)WARNING: ScsiDeviceIO: 513: Device naa.########## performance has deteriorated. I/O latency increased from average value of 38762 microseconds to 776315 microseconds.
[YYYY-MM-DDTHH:MM:SS] cpu47:2098037)WARNING: ScsiDeviceIO: 1443: Device naa.######### performance has deteriorated. I/O latency increased from average value of 12017 microseconds to 254228 microseconds.
[YYYY-MM-DDTHH:MM:SS] cpu47:2098038)WARNING: ScsiDeviceIO: 1216: Device naa.######### performance has deteriorated. I/O latency increased from average value of 18057 microseconds to 534229 microseconds.
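
To review many of these warnings at once, the device ID and both latency values can be extracted from each line. The following is a minimal sketch, assuming the message text matches the samples above; the function name parse_deteriorated and the device ID naa.example are illustrative and not part of ESXi.

```python
import re

# Pattern for the "performance has deteriorated" warning shown above.
DETERIORATED = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg_us>\d+) microseconds "
    r"to (?P<new_us>\d+) microseconds"
)

def parse_deteriorated(line):
    """Return (device, average_us, new_us), or None if the line does not match."""
    m = DETERIORATED.search(line)
    if m is None:
        return None
    return m["device"], int(m["avg_us"]), int(m["new_us"])

# Illustrative line built from the first sample above (placeholder device ID).
sample = (
    "cpu51:2098041)WARNING: ScsiDeviceIO: 513: Device naa.example "
    "performance has deteriorated. I/O latency increased from average value "
    "of 38762 microseconds to 776315 microseconds."
)
device, avg_us, new_us = parse_deteriorated(sample)
print(device, avg_us / 1000, new_us / 1000)  # latencies converted to milliseconds
```

Dividing the new value by the average value gives a rough sense of how sharp the spike was relative to the baseline for that device.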

Environment

VMware vSphere ESXi 6.7.x
VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x
VMware vSphere ESXi 9.x

Cause

The warning may be logged when either of the following is true:

a) The latency ratio is significantly higher than at the previous log entry
b) The latency ratio has doubled since the last log entry was written

The device latency may increase due to any of the following reasons:

Changes made to the target
Disk or media failures
Overload conditions on the device
A failover event

The numbers reported in the events are in microseconds and correspond to the DAVG measurements in the esxtop storage screen. See Using esxtop to identify storage performance issues for ESXi.

With traditional non-flash storage, the generally accepted DAVG latency threshold is about 10 milliseconds (10,000 microseconds).

With flash-based storage it is rare to see DAVG latency above 1-2 milliseconds, so these events should be investigated if the latency is higher.
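
The guideline values above can be applied directly to the microsecond figures in the log messages. The following is a rough sketch, assuming 10 ms for non-flash storage and the upper end of the 1-2 ms range for flash; the constant and function names are illustrative, and the thresholds are guidelines rather than fixed limits.

```python
# Guideline thresholds from the text above, expressed in microseconds.
NON_FLASH_THRESHOLD_US = 10_000  # about 10 ms for non-flash storage
FLASH_THRESHOLD_US = 2_000       # upper end of the 1-2 ms range for flash

def exceeds_threshold(latency_us, flash):
    """Return True if a DAVG value in microseconds is above the guideline."""
    limit = FLASH_THRESHOLD_US if flash else NON_FLASH_THRESHOLD_US
    return latency_us > limit

# 38762 microseconds is the average value from the first sample message.
print(exceeds_threshold(38762, flash=False))
print(exceeds_threshold(1500, flash=True))
```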

Latency is a measure of the round-trip time between the issuance of a SCSI command from the hypervisor, through the transport to the surface of the media, and the return. Therefore, the source of the delay could be anywhere in the fabric, the storage infrastructure, or anywhere along the storage path.

Resolution

To collect LUN-level device performance statistics, use the esxtop utility. See Using esxtop to identify storage performance issues for ESXi.

High device latency:

If the device latency stays high for a sustained period, check the storage performance. If failures are logged on the storage array side, contact the storage vendor for further assistance.
Check whether these messages are generated during scheduled tasks such as backups or replications, as these can cause intermittent performance problems.

Overload conditions on the device:
If the message is generated because of an overload condition, reduce the load on the affected storage device.

Latency duration:
In the vmkernel.log, check how much time elapsed between the "performance has deteriorated" log entry and the corresponding "performance has improved" log entry. The two entries can usually be paired because the value that the latency increased to in the "deteriorated" entry matches the value that the latency improved from in the "improved" entry. The device ID (naa, eui, etc.) will also match. The higher the recorded latency and the longer the time between the deteriorated and improved entries, the greater the potential impact on virtual machines.
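
The pairing described above can be sketched as follows. Two things are assumptions and should be checked against the actual log: that each line begins with an ISO 8601 timestamp, and that the recovery message reads "performance has improved. I/O latency reduced from X microseconds to Y microseconds". The function name latency_windows is illustrative.

```python
import re
from datetime import datetime

# ASSUMPTION: lines start with an ISO 8601 timestamp, and the recovery message
# uses the wording "I/O latency reduced from X microseconds to Y microseconds".
# Adjust the pattern if the real log differs.
EVENT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\S*\s.*"
    r"Device (?P<device>\S+) performance has (?P<kind>deteriorated|improved)\. "
    r"I/O latency \w+ from (?:average value of )?(?P<old>\d+) microseconds "
    r"to (?P<new>\d+) microseconds"
)

def latency_windows(lines):
    """Yield (device, peak_us, seconds) for each deteriorated/improved pair."""
    open_events = {}  # (device, value the latency increased to) -> start time
    for line in lines:
        m = EVENT.search(line)
        if m is None:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S")
        if m["kind"] == "deteriorated":
            open_events[(m["device"], int(m["new"]))] = ts
        else:
            # Pair on device ID plus the matching "increased to" value.
            start = open_events.pop((m["device"], int(m["old"])), None)
            if start is not None:
                yield m["device"], int(m["old"]), (ts - start).total_seconds()
```

Passing an open vmkernel.log file object to latency_windows yields one tuple per matched pair; "improved" entries with no matching "deteriorated" entry are skipped.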

Additional Information

Use the following framework to characterize the latency:

1) Magnitude: How high are the spikes in DAVG?

2) Duration: How long does each spike last?

3) Frequency: What pattern is exhibited by the date/time stamps?

4) Scope: How widespread are the events?

On one datastore, or multiple datastores?
On one ESXi host, or multiple ESXi hosts?
In one HA/DRS Cluster, or multiple clusters?

Spikes of limited magnitude (for example, 20-30 ms) that last only a few seconds, occur occasionally, and affect a small subset of datastores are a vastly different situation from spikes of multiple seconds that last for multiple minutes.
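
The four dimensions above can be summarized from a list of collected events. The following is a minimal sketch, assuming each event has already been reduced to a (host, device, peak latency in microseconds, duration in seconds) tuple; the function name characterize, the tuple layout, and the sample values are hypothetical.

```python
def characterize(events):
    """Summarize (host, device, peak_us, duration_s) tuples along the framework."""
    events = list(events)
    return {
        "max_peak_ms": max((e[2] for e in events), default=0) / 1000,  # magnitude
        "max_duration_s": max((e[3] for e in events), default=0),      # duration
        "event_count": len(events),                                    # frequency
        "devices_affected": len({e[1] for e in events}),               # scope
        "hosts_affected": len({e[0] for e in events}),                 # scope
    }

# Hypothetical values for illustration only.
summary = characterize([
    ("esx01", "naa.a", 25_000, 3),
    ("esx01", "naa.a", 30_000, 5),
    ("esx02", "naa.b", 2_000_000, 300),
])
print(summary)
```

The event count is only a crude stand-in for frequency; the timestamps themselves are still needed to see whether events cluster around scheduled tasks.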

Finally, note that ESXi does not cause the latency spikes; it merely reports them. The root cause cannot be determined from the ESXi perspective alone. However, the data outlined above can help guide the investigation outside of the ESXi hosts.