ESXi device latency with "performance has deteriorated" messages in ESXi host logs

Products

VMware vSphere ESXi

Issue/Introduction

Devices are reporting high latency, causing virtual machines to become unresponsive during I/O operations
Increased I/O latency, reflected in log messages with microsecond-level values, has been observed
Storage latency alerts have been triggered at the vCenter level.

The /var/run/log/vmkernel.log shows "performance has deteriorated" or "I/O latency increased"

[YYYY-MM-DDTHH:MM:SS] cpu51:2098041)WARNING: ScsiDeviceIO: 513: Device naa.######################### performance has deteriorated. I/O latency increased from average value of 38762 microseconds to 776315 microseconds. [YYYY-MM-DDTHH:MM:SS] cpu47:2098037)WARNING: ScsiDeviceIO: 1443: Device naa.######################### performance has deteriorated. I/O latency increased from average value of 12017 microseconds to 254228 microseconds.
[YYYY-MM-DDTHH:MM:SS] cpu47:2098038)WARNING: ScsiDeviceIO: 1216: Device naa.######################### performance has deteriorated. I/O latency increased from average value of 18057 microseconds to 534229 microseconds.

Environment

VMware vSphere ESXi 6.5.x
VMware vSphere ESXi 6.7.x
VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Cause

The issue occurs when:

a) the latency ratio to the last time the log was updated is 30 or higher
b) if the latency ratio has doubled since the last log was written.

The device latency may increase due to any of the following reasons:

Changes made to the target
Disk or media failures
Overload conditions on the device
A failover event

The numbers reported in the events are in microseconds, and refer to the DAVG measurements in the esxtop storage screen. See Using esxtop to identify storage performance issues for ESXi (multiple versions)

With traditional non-flash based technologies, the generally accepted threshold is about 10 milliseconds (10,000 microseconds).

With flash-based storage it is rare to see DAVG latency above 1-2 milliseconds, so these events should be investigated if the latency is higher.

Latency is a measure of the round-trip time between the issuance of a SCSI command from the hypervisor, through the transport to the surface of the media, and the return. Therefore, the source of the delay could be anywhere in the fabric, the storage infrastructure, or anywhere along the path.

Resolution

To get the LUN-level device performance statistics data, use the esxtop utility: Using esxtop to identify storage performance issues for ESXi (multiple versions)

High device latency:
If the device latency is high for a consistent period of time, check the storage performance by verifying the logs on the storage array for any indication of a failure. If failures are logged on the storage array side, contact the storage vendor for further assistance.

Check if these messages are generated during any scheduled tasks such as backups or replications, as these can cause intermittent performance problems.

Overload conditions on the device:
If the message is generated because of an overload condition, reduce the load on the affected storage device.

Additional Information

Framework to characterize the latency.

1) Magnitude: How high are the spikes in DAVG?

2) Duration: How long does each spike last?

3) Frequency: What pattern is exhibited by the date/time stamps?

4) Scope: How widespread are the events?

On one datastore, or multiple datastores?
On one ESXi host, or multiple ESXi hosts?
In one HA/DRS CLuster, or multiple clusters?

Magnitudes of limited amount for example, 20-30ms for a duration of only a few seconds, on an occasional frequency, on a small subset of datastores, is a vastly different situation than magnitudes of multiple seconds, for a duration of multiple minutes.

Finally, note that ESXi does not cause the latency spikes; it merely reports them. The root cause cannot be determined from the ESXi perspective alone. However, the data outlined above can help guide the investigation outside of the ESXi hosts.