ESXi device latency with "performance has deteriorated" messages in ESXi host logs.

Products

VMware vSphere ESXi

Issue/Introduction

Devices are reporting high latency; virtual machines may become unresponsive during I/O operations.
Virtual machines may appear to "freeze" or experience brief pauses. In severe cases, virtual disks may disconnect or guest OS filesystems may be marked as read-only.
Increased I/O latency as reflected in log messages has been observed.
Storage latency alerts have been triggered at the vCenter level.
High I/O wait is observed for VMs from the application level.
VM clone tasks are getting queued.

The following log entries are identified within /var/run/log/vmkernel.log:

[YYYY-MM-DDTHH:MM:SS] cpu51:2098041)WARNING: ScsiDeviceIO: 513: Device naa.########## performance has deteriorated. I/O latency increased from average value of 38762 microseconds to 776315 microseconds.
[YYYY-MM-DDTHH:MM:SS] cpu47:2098037)WARNING: ScsiDeviceIO: 1443: Device naa.######### performance has deteriorated. I/O latency increased from average value of 12017 microseconds to 254228 microseconds.
[YYYY-MM-DDTHH:MM:SS] cpu47:2098038)WARNING: ScsiDeviceIO: 1216: Device naa.######### performance has deteriorated. I/O latency increased from average value of 18057 microseconds to 534229 microseconds.
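
To pull the device ID and the two latency values out of entries like the ones above, a small parser can help. The following is a minimal Python sketch, not a supported tool. It assumes the message text matches the excerpt exactly; the function name parse_deteriorated and the sample device ID naa.EXAMPLE01 are hypothetical.

```python
import re

# Matches the "performance has deteriorated" warning as it appears in the excerpt above.
DETERIORATED = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg_us>\d+) microseconds "
    r"to (?P<cur_us>\d+) microseconds"
)


def parse_deteriorated(line):
    """Return device, average and current latency (ms), and their ratio, or None."""
    match = DETERIORATED.search(line)
    if match is None:
        return None
    avg_us = int(match.group("avg_us"))
    cur_us = int(match.group("cur_us"))
    return {
        "device": match.group("device"),
        "avg_ms": avg_us / 1000,  # the log reports microseconds
        "cur_ms": cur_us / 1000,
        "ratio": cur_us / avg_us if avg_us else None,
    }


# Hypothetical sample line using the values from the first entry above.
sample = (
    "cpu51:2098041)WARNING: ScsiDeviceIO: 513: Device naa.EXAMPLE01 "
    "performance has deteriorated. I/O latency increased from average value "
    "of 38762 microseconds to 776315 microseconds."
)
event = parse_deteriorated(sample)
```

For the sample line, the current latency is roughly 20 times the reported average (about 776 ms against about 39 ms).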

Additional symptoms reported

"We have been experiencing some of our applications slowing down"

Environment

vSphere ESX 9.x
vSphere ESXi 8.x
vSphere ESXi 7.x
vSphere ESXi 6.7.x

Cause

The issue may occur when:

The ratio of the current latency to the average latency is significantly higher than it was at the previous log entry.
The latency ratio has doubled since the last log entry was written.

The device latency may increase due to:

Changes made to the target
Disk or media failures
Overload conditions on the device
A failover event

The numbers reported in the events are in microseconds and refer to the DAVG measurements in the esxtop storage screen.

Refer to: Using esxtop to identify storage performance issues for ESXi

With traditional non-flash storage technologies, the generally accepted latency threshold is about 10 milliseconds (10,000 microseconds).

With flash-based storage, it is rare to see DAVG latency above 1-2 milliseconds, so these events should be investigated if the latency is higher.
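
Because the log reports microseconds while the guidelines above are given in milliseconds, a quick conversion and comparison can be scripted. This is a minimal Python sketch: the constant and function names are hypothetical, and using 2 ms (the upper end of the 1-2 ms range) as the flash guideline is an assumption made for illustration, not a documented limit.

```python
# Guideline values taken from the text above, expressed in microseconds.
NON_FLASH_GUIDELINE_US = 10_000  # about 10 ms for traditional non-flash storage
FLASH_GUIDELINE_US = 2_000       # assumption: upper end of the 1-2 ms range


def us_to_ms(latency_us):
    """Convert a latency in microseconds to milliseconds."""
    return latency_us / 1000


def exceeds_guideline(latency_us, flash):
    """Return True if the latency is above the guideline for the storage type."""
    limit_us = FLASH_GUIDELINE_US if flash else NON_FLASH_GUIDELINE_US
    return latency_us > limit_us


# Values from the first log entry in this article.
worth_investigating = exceeds_guideline(776315, flash=False)
```

A result of True only means the value is above the rule of thumb; it does not identify where in the storage path the delay originates.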

Latency is a measure of the round-trip time between the issuance of a SCSI command from the hypervisor, through the transport to the surface of the media, and the return. Therefore, the source of the delay could be anywhere in the fabric, the storage infrastructure, or anywhere along the storage path.

Resolution

To obtain LUN-level device performance statistics, use the esxtop utility. Refer to Using esxtop to identify storage performance issues for ESXi.

High device latency:

If the device latency is high for a consistent period of time, check the storage performance. If failures are logged on the storage array side, contact the storage vendor for further assistance.
Check if these messages are generated during any scheduled tasks such as backups or replications, as these can cause intermittent performance problems.

Overload conditions on the device:

If the message is generated because of an overload condition, reduce the load on the affected storage device.

Latency duration:

In vmkernel.log, check how much time elapsed between the "performance has deteriorated" log entry and the corresponding "performance has improved" log entry.
The two corresponding lines can usually be identified because the "to" value in the "latency increased" entry matches the "from" value in the "latency improved" entry.
Also, the device ID (naa, eui, etc.) will match.
A higher amount of recorded latency and a longer duration between the 'deteriorated' and 'improved' log entries typically result in a greater potential impact on the virtual machine.
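
The pairing described above can be sketched in Python. This is an illustration under stated assumptions, not a supported tool: it assumes each line carries an ISO 8601 timestamp (YYYY-MM-DDTHH:MM:SS), and it assumes the "improved" entry reads "I/O latency reduced from N microseconds", which should be checked against the actual log. The function name and sample lines are hypothetical.

```python
import re
from datetime import datetime

TS = r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"
DETERIORATED = re.compile(
    TS + r".*Device (?P<dev>\S+) performance has deteriorated\. I/O latency "
    r"increased from average value of \d+ microseconds to (?P<val>\d+) microseconds"
)
# Assumed wording of the matching "improved" entry; verify against your logs.
IMPROVED = re.compile(
    TS + r".*Device (?P<dev>\S+) performance has improved\. I/O latency "
    r"reduced from (?P<val>\d+) microseconds"
)


def pair_events(lines):
    """Pair entries by device ID and latency value; return (device, us, seconds)."""
    pending = {}   # (device, latency value) -> time the deterioration was logged
    episodes = []
    for line in lines:
        m = DETERIORATED.search(line)
        if m:
            key = (m.group("dev"), m.group("val"))
            pending[key] = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S")
            continue
        m = IMPROVED.search(line)
        if m:
            start = pending.pop((m.group("dev"), m.group("val")), None)
            if start is not None:
                end = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S")
                seconds = (end - start).total_seconds()
                episodes.append((m.group("dev"), int(m.group("val")), seconds))
    return episodes


# Hypothetical sample lines for illustration only.
sample_lines = [
    "2024-01-15T10:00:05.123Z cpu51:2098041)WARNING: ScsiDeviceIO: 513: Device "
    "naa.EXAMPLE01 performance has deteriorated. I/O latency increased from "
    "average value of 38762 microseconds to 776315 microseconds.",
    "2024-01-15T10:02:35.456Z cpu51:2098041)ScsiDeviceIO: 513: Device "
    "naa.EXAMPLE01 performance has improved. I/O latency reduced from "
    "776315 microseconds to 153000 microseconds.",
]
episodes = pair_events(sample_lines)
```

Entries that have no partner are left unpaired, since the match between the two values holds usually rather than always.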

Additional Information

Use the following framework to characterize the observed latency:

Magnitude: How high are the spikes in DAVG?
Duration: How long does each spike last?
Frequency: What pattern is exhibited by the date and time stamps?
Scope: How widespread are the events?
- Are they on one datastore or multiple datastores?
- Are they on one ESXi host or multiple ESXi hosts?
- Are they in one HA/DRS cluster or across multiple clusters?

For example, a limited magnitude (e.g., 20–30 ms) occurring for only a few seconds on a small subset of datastores represents a vastly different situation than magnitudes of multiple seconds lasting for several minutes.
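
One way to apply the magnitude and scope parts of this framework is to aggregate the logged events per device. The sketch below is a minimal Python illustration. It assumes the events have already been collected from each host as (host, device, latency in microseconds) tuples; the function name, host names, and device IDs are hypothetical.

```python
from collections import defaultdict


def summarize(events):
    """Group events by device: event count, peak latency (ms), and reporting hosts."""
    per_device = defaultdict(lambda: {"count": 0, "peak_ms": 0.0, "hosts": set()})
    for host, device, latency_us in events:
        entry = per_device[device]
        entry["count"] += 1
        entry["peak_ms"] = max(entry["peak_ms"], latency_us / 1000)
        entry["hosts"].add(host)
    return dict(per_device)


# Hypothetical events; the latency values are taken from the log excerpt in this article.
events = [
    ("esx01", "naa.EXAMPLE01", 776315),
    ("esx02", "naa.EXAMPLE01", 254228),
    ("esx01", "naa.EXAMPLE02", 534229),
]
summary = summarize(events)
```

A device reported by several hosts points toward the shared array or fabric, while a device reported by a single host suggests looking at that host's path first. Duration and frequency still have to be read from the timestamps.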

Note: ESXi does not cause latency spikes; it merely reports them. While the root cause cannot be determined from the ESXi perspective alone, the data outlined above can help guide an investigation into the external storage infrastructure.