ESXi device latency with "performance has deteriorated" messages in ESXi host logs

Article ID: 318927

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms

  • Devices report high latency, and virtual machines may become unresponsive during I/O operations
  • Virtual machines may appear to "freeze" or pause briefly. In severe cases, virtual disks may disconnect or guest OS filesystems may be marked as read-only
  • Log messages show increased I/O latency
  • Storage latency alerts are triggered at the vCenter level

The /var/run/log/vmkernel.log shows "performance has deteriorated" or "I/O latency increased" messages:

[YYYY-MM-DDTHH:MM:SS] cpu51:2098041)WARNING: ScsiDeviceIO: 513: Device naa.600600606001234567890abc performance has deteriorated. I/O latency increased from average value of 38762 microseconds to 776315 microseconds.
[YYYY-MM-DDTHH:MM:SS] cpu47:2098037)WARNING: ScsiDeviceIO: 1443: Device naa.600600606001234567890abc performance has deteriorated. I/O latency increased from average value of 12017 microseconds to 254228 microseconds.
[YYYY-MM-DDTHH:MM:SS] cpu47:2098038)WARNING: ScsiDeviceIO: 1216: Device naa.600600606001234567890abc performance has deteriorated. I/O latency increased from average value of 18057 microseconds to 534229 microseconds.
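
When many of these messages are logged, it can help to extract the device ID and the two latency values programmatically. The following is a minimal Python sketch, assuming the message wording matches the samples above exactly; the names LATENCY_RE, LatencyEvent, and parse_latency_event are hypothetical and are not part of any VMware tool.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the sample messages in this article only.
# Assumption: other ESXi builds may word the message differently.
LATENCY_RE = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<new>\d+) microseconds"
)


class LatencyEvent(NamedTuple):
    device: str       # device identifier, for example naa.xxx
    average_us: int   # previous average latency in microseconds
    current_us: int   # newly reported latency in microseconds


def parse_latency_event(line: str) -> Optional[LatencyEvent]:
    """Return the parsed event, or None if the line is not a latency message."""
    match = LATENCY_RE.search(line)
    if match is None:
        return None
    return LatencyEvent(match["device"], int(match["avg"]), int(match["new"]))


sample = (
    "[YYYY-MM-DDTHH:MM:SS] cpu51:2098041)WARNING: ScsiDeviceIO: 513: "
    "Device naa.600600606001234567890abc performance has deteriorated. "
    "I/O latency increased from average value of 38762 microseconds "
    "to 776315 microseconds."
)
event = parse_latency_event(sample)
print(event)
```

Lines that do not match the pattern return None, so the function can be applied to every line of a copied vmkernel.log without pre-filtering.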

Environment

VMware vSphere ESXi 6.5.x
VMware vSphere ESXi 6.7.x
VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x
VMware vSphere ESXi 9.x

Cause

The message is logged when either of the following is true:
a) The ratio of the reported latency to the average latency is significantly higher than at the previous log entry
b) The latency ratio has doubled since the last log entry was written
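
As a worked example of the ratio involved, the sketch below divides the reported latency by the average value for the three sample messages shown earlier. This article does not state the exact ratio that triggers logging, so the code only computes the ratio and does not attempt to reproduce the decision ESXi makes; latency_ratio is a hypothetical helper name.

```python
def latency_ratio(average_us: int, current_us: int) -> float:
    """Return how many times higher the reported latency is than the average."""
    if average_us <= 0:
        raise ValueError("average latency must be a positive number of microseconds")
    return current_us / average_us


# (average, reported) pairs in microseconds, copied from the sample log messages.
samples = [(38762, 776315), (12017, 254228), (18057, 534229)]

for average_us, current_us in samples:
    ratio = latency_ratio(average_us, current_us)
    print(f"{average_us} us -> {current_us} us: {ratio:.1f}x the average")
```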
 
The device latency may increase due to any of the following reasons:
  • Changes made to the target
  • Disk or media failures
  • Overload conditions on the device
  • A failover event

The values reported in these events are in microseconds and correspond to the DAVG (device latency) measurements in the esxtop storage screens. See Using esxtop to identify storage performance issues for ESXi.

With traditional non-flash storage, the generally accepted threshold for DAVG latency is about 10 milliseconds (10,000 microseconds).

With flash-based storage, DAVG latency above 1-2 milliseconds (1,000-2,000 microseconds) is rare, so events that report higher values should be investigated.
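
Because the log messages report microseconds while the guidance above is given in milliseconds, a small conversion helper can reduce mistakes. The following Python sketch applies the two rules of thumb above. It assumes 2 ms (the upper end of the 1-2 ms range) as the flash cutoff; THRESHOLD_MS, us_to_ms, and exceeds_threshold are hypothetical names, and the values are guidelines rather than hard limits.

```python
# Rule-of-thumb thresholds from the guidance above, in milliseconds.
# Assumption: 2.0 ms is used for flash, the upper end of the "1-2 ms" range.
THRESHOLD_MS = {"non_flash": 10.0, "flash": 2.0}


def us_to_ms(microseconds: int) -> float:
    """Convert a latency value from microseconds to milliseconds."""
    return microseconds / 1000.0


def exceeds_threshold(latency_us: int, media: str) -> bool:
    """Return True if the latency is above the rule-of-thumb value for the media type."""
    return us_to_ms(latency_us) > THRESHOLD_MS[media]


# Reported latency from the first sample message: 776315 microseconds.
print(us_to_ms(776315), "ms")
print(exceeds_threshold(776315, "non_flash"))
```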

Latency is a measure of the round-trip time between the issuance of a SCSI command from the hypervisor, through the transport to the surface of the media, and the return. Therefore, the source of the delay could be anywhere in the fabric, the storage infrastructure, or anywhere along the storage path.

Resolution

 
To collect LUN-level device performance statistics, use the esxtop utility. See Using esxtop to identify storage performance issues for ESXi.

High device latency:

  • If the device latency is high for a consistent period of time, check the storage performance. If failures are logged on the storage array side, contact the storage vendor for further assistance
  • Check if these messages are generated during any scheduled tasks such as backups or replications, as these can cause intermittent performance problems

Overload conditions on the device:
If the message is generated because of an overload condition, reduce the load on the affected storage device.

Additional Information

Use the following framework to characterize the latency:

1) Magnitude:  How high are the spikes in DAVG?

2) Duration:  How long does each spike last?

3) Frequency:  What pattern is exhibited by the date/time stamps?

4) Scope:  How widespread are the events?

  • On one datastore, or multiple datastores?
  • On one ESXi host, or multiple ESXi hosts?
  • In one HA/DRS Cluster, or multiple clusters?

A limited magnitude (for example, 20-30 ms) that lasts only a few seconds, occurs occasionally, and affects a small subset of datastores is a vastly different situation from magnitudes of multiple seconds that last for multiple minutes.
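
The four dimensions above can be summarized from a set of collected latency events. The following Python sketch is illustrative only: the Event fields, the characterize function, the host and device names, the timestamps, and the one-minute grouping gap are all assumptions. Because a log message marks only the moment it was written, the duration derived here is an approximation based on how closely messages follow one another.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    host: str        # ESXi host that logged the message (hypothetical names below)
    device: str      # device identifier from the message
    latency_us: int  # reported latency in microseconds


def characterize(events: List[Event],
                 gap: timedelta = timedelta(minutes=1)) -> Dict[str, object]:
    """Summarize events by magnitude, duration, frequency, and scope."""
    if not events:
        raise ValueError("at least one event is required")
    ordered = sorted(events, key=lambda e: e.timestamp)

    # Duration: treat events no more than `gap` apart as one spike.
    spikes = [[ordered[0]]]
    for ev in ordered[1:]:
        if ev.timestamp - spikes[-1][-1].timestamp <= gap:
            spikes[-1].append(ev)
        else:
            spikes.append([ev])

    per_hour = Counter(
        e.timestamp.replace(minute=0, second=0, microsecond=0) for e in ordered
    )
    return {
        "peak_latency_ms": max(e.latency_us for e in ordered) / 1000,  # magnitude
        "longest_spike": max(s[-1].timestamp - s[0].timestamp for s in spikes),
        "spike_count": len(spikes),                                    # frequency
        "events_per_hour": dict(per_hour),                             # frequency
        "devices": sorted({e.device for e in ordered}),                # scope
        "hosts": sorted({e.host for e in ordered}),                    # scope
    }


# Hypothetical events for illustration; not taken from a real environment.
events = [
    Event(datetime(2024, 1, 1, 2, 0, 0), "esx01", "naa.example-a", 254228),
    Event(datetime(2024, 1, 1, 2, 0, 30), "esx01", "naa.example-a", 776315),
    Event(datetime(2024, 1, 1, 2, 0, 50), "esx02", "naa.example-a", 534229),
    Event(datetime(2024, 1, 1, 14, 10, 0), "esx01", "naa.example-b", 30000),
]
summary = characterize(events)
for key, value in summary.items():
    print(f"{key}: {value}")
```

A summary like this makes it easier to tell an occasional, short spike on one datastore from a sustained problem that spans several hosts.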

Finally, note that ESXi does not cause the latency spikes; it merely reports them. The root cause cannot be determined from the ESXi perspective alone. However, the data outlined above can help guide the investigation outside of the ESXi hosts.