"performance has deteriorated" messages in ESXi host logs

Article ID: 318927

Products

VMware vSphere ESXi

Issue/Introduction

The ESXi host logs the following message in vmkernel.log when the I/O latency on a device rises well above the device's average latency:

Device naa.xxxxx123 performance has deteriorated. I/O latency increased from average value of 1832 microseconds to 19403 microseconds
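
When many of these messages need to be reviewed, it can help to pull out the device name and the two latency values programmatically. The following is a minimal Python sketch that matches only the message text quoted above; the function name parse_deterioration is hypothetical, and whatever prefix (timestamp, CPU, log level) precedes the message on a real vmkernel.log line is simply ignored rather than parsed.

```python
import re

# Matches the message text exactly as quoted in this article.
# Anything that precedes "Device" on the log line is ignored.
PATTERN = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg_us>\d+) microseconds "
    r"to (?P<new_us>\d+) microseconds"
)

def parse_deterioration(line):
    """Return (device, average_us, new_us), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return (
        match.group("device"),
        int(match.group("avg_us")),
        int(match.group("new_us")),
    )

line = (
    "Device naa.xxxxx123 performance has deteriorated. I/O latency increased "
    "from average value of 1832 microseconds to 19403 microseconds"
)
device, avg_us, new_us = parse_deterioration(line)
# Ratio of the reported latency to the reported average.
print(device, avg_us, new_us, round(new_us / avg_us, 1))
# prints: naa.xxxxx123 1832 19403 10.6
```

For the sample message, the reported latency is roughly 10.6 times the reported average.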

Environment

ESXi

Cause

This message is logged when either the latency ratio, compared with the last time the log was updated, is 30, or the ratio has doubled since the last log entry. The device latency may increase for one of these reasons:
  • Changes made on the target
  • Disk or media failures
  • Overload conditions on the device
  • Failover
The values reported in these events are in microseconds and correspond to the DAVG (device latency) measurements shown in the esxtop storage displays.
 
With traditional storage media (prior to flash-based technologies), the generally accepted threshold above which storage latency might be considered a constraint on performance was 10 milliseconds (10,000 microseconds).
 
With flash-based storage, it is rare to see DAVG latencies above 1-2 milliseconds, so these events should be investigated if they occur frequently.
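
Because the log reports microseconds while the guideline values above are in milliseconds, a small conversion helper makes the comparison less error-prone. This is only an illustrative sketch: the function names are hypothetical, and using 2 ms as the flash guideline is an assumption taken from the upper end of the "1-2 milliseconds" range mentioned above, not a documented threshold.

```python
def davg_ms(latency_us):
    """Convert a latency reported in microseconds to milliseconds."""
    return latency_us / 1000.0

def exceeds_guideline(latency_us, flash=False):
    """Compare a latency against the rule-of-thumb values from this article.

    10 ms is the traditional-media guideline; 2 ms for flash is an
    assumption (upper end of the 1-2 ms range), not an official limit.
    """
    threshold_ms = 2.0 if flash else 10.0
    return davg_ms(latency_us) > threshold_ms

# Values from the sample message: average 1832 us, new latency 19403 us.
print(exceeds_guideline(1832))                # False (1.832 ms is below 10 ms)
print(exceeds_guideline(19403))               # True  (19.403 ms is above 10 ms)
print(exceeds_guideline(19403, flash=True))   # True  (19.403 ms is above 2 ms)
```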
 
The latency is a measure of the round-trip time between the issuance of a SCSI command from the hypervisor, through the transport to the surface of the media, and return.  
 
The source of a delay can therefore be anywhere in the fabric, in the storage infrastructure, or in both.

Resolution

To collect LUN-level device performance statistics, use the esxtop utility. For more information, see Using esxtop to identify storage performance issues (1008205).

High device latency

If the device latency remains high for a sustained period, check the storage performance by reviewing the logs on the storage array for any indication of a failure. If failures are logged on the storage array side, take corrective action. Contact your storage vendor for information about checking logs on the array.

Also, check whether these messages coincide with scheduled tasks, such as backups or replications, as these can cause intermittent performance degradation.
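
One simple way to test for such a correlation is to count how many message timestamps fall inside a known task window. The sketch below assumes the timestamps have already been extracted from the logs as Python datetime objects; the sample timestamps and the 01:00 to 03:00 backup window are invented purely for illustration, and the function name in_window is hypothetical.

```python
from datetime import datetime, time

def in_window(ts, start, end):
    """True if the time of day of ts lies in [start, end).

    Windows that cross midnight (start later than end) are supported.
    """
    t = ts.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end

# Hypothetical sample data, not taken from a real host.
events = [
    datetime(2024, 1, 10, 1, 15),
    datetime(2024, 1, 10, 14, 5),
    datetime(2024, 1, 11, 1, 40),
]
backup_start, backup_end = time(1, 0), time(3, 0)

inside = [ts for ts in events if in_window(ts, backup_start, backup_end)]
print(f"{len(inside)} of {len(events)} events fall inside the backup window")
# prints: 2 of 3 events fall inside the backup window
```

If most events land inside the window, the scheduled task is a reasonable first suspect; if they are spread evenly across the day, the cause is more likely elsewhere.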

Overload conditions on the device

If the message is generated because of an overload condition, attempt to reduce the load on the affected storage device.
 
LUN replication tool is running

If a LUN replication tool is running, pause the task on the storage side and attempt a Storage vMotion to a different datastore. This should help improve I/O performance.

Additional Information

It is useful to consider a four-dimensional framework to characterize the latency events.

1) Magnitude:  How high are the spikes in DAVG?

2) Duration:  How long does each spike last?

3) Frequency:  What sort of pattern is exhibited by the date / time stamps?

4) Spread:  How widespread are the events?

  • On one datastore, or multiple datastores?
  • On one ESXi host, or multiple ESXi hosts?
  • In one HA/DRS cluster, or multiple clusters?

Spikes of limited magnitude (for example, 20-30 ms) that last only a few seconds, occur occasionally, and affect a small subset of datastores are a vastly different situation from spikes of multiple seconds that last for multiple minutes. Most VMs could perceive the latter as a storage outage, and Linux machines, for example, can set their disks read-only as a protective measure.
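
The four dimensions above can be approximated from a collection of parsed log events. The following Python sketch is one possible way to do so, under stated assumptions: the Event record, the summarize function, and the 60-second burst_gap used to group nearby messages into a single spike are all hypothetical choices, and the sample events are invented. Cluster-level spread is omitted because cluster membership is not present in the log message.

```python
from collections import namedtuple
from datetime import datetime, timedelta

# Hypothetical record for one parsed "performance has deteriorated" message.
Event = namedtuple("Event", "timestamp host device latency_us")

def summarize(events, burst_gap=timedelta(seconds=60)):
    """Summarize events by magnitude, duration, frequency, and spread."""
    if not events:
        raise ValueError("no events to summarize")
    events = sorted(events, key=lambda e: e.timestamp)

    # Duration/frequency: messages closer together than burst_gap are
    # treated as one burst (an assumption, not an ESXi definition).
    bursts = []
    for e in events:
        if bursts and e.timestamp - bursts[-1][1] <= burst_gap:
            bursts[-1][1] = e.timestamp
        else:
            bursts.append([e.timestamp, e.timestamp])

    return {
        "max_latency_ms": max(e.latency_us for e in events) / 1000.0,  # magnitude
        "longest_burst_s": max((end - start).total_seconds()
                               for start, end in bursts),              # duration
        "burst_count": len(bursts),                                    # frequency
        "hosts": sorted({e.host for e in events}),                     # spread
        "devices": sorted({e.device for e in events}),                 # spread
    }

# Invented sample data for illustration only.
sample = [
    Event(datetime(2024, 1, 10, 1, 15, 0), "esx01", "naa.xxxxx123", 19403),
    Event(datetime(2024, 1, 10, 1, 15, 20), "esx01", "naa.xxxxx123", 25000),
    Event(datetime(2024, 1, 10, 9, 0, 0), "esx02", "naa.xxxxx123", 30000),
]
summary = summarize(sample)
print(summary)
```

For the sample data, the summary shows a 30 ms peak, two bursts (the longest spanning 20 seconds), two hosts, and one device, which would correspond to the milder of the two situations described above.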