"performance has deteriorated" messages in ESXi host logs

Products

VMware vSphere ESXi

Issue/Introduction

The ESXi host logs the following message in vmkernel.log when the latency on a device is higher than its average latency:

Device naa.xxxxx123 performance has deteriorated. I/O latency increased from average value of 1832 microseconds to 19403 microseconds
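As an illustration, the fields in this message can be pulled out with a regular expression. This is a minimal sketch that assumes the message text matches the example above exactly; the names MESSAGE_RE and parse_message are hypothetical helpers, not part of any VMware tool.

```python
import re

# Pattern for the message text shown above. Treating the device ID as any run
# of non-whitespace characters after "Device" is an assumption.
MESSAGE_RE = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<average_us>\d+) microseconds "
    r"to (?P<current_us>\d+) microseconds"
)


def parse_message(text):
    """Return the device ID plus average and current latency in microseconds, or None."""
    match = MESSAGE_RE.search(text)
    if match is None:
        return None
    return {
        "device": match.group("device"),
        "average_us": int(match.group("average_us")),
        "current_us": int(match.group("current_us")),
    }


sample = (
    "Device naa.xxxxx123 performance has deteriorated. I/O latency increased "
    "from average value of 1832 microseconds to 19403 microseconds"
)
event = parse_message(sample)
# Ratio of the reported latency to the reported average.
ratio = event["current_us"] / event["average_us"]
print(event["device"], event["current_us"] / 1000, "ms, ratio", round(ratio, 1))
```

For the example message, the reported latency is roughly ten times the reported average.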

Environment

VMware vSphere ESXi 6.x
VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Cause

This message is logged when the latency ratio, relative to the last time the log was updated, is 30, or when the ratio has doubled since the last log entry. Device latency may increase for one of these reasons:

Changes made on the target
Disk or media failures
Overload conditions on the device
Failover

The numbers reported in the events are measured in microseconds, and they refer to DAVG measurements, as seen in "esxtop" storage displays.

With traditional storage media (prior to flash-based technologies), the generally accepted threshold above which storage latency might be considered a constraint on performance was 10 milliseconds (10,000 microseconds).

With flash-based storage, it is rare to see DAVG latencies above 1-2 milliseconds, so these events should be investigated if the frequency is high.
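The unit conversion and comparison described above can be sketched as follows. The guideline values come from the text; using the upper end (2 ms) of the 1-2 ms flash range, and the names GUIDELINE_MS and exceeds_guideline, are assumptions made for illustration only.

```python
# Guideline values quoted in the text, expressed in milliseconds.
# Choosing 2 ms for flash (the upper end of the 1-2 ms range) is an assumption.
GUIDELINE_MS = {"traditional": 10.0, "flash": 2.0}


def exceeds_guideline(latency_us, media):
    """Convert a microsecond reading to milliseconds and compare it with the guideline."""
    latency_ms = latency_us / 1000
    return latency_ms, latency_ms > GUIDELINE_MS[media]


# The two values from the example log message.
print(exceeds_guideline(19403, "traditional"))
print(exceeds_guideline(1832, "flash"))
```

A single reading above the guideline is not conclusive on its own; how often and for how long it happens matters as well.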

The latency is a measure of the round-trip time between the issuance of a SCSI command from the hypervisor, through the transport to the surface of the media, and return.

So, the source of any delay could be anywhere in the fabric, or the storage infrastructure, or both.

Resolution

To get LUN-level device performance statistics, use the esxtop utility. For more information, see: Using esxtop to identify storage performance issues (1008205)

High device latency

If the device latency is too high for a consistent period of time, check the storage performance by verifying the logs on the storage array for any indication of a failure. If failures are logged on the storage array side, take corrective actions. Contact your storage vendor for information regarding checking logs on the array.

Also, check whether these messages coincide with scheduled tasks, such as backups or replications, as these can also cause intermittent performance hits.

Overload conditions on the device

If the message is generated because of an overload condition, attempt to reduce the load on the affected storage device.

LUN replication tool is running

If running a LUN replication tool, pause the task from the storage end and attempt a storage vMotion to a different datastore. This should help improve the I/O operations.

NOTE: Please also see the "Additional Information" section below, to better scope the extent of the latency issues you are seeing.

Additional Information

It is useful to consider a 4-dimensional framework to characterize the latency.

1) Magnitude: How high are the spikes in DAVG?

2) Duration: How long does each spike last?

3) Frequency: What sort of pattern is exhibited by the date / time stamps?

4) Spread: How widespread are the events?

On one datastore, or multiple datastores?
On one ESXi host, or multiple ESXi hosts?
In one HA/DRS cluster, or multiple clusters?

Spikes of limited magnitude (for example, 20-30 ms) that last only a few seconds, occur occasionally, and affect a small subset of datastores are a vastly different situation from spikes of multiple seconds that last for multiple minutes. The latter could be perceived by most VMs as a storage outage, and Linux machines, for example, can turn their disks read-only as a protective measure.

A USEFUL STRATEGY TO SCOPE THE EXTENT:

1) Extract all of the events from vmkernel.log (and its .gz rotations) that contain the string "performance has deteriorated", then export those events and import them into a spreadsheet application such as Excel or OpenOffice.

2) With the raw data in the sheet, parse the data into Columns.

3) Example Column headings would be:

ESXi Host name
Date (Based on time in UTC)
Time in UTC
Datastore device ID
Magnitude of the Spike (the event message reports microseconds; divide by 1,000 to convert to milliseconds, which is the more commonly discussed measure of latency).

4) Sort the data by Hostname, Date and Time in UTC.

It is then reasonably easy to calculate a "Duration between event and previous event in hh:mm:ss.milliseconds" value, which is the difference between the time stamps (making sure that the two events being subtracted are for the same host, datastore and date).
At this point, copy everything out and paste it back using "paste special", so that the data can be sorted on any column without formulas altering it.

5) Next, use the Data --> Filter feature of Excel to analyze the data by Magnitude, Duration, Frequency and Spread.

The idea here is to get a sense of the answers to questions like:
- Are the spikes on one datastore, or many?
- Are the spikes on one host, or many?
- Are the spikes sufficiently high to cause workload issues? (for example, more than 30 milliseconds for an extended duration)
- What date / time patterns are observed, and how might those correlate to other logs such as storage array, physical switches, etc.?

6) Finally, please remember that ESXi does not cause the latency spikes; it merely reports them. The cause cannot be determined from an ESXi point of view, but data as outlined above can help inform the investigation outside of the ESXi host(s).
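For those who prefer a script to a spreadsheet, steps 1 through 5 above can be sketched in Python. This is a rough illustration under stated assumptions, not a supported tool: it assumes each vmkernel log line begins with an ISO 8601 UTC timestamp ending in "Z", that the current and rotated logs match the file pattern vmkernel*, and that the host name is supplied by the caller because it is not part of the message. All function names are hypothetical.

```python
import glob
import gzip
import os
import re
from collections import Counter
from datetime import datetime

# Assumed line layout: an ISO 8601 UTC timestamp at the start of the line,
# followed later by the message text quoted in this article.
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{3})?)Z.*"
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<cur>\d+) microseconds"
)


def read_lines(path):
    """Yield text lines from a plain or gzip-compressed log file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as handle:
        yield from handle


def collect_events(host, log_dir):
    """Steps 1-3: extract matching lines and split them into columns."""
    events = []
    # The "vmkernel*" file pattern is an assumption about log naming.
    for path in sorted(glob.glob(os.path.join(log_dir, "vmkernel*"))):
        for line in read_lines(path):
            m = EVENT_RE.search(line)
            if m:
                events.append({
                    "host": host,
                    "time": datetime.fromisoformat(m.group("ts")),  # UTC
                    "device": m.group("device"),
                    "latency_ms": int(m.group("cur")) / 1000,  # us -> ms
                })
    return events


def add_gaps(events):
    """Step 4: sort, then store the time since the previous event
    for the same host, device and date (None when there is none)."""
    events.sort(key=lambda e: (e["host"], e["device"], e["time"]))
    previous = None
    for e in events:
        same_group = previous is not None and (
            (previous["host"], previous["device"], previous["time"].date())
            == (e["host"], e["device"], e["time"].date())
        )
        e["gap"] = e["time"] - previous["time"] if same_group else None
        previous = e
    return events


def summarize(events, threshold_ms=30.0):
    """Step 5: counts that speak to spread and magnitude."""
    return {
        "per_host": Counter(e["host"] for e in events),
        "per_device": Counter(e["device"] for e in events),
        "above_threshold": sum(e["latency_ms"] > threshold_ms for e in events),
    }
```

A call such as summarize(add_gaps(collect_events("esx01", "/path/to/logs"))) would return the per-host and per-device counts, where both the host name and the path are placeholders. The sketch sorts by host, device and time rather than by host, date and time so that consecutive rows are directly comparable when computing the gap; the 30 ms default threshold mirrors the example figure given in the questions above.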