vSAN performance and/or data-availability issues due to device(s) with critically low remaining 'Media Wearout Indicator' value.

Products

VMware vSAN
VMware vSAN 7.x
VMware vSAN 8.x
VMware vSAN 6.x

Issue/Introduction

SSD storage devices use flash memory for writing and storing data.

These devices are vendor-configured with overprovisioned capacity that is not visible to the user, hypervisor or Guest-OS. This capacity is used to improve the performance of the device (e.g. via wear-levelling) and to extend its lifespan/endurance (e.g. by replacing failed/failing blocks).

This overprovisioned space is not an unlimited resource: how much of it a device has varies with the rated endurance class/DWPD, the size of the device and the vendor specification.

Environment

VMware vSAN 6.x
VMware vSAN 7.x
VMware vSAN 8.x

Cause

Over time, as write IOs are issued to the storage device, it starts actively using an increasing number of the overprovisioned blocks, which leaves fewer of these blocks available. This is reflected in the SMART value for 'Media Wearout Indicator' on the device, which decreases from the initial value on a new device of 'Media Wearout Indicator 100'.

If a device is used to the point that the available overprovisioned space is close to exhaustion (e.g. 'Media Wearout Indicator 10', meaning 10% remaining), the device can start behaving abnormally. A typical symptom of a device with only a few % remaining is increased (and sometimes sporadic/intermittent) latency observed on the device.

This may be logged in vmkernel.log with a pattern similar to the following (where naa.xxxxxxxxxxxxxxxx is the identifier of the device):

WARNING: ScsiDeviceIO: xxxx: Device naa.xxxxxxxxxxxxxxxx performance has deteriorated. I/O latency increased from average value of 396 microseconds to 168821 microseconds.
WARNING: ScsiDeviceIO: xxxx: Device naa.xxxxxxxxxxxxxxxx performance has deteriorated. I/O latency increased from average value of 396 microseconds to 329795 microseconds.
WARNING: ScsiDeviceIO: xxxx: Device naa.xxxxxxxxxxxxxxxx performance has deteriorated. I/O latency increased from average value of 396 microseconds to 735467 microseconds.
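
To get a quick per-device summary of these warnings, the log lines can be filtered with awk. The following is a minimal sketch, not an official tool: the function name latency_warnings is hypothetical, and it assumes the warning text matches the pattern shown above (the device name follows the word "Device" and the new latency follows the word "to").

```shell
# Summarise "performance has deteriorated" warnings per device.
# Reads vmkernel.log-style lines on stdin and prints, for each device,
# the number of warnings and the highest latency reported (microseconds).
# Assumes the message layout shown in the log sample above.
latency_warnings() {
    awk '
        /performance has deteriorated/ {
            dev = ""; lat = 0
            for (i = 1; i < NF; i++) {
                if ($i == "Device") dev = $(i + 1)       # device identifier
                if ($i == "to")     lat = $(i + 1) + 0   # new latency value
            }
            if (dev == "") next
            count[dev]++
            if (lat > worst[dev]) worst[dev] = lat
        }
        END {
            for (d in count)
                printf "%s warnings=%d worst_us=%d\n", d, count[d], worst[d]
        }
    '
}
```

Example usage on a node, assuming the log is at its usual location: latency_warnings < /var/log/vmkernel.log. With more than one affected device the output order is not guaranteed, so pipe it through sort if a stable order is needed.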

This can also be validated via SSH to the node using 'esxtop' with the 'u' option (disk device view): problematic devices may show unexpectedly high DAVG/cmd (e.g. 100ms for IOs to be processed).

This issue is more likely to occur on devices used as vSAN Cache-tier devices, as these service more write IOs than Capacity-tier devices. It is therefore important to consider using devices with a higher endurance class/DWPD and/or a larger size for the Cache-tier, especially if the workload is write-intensive.

The current 'Media Wearout Indicator' of node-local devices can be checked via SSH to a node:

[root@hostname] localcli storage core device smart get -d naa.xxxxxxxxxxxxxxxx
SMART Data for Disk : naa.xxxxxxxxxxxxxxxx
Parameter                       Value  Threshold  Worst  Raw
------------------------------------------------------------
Health Status                   OK     N/A        N/A    N/A
Media Wearout Indicator         1      0          1      0    <<<---
Power-on Hours                  100    0          100    10
Power Cycle Count               100    0          100    16
Reallocated Sector Count        100    10         100    0
Drive Temperature               100    0          100    27
Write Sectors TOT Count         100    0          100    182
Read Sectors TOT Count          100    0          100    112
Initial Bad Block Count         100    0          100    0
Program Fail Count              100    0          100    0
Erase Fail Count                100    0          100    0
Uncorrectable Error Count       100    0          100    0
Pending Sector Reallocation Ct  100    0          100    0
------------------------------------------------------------
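
When only the wear value is of interest, it can be pulled out of this output with a small filter. The sketch below is illustrative only: wearout_value is a hypothetical helper name, and it assumes the row layout shown above, where the 'Value' column is the first field after the parameter name.

```shell
# Print the 'Value' column of the 'Media Wearout Indicator' row.
# Reads SMART output (as printed by 'localcli storage core device smart get')
# on stdin. Prints N/A if the row is missing.
# Assumes the column layout shown in the sample output above.
wearout_value() {
    awk '
        /^ *Media Wearout Indicator/ && !found {
            print $4        # fields 1-3 are the parameter name itself
            found = 1
        }
        END { if (!found) print "N/A" }
    '
}
```

Example usage: localcli storage core device smart get -d naa.xxxxxxxxxxxxxxxx | wearout_value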

Example basic loop to query the SMART stats of all of the node-local disks with vSAN partitions on them (run on each vSAN node via SSH):

# for i in $(vdq -Hi | grep -E "SSD|MD" | awk '{print $2}'); do echo "$i"; localcli storage core device smart get -d "$i"; done

Not all devices support retrieval of the 'Media Wearout Indicator' SMART stat; such devices return a value of 'N/A', which is normal and expected behaviour.

Stats analogous to 'Media Wearout Indicator' values may be available in out-of-band monitoring solutions such as iDRAC/iLO/XClarity which can also be used to indicate remaining overprovisioned space available on devices.

Resolution

Devices exhibiting clear signs of severe issues due to running out of overprovisioned space, such as frequent, unexpectedly high latency, should be physically replaced immediately.

There is no single 'Media Wearout Indicator' % remaining below which devices can be expected to stop functioning normally. The best guidance is that any device reaching single-digit % should be closely monitored and, with vendor approval, proactively replaced.
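
The single-digit guidance above can be expressed as a simple check for use in monitoring scripts. This is a sketch under stated assumptions: wearout_status is a hypothetical helper, and the cut-off of 10 is only an encoding of the "single-digit %" guidance, not a vendor-defined threshold.

```shell
# Classify a 'Media Wearout Indicator' value.
#   unknown : value is empty or non-numeric (e.g. N/A, stat not reported)
#   monitor : single-digit % remaining (monitor closely, plan replacement)
#   ok      : 10% or more remaining
# The cut-off of 10 is an assumption taken from the single-digit guidance.
wearout_status() {
    case "$1" in
        ''|*[!0-9]*)
            echo "unknown"
            ;;
        *)
            if [ "$1" -lt 10 ]; then
                echo "monitor"
            else
                echo "ok"
            fi
            ;;
    esac
}
```

For example, wearout_status 7 prints monitor, while wearout_status N/A prints unknown. A device reported as ok may still misbehave for other reasons, so observed latency symptoms take priority over this check.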

It may be possible to infer how long a device has been used for vSAN (or at least since it was last repartitioned) by checking the 'Creation Time' field from the output of 'localcli vsan storage list'.

If one device has a single-digit % 'Media Wearout Indicator' and is exhibiting late-stage issues then, because of how vSAN distributes IOs, other disks may also be reaching similarly low values. It is advised to check all devices.