vSAN (OSA/ESA) does not alert on or fail out NVMe disks for certain NVMe SMART errors

Products

VMware vSAN

Issue/Introduction

1. An NVMe disk failure is detected on a vSAN node, although the affected disk group/pool remains healthy.

2. You may experience a spike in write latency in the environment (<=50 ms)

3. Your server hardware management console (for example iDRAC, iLO, Lenovo XClarity Controller (XCC), or Cisco Integrated Management Controller (IMC)) reports a drive failure.

4. vSAN Disk Management in the vSphere Client does not display any unhealthy disks. All disks are shown as healthy.

5. You may or may not see a reported disk issue in vSAN Health (Physical Disk Health - Operation Health); in either case, the disk remains mounted, as seen in Disk Management.

Environment

vSAN OSA & ESA (All Versions)

NVMe disks

Cause

vSAN does not automatically fail NVMe drives based on SMART data because these metrics lack industry standardization across vendors. Automated reactions risk false positives, causing unnecessary data rebuilds and cluster performance degradation.

vobd.log:

2026-03-03T20:23:24.602Z In(182) vmkernel: cpu113:4737276)LSOMCommon: LSOMGetSmartData:1478: Getsmart support failed on disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______##############:2
2026-03-03T20:23:25.163Z In(182) vmkernel: cpu68:4737276)LSOMCommon: LSOMGetSmartData:1478: Getsmart support failed on disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______###############:2

2026-02-26T21:00:10.322Z In(14) vobd[2098147]:  The event ([esx.problem.vob.vsan.lsom.backupfailednvmediskhealthcriticalwarning] NVMe critical health warning for disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______################ is: The disk's backup device has failed.) was sent immediately to hostd;
2026-02-26T21:10:10.932Z In(14) vobd[2098147]:  [vSANCorrelator] 170637431890us: [esx.problem.vob.vsan.lsom.backupfailednvmediskhealthcriticalwarning] NVMe critical health warning for disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______################ is: The disk's backup device has failed.

VMkernel.log:

2026-02-27T00:28:57.541Z In(14) vobd[2098147]:  [vSANCorrelator] 115377453105us: [esx.problem.vob.vsan.lsom.backupfailednvmediskhealthcriticalwarning] NVMe critical health warning for disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______############# is: The disk's backup device has failed.
2026-02-27T00:28:57.541Z In(14) vobd[2098147]:  The event ([esx.problem.vob.vsan.lsom.backupfailednvmediskhealthcriticalwarning] NVMe critical health warning for disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______############### is: The disk's backup device has failed.) was sent immediately to hostd

2026-02-26T21:32:23.141Z In(182) vmkernel: cpu120:2099860)LSOM: LSOMLogDiskEvent:8418: Disk Event decommission for MD 52d2d87e-99ec-449f-89d2-############ (t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______#############:2)

vsandevicemonitord.log:

2026-02-26T20:00:06Z In(14) vsandevicemonitord[2100825]: [70509974144]: WARNING - NVMe critical health warning for disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______############# is: 'The disk's backup device has failed'.
2026-02-26T20:10:07Z In(14) vsandevicemonitord[2100825]: [70509974144]: WARNING - NVMe critical health warning for disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______############# is: 'The disk's backup device has failed'.
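
The warning lines above follow a regular pattern, so a collected log can be scanned for affected disks. The following is a minimal Python sketch, assuming the vsandevicemonitord.log line format matches the excerpts shown here (it may differ between ESXi builds). The function name `find_nvme_warnings` and the disk name in the sample are hypothetical and used only for illustration.

```python
import re

# Pattern derived from the vsandevicemonitord.log excerpts in this article.
# Assumption: other ESXi builds may format these lines differently.
WARNING_RE = re.compile(
    r"^(?P<ts>\S+)\s.*vsandevicemonitord\[\d+\]:.*"
    r"NVMe critical health warning for disk (?P<disk>\S+) is: "
    r"'(?P<reason>.*)'\.\s*$"
)


def find_nvme_warnings(lines):
    """Return (timestamp, disk, reason) tuples for matching log lines."""
    hits = []
    for line in lines:
        match = WARNING_RE.match(line)
        if match:
            hits.append(
                (match.group("ts"), match.group("disk"), match.group("reason"))
            )
    return hits


# Hypothetical sample lines; the disk identifier is a placeholder.
sample = [
    "2026-02-26T20:00:06Z In(14) vsandevicemonitord[2100825]: [70509974144]: "
    "WARNING - NVMe critical health warning for disk "
    "t10.NVMe____Example_Disk is: 'The disk's backup device has failed'.",
    "2026-02-26T20:00:07Z In(14) vsandevicemonitord[2100825]: unrelated line",
]

for ts, disk, reason in find_nvme_warnings(sample):
    print(ts, disk, reason)
```

Only the first sample line matches; the second is ignored because it carries no critical health warning.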

[root@ESXi:~] esxcli nvme device log smart get -A vmhba1
SMART And Health Info:
Available Spare Space Below Threshold: false
Temperature Warning: false
NVM Subsystem Reliability Degradation: false
Read Only Mode: false
Volatile Memory Backup Device Failure: true
Composite Temperature: 306 K
Available Spare: 100 %
Available Spare Threshold: 10 %
Percentage Used: 0 %
Data Units Read: 0x60ea0528
Data Units Written: 0x2f8fbda9
Host Read Commands: 0x27f7a927fb
Host Write Commands: 0x12f8084edf
Controller Busy Time: 0x13522
Power Cycles: 0x1a
Power On Hours: 0x3f91
Unsafe Shutdowns: 0x9
Media Errors: 0x0
Number of Error Info Log Entries: 0x2c
Warning Composite Temperature Time: 0 Mins
Critical Composite Temperature Time: 0 Mins
Temperature Sensor 1: 319 K
Temperature Sensor 2: 309 K
Temperature Sensor 3: 0 K
Temperature Sensor 4: 0 K
Temperature Sensor 5: 0 K
Temperature Sensor 6: 0 K
Temperature Sensor 7: 0 K
Temperature Sensor 8: 0 K
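
The output above can be checked programmatically once it has been saved as text. The following is a minimal Python sketch, assuming the "Key: Value" layout shown above. The helper names `parse_smart_output` and `to_int` are hypothetical, and the sample is an abbreviated copy of the output with the shell prompt line removed.

```python
def parse_smart_output(text):
    """Parse 'Key: Value' lines into a dict, skipping header and blank lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if not sep or not value:
            continue  # e.g. the "SMART And Health Info:" header
        fields[key.strip()] = value
    return fields


def to_int(value):
    """Convert values such as '0x3f91', '306 K' or '100 %' to an integer."""
    token = value.split()[0]
    return int(token, 16) if token.lower().startswith("0x") else int(token)


# Abbreviated copy of the esxcli output shown in this article.
sample = """SMART And Health Info:
Volatile Memory Backup Device Failure: true
NVM Subsystem Reliability Degradation: false
Composite Temperature: 306 K
Power Cycles: 0x1a
Power On Hours: 0x3f91
"""

info = parse_smart_output(sample)
raised = [name for name, value in info.items() if value == "true"]

print(raised)                                        # flags reported as true
print(to_int(info["Power On Hours"]))                # 16273
print(to_int(info["Power Cycles"]))                  # 26
print(to_int(info["Composite Temperature"]) - 273)   # 33 (whole degrees C)
```

The decoded counters agree with the values that `esxcli storage core device smart get` reports for the same drive (16273 power-on hours, 26 power cycles, 33 degrees).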

[root@ESXi:~] esxcli storage core device smart get -d t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______#####################
Parameter                 Value    Threshold  Worst  Raw
------------------------  -------  ---------  -----  ---
Health Status             WARNING  N/A        N/A    N/A
Power-on Hours            16273    N/A        N/A    N/A
Power Cycle Count         26       N/A        N/A    N/A
Reallocated Sector Count  0        90         N/A    N/A
Drive Temperature         33       75         N/A    N/A

A "Volatile Memory Backup Failed" error (typically reported as NVMe SMART critical warning 0x10, that is, bit 4 of the Critical Warning field) indicates that the capacitor or battery designed to save cached data from RAM to NAND during a power loss has failed or is degraded. Because this risks data loss on power failure, immediate actions include backing up data and replacing the SSD.

Failed Capacitor/Battery: The power-loss protection (PLP) capacitor on the SSD is faulty. The drive often requires replacement.

SMART Warning Trigger: The SSD’s internal health check detects this failure, often as part of a critical warning (0x10).
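
To illustrate how a raw Critical Warning value such as 0x10 maps to individual warnings, the following is a minimal Python sketch. It assumes the bit positions 0 to 4 follow the order of the five boolean flags in the esxcli output above, which is how the NVMe SMART / Health Information log page lays them out as far as this sketch assumes; verify against the specification revision your drive implements. The labels are paraphrased and `decode_critical_warning` is a hypothetical helper name.

```python
# Bit positions of the Critical Warning field (assumption: bits 0-4 as
# described in the lead-in; higher bits are not decoded in this sketch).
CRITICAL_WARNING_BITS = {
    0: "Available spare below threshold",
    1: "Temperature threshold exceeded",
    2: "NVM subsystem reliability degraded",
    3: "Media placed in read-only mode",
    4: "Volatile memory backup device failed",
}


def decode_critical_warning(value):
    """Return the names of all warning bits set in the given byte."""
    return [
        name
        for bit, name in sorted(CRITICAL_WARNING_BITS.items())
        if value & (1 << bit)
    ]


print(decode_critical_warning(0x10))
# ['Volatile memory backup device failed']
```

A value of 0x10 has only bit 4 set, which corresponds to "Volatile Memory Backup Device Failure: true" in the esxcli output.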

vSAN OSA/ESA does not treat "Volatile Memory Backup Device Failure: true" as a device failure.

While the 'Volatile Memory Backup' SMART attribute is a widely adopted metric among storage vendors, it currently lacks a unified industry standard defining it as an explicit indicator of drive failure. Without a formal consensus, categorizing this attribute as a critical fault may lead to conflicting interpretations across different hardware manufacturers. For this reason, the Dying Disk Handling (DDH) feature in vSAN currently does not mark this SMART error for remediation.

vSAN Health monitoring remains conservative in deciding when a disk must be replaced or evacuated; currently, only the 'Subsystem Reliability Degraded' status is strictly classified as a functional failure and formally reported to the Health service as a trigger for replacement or evacuation.
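
The distinction described above can be sketched as a simple triage step for an operator's own monitoring. This is only an illustration of the behavior this article describes, not vSAN's actual implementation. It assumes the 'Subsystem Reliability Degraded' status corresponds to the "NVM Subsystem Reliability Degradation" flag in the esxcli output, and the names `VSAN_ACTIONABLE` and `triage` are hypothetical.

```python
# Assumption: per this article, only this flag is classified by vSAN as a
# functional failure. The flag name is taken from the esxcli output.
VSAN_ACTIONABLE = {"NVM Subsystem Reliability Degradation"}


def triage(flags):
    """Split raised SMART flags into (handled by vSAN, needs manual follow-up)."""
    raised = {name for name, is_set in flags.items() if is_set}
    return sorted(raised & VSAN_ACTIONABLE), sorted(raised - VSAN_ACTIONABLE)


# Flag values as reported for the drive in this article.
flags = {
    "Available Spare Space Below Threshold": False,
    "Temperature Warning": False,
    "NVM Subsystem Reliability Degradation": False,
    "Read Only Mode": False,
    "Volatile Memory Backup Device Failure": True,
}

handled, manual = triage(flags)
print("Handled by vSAN:", handled)
print("Manual follow-up:", manual)
```

For the drive in this article, nothing falls into the first list, and the volatile memory backup failure lands in the second, which is why the hardware vendor has to be engaged manually.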

Wikipedia - S.M.A.R.T. Reference: Critical Warning: Volatile memory backup device failed. This usually refers to the power-loss protection capacitor.

Resolution

Engage the hardware vendor for further investigation and possible disk replacement.

See the KB "Enabling vSAN alerts for NVMe SMART data in vCenter" to be alerted in vCenter of potential future occurrences.

Additional Information

Enhanced Intelligence in vSAN 8 U3 and VMware Cloud Foundation 5.2

vSAN NVMe disk report read only critical warning

Determination of physical Disk Location on Host

Commands to Display S.M.A.R.T. Data for NVMe Devices in ESXi

ESXi S.M.A.R.T. health monitoring for hard drives

NVMe Specification - 2.0-2024.08.05 - Page 108:

Critical Warning (CWARN): This field indicates critical warnings for the Controller.
The value of this field shall indicate the value of the Critical Warning field in the Controller’s SMART / Health Information log page.
Volatile Memory Backup Failed (VMBF): This bit shall indicate the same value as the Volatile Memory Backup Failed (VMBF) bit (i.e., bit 4) in the Critical Warning field in the Controller’s SMART / Health Information log page.