vSAN (OSA/ESA) does not alert or fail out NVMe disk for certain NVMe SMART errors.

Article ID: 433789

Products

VMware vSAN

Issue/Introduction

1. An NVMe disk failure is detected on a vSAN node, although the affected disk group/pool remains healthy.

2. You may experience a spike in write latency in the environment (<= 50 ms).

3. The server hardware management console, such as iDRAC, iLO, Lenovo XClarity Controller (XCC), or Cisco IMC (Integrated Management Controller), reports a drive failure.

4. vSAN Disk Management in the vSphere Client does not display any unhealthy disks; all disks are shown as healthy.

5. vSAN Health (Physical Disk Health - Operation Health) may or may not report a disk issue; in either case, the disk remains mounted as seen in Disk Management.

Environment

vSAN OSA & ESA (All Versions)

NVMe disks

Cause

vobd.log: 

2026-03-03T20:23:24.602Z In(182) vmkernel: cpu113:4737276)LSOMCommon: LSOMGetSmartData:1478: Getsmart support failed on disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______##############:2
2026-03-03T20:23:25.163Z In(182) vmkernel: cpu68:4737276)LSOMCommon: LSOMGetSmartData:1478: Getsmart support failed on disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______###############:2
2026-02-26T21:00:10.322Z In(14) vobd[2098147]:  The event ([esx.problem.vob.vsan.lsom.backupfailednvmediskhealthcriticalwarning] NVMe critical health warning for disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______################ is: The disk's backup device has failed.) was sent immediately to hostd;
2026-02-26T21:10:10.932Z In(14) vobd[2098147]:  [vSANCorrelator] 170637431890us: [esx.problem.vob.vsan.lsom.backupfailednvmediskhealthcriticalwarning] NVMe critical health warning for disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______################ is: The disk's backup device has failed.


vmkernel.log:

2026-02-27T00:28:57.541Z In(14) vobd[2098147]:  [vSANCorrelator] 115377453105us: [esx.problem.vob.vsan.lsom.backupfailednvmediskhealthcriticalwarning] NVMe critical health warning for disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______############# is: The disk's backup device has failed.
2026-02-27T00:28:57.541Z In(14) vobd[2098147]:  The event ([esx.problem.vob.vsan.lsom.backupfailednvmediskhealthcriticalwarning] NVMe critical health warning for disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______############### is: The disk's backup device has failed.) was sent immediately to hostd

2026-02-26T21:32:23.141Z In(182) vmkernel: cpu120:2099860)LSOM: LSOMLogDiskEvent:8418: Disk Event decommission for MD 52d2d87e-99ec-449f-89d2-############ (t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______#############:2)


vsandevicemonitord.log:

2026-02-26T20:00:06Z In(14) vsandevicemonitord[2100825]: [70509974144]: WARNING - NVMe critical health warning for disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______############# is: 'The disk's backup device has failed'.
2026-02-26T20:10:07Z In(14) vsandevicemonitord[2100825]: [70509974144]: WARNING - NVMe critical health warning for disk t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______############# is: 'The disk's backup device has failed'.
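The vobd entries above can be searched for programmatically. The following is a minimal sketch, assuming the log lines have already been read into a list of strings. The event ID is copied from the excerpts above; the function name, the regular expression, and the device name in the sample line are hypothetical and used only for illustration. Only lines that carry the event ID are matched, so the vsandevicemonitord warnings, which do not include it, are ignored.

```python
import re

# Event ID copied from the vobd log excerpts in this article.
EVENT_ID = "esx.problem.vob.vsan.lsom.backupfailednvmediskhealthcriticalwarning"

# Assumed pattern: the device name sits between "for disk " and " is:".
DISK_RE = re.compile(r"for disk (\S+) is:")


def disks_with_backup_failure(lines):
    """Return the set of device names reported with the backup-device event."""
    disks = set()
    for line in lines:
        if EVENT_ID not in line:
            continue  # not the event of interest
        match = DISK_RE.search(line)
        if match:
            disks.add(match.group(1))
    return disks


# Hypothetical sample line; the device name is a placeholder, not a real disk.
sample_lines = [
    "2026-02-26T21:00:10.322Z In(14) vobd[2098147]:  The event "
    "([" + EVENT_ID + "] NVMe critical health warning for disk "
    "t10.NVMe____EXAMPLE_DISK is: The disk's backup device has failed.) "
    "was sent immediately to hostd;",
    "2026-02-26T21:00:11.000Z In(14) vobd[2098147]:  some unrelated entry",
]

print(disks_with_backup_failure(sample_lines))
```

Collecting the names into a set keeps each disk once even though the event repeats in the log, as it does in the excerpts above.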

 

[root@ESXi:~] esxcli nvme device log smart get -A vmhba1
SMART And Health Info:
   Available Spare Space Below Threshold: false
   Temperature Warning: false
   NVM Subsystem Reliability Degradation: false
   Read Only Mode: false
   Volatile Memory Backup Device Failure: true
   Composite Temperature: 306 K
   Available Spare: 100 %
   Available Spare Threshold: 10 %
   Percentage Used: 0 %
   Data Units Read: 0x60ea0528
   Data Units Written: 0x2f8fbda9
   Host Read Commands: 0x27f7a927fb
   Host Write Commands: 0x12f8084edf
   Controller Busy Time: 0x13522
   Power Cycles: 0x1a
   Power On Hours: 0x3f91
   Unsafe Shutdowns: 0x9
   Media Errors: 0x0
   Number of Error Info Log Entries: 0x2c
   Warning Composite Temperature Time: 0 Mins
   Critical Composite Temperature Time: 0 Mins
   Temperature Sensor 1: 319 K
   Temperature Sensor 2: 309 K
   Temperature Sensor 3: 0 K
   Temperature Sensor 4: 0 K
   Temperature Sensor 5: 0 K
   Temperature Sensor 6: 0 K
   Temperature Sensor 7: 0 K
   Temperature Sensor 8: 0 K
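
The boolean fields at the top of this output can be checked with a short script. This is a minimal sketch, assuming the command output has been captured as text. The field names are taken from the output above; the function names and the shortened sample are illustrative assumptions.

```python
# Boolean health fields as they appear in the esxcli output above.
CRITICAL_FLAGS = [
    "Available Spare Space Below Threshold",
    "Temperature Warning",
    "NVM Subsystem Reliability Degradation",
    "Read Only Mode",
    "Volatile Memory Backup Device Failure",
]


def parse_smart_output(text):
    """Turn 'Key: value' lines into a dict, skipping headers without a value."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def raised_flags(fields):
    """Return the names of the health fields whose value is 'true'."""
    return [name for name in CRITICAL_FLAGS
            if fields.get(name, "").lower() == "true"]


# Shortened copy of the output shown above.
SAMPLE = """SMART And Health Info:
   Available Spare Space Below Threshold: false
   Temperature Warning: false
   NVM Subsystem Reliability Degradation: false
   Read Only Mode: false
   Volatile Memory Backup Device Failure: true
   Composite Temperature: 306 K
"""

print(raised_flags(parse_smart_output(SAMPLE)))
# prints ['Volatile Memory Backup Device Failure']
```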


[root@ESXi:~] esxcli storage core device smart get -d t10.NVMe____Dell_NVMe_ISE_PS1030_MU_U.2_6.4TB_______#####################
Parameter                 Value    Threshold  Worst  Raw
------------------------  -------  ---------  -----  ---
Health Status             WARNING  N/A        N/A    N/A
Power-on Hours            16273    N/A        N/A    N/A
Power Cycle Count         26       N/A        N/A    N/A
Reallocated Sector Count  0        90         N/A    N/A
Drive Temperature         33       75         N/A    N/A
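
The two command outputs describe the same drive and can be cross-checked: the NVMe log reports counters in hexadecimal and temperature in kelvin, while the device SMART table uses decimal values. The sketch below is illustrative only; the helper names are hypothetical, and it assumes the Drive Temperature column is in degrees Celsius.

```python
def hex_counter(text):
    """Convert a hexadecimal counter such as '0x3f91' to a decimal integer."""
    return int(text, 16)


def kelvin_to_celsius(kelvin):
    """Convert kelvin to the nearest whole degree Celsius."""
    return round(kelvin - 273.15)


print(hex_counter("0x3f91"))    # Power On Hours -> 16273
print(hex_counter("0x1a"))      # Power Cycles -> 26
print(kelvin_to_celsius(306))   # Composite Temperature -> 33
```

The results match the Power-on Hours (16273), Power Cycle Count (26), and Drive Temperature (33) rows in the table above.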

 

A "Volatile Memory Backup Failed" error is reported as bit 4 (value 0x10) of the NVMe SMART Critical Warning field. It indicates that the capacitor or battery designed to save cached data from RAM to NAND during a power loss has failed or is degraded. This risks data loss on power failure; immediate actions include backing up data and replacing the SSD.

Failed Capacitor/Battery: The power-loss protection (PLP) capacitor on the SSD is faulty. The drive often requires replacement.

SMART Warning Trigger: The SSD’s internal health check detects this failure, often as part of a critical warning (0x10).
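
The relationship between the value 0x10 and bit 4 can be illustrated by decoding the Critical Warning byte. This is a minimal sketch: bit 4 is confirmed by this article, while the names for bits 0 to 3 reflect my reading of the NVMe Base Specification and should be verified against the specification revision the drive implements. The function and table names are hypothetical.

```python
# Bit positions in the Critical Warning byte.
# Bit 4 is stated in this article; bits 0-3 are assumptions based on the
# NVMe Base Specification and should be verified for the drive in question.
CRITICAL_WARNING_BITS = {
    0: "Available spare capacity below threshold",
    1: "Temperature above or below threshold",
    2: "NVM subsystem reliability degraded",
    3: "Media placed in read-only mode",
    4: "Volatile memory backup device failed",
}


def decode_critical_warning(value):
    """Return the names of all warning bits set in the given byte."""
    return [name for bit, name in CRITICAL_WARNING_BITS.items()
            if value & (1 << bit)]


print(decode_critical_warning(0x10))
# prints ['Volatile memory backup device failed']
```

Because 0x10 equals 1 shifted left by 4, a Critical Warning value of 0x10 means that only bit 4 is set.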

vSAN OSA/ESA does not treat "Volatile Memory Backup Device Failure: true" as a device failure.

While the 'Volatile Memory Backup' SMART attribute is a widely adopted metric among storage vendors, it currently lacks a unified industry standard defining it as an explicit indicator of drive failure. Without a formal consensus, categorizing this attribute as a critical fault may lead to conflicting interpretations across different hardware manufacturers. For this reason, the Dying Disk Handling (DDH) feature in vSAN currently does not mark disks reporting this SMART error for remediation.

vSAN Health monitoring remains conservative in deciding when to fail out a disk; currently, only the 'Subsystem Reliability Degraded' status is classified as a functional failure and reported to the Health service as a trigger for replacement or evacuation.
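
The behavior described above can be summarized as a simple triage rule for monitoring scripts. The sketch below only restates this article's description and is not vSAN's actual implementation. It assumes that 'Subsystem Reliability Degraded' corresponds to the esxcli field "NVM Subsystem Reliability Degradation"; the function name and the returned labels are hypothetical.

```python
# Assumption: the status this article calls 'Subsystem Reliability Degraded'
# is the esxcli field named below. Per the article, it is the only SMART
# warning that vSAN classifies as a functional failure.
VSAN_FAILURE_FLAGS = {"NVM Subsystem Reliability Degradation"}


def triage(raised):
    """Classify a disk from the names of its raised SMART warning flags."""
    raised = set(raised)
    if raised & VSAN_FAILURE_FLAGS:
        return "vsan-failure"        # vSAN is expected to report the disk
    if raised:
        return "vendor-follow-up"    # vSAN stays silent; engage the vendor
    return "healthy"


print(triage(["Volatile Memory Backup Device Failure"]))
# prints vendor-follow-up
```

A result of "vendor-follow-up" corresponds to the situation in this article: the drive reports a warning that vSAN does not act on, so the hardware vendor has to be engaged separately.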

Reference (Wikipedia, S.M.A.R.T.): "Critical Warning: Volatile memory backup device failed. This usually means power-loss protection capacitor."
 

Resolution

Engage the hardware vendor for further investigation and possible disk replacement.

See the KB article "Enabling vSAN alerts for NVMe SMART data in vCenter" to receive vCenter alerts for future occurrences.

 

Additional Information

Critical Warning (CWARN): This field indicates critical warnings for the Controller.
The value of this field shall indicate the value of the Critical Warning field in the Controller’s SMART / Health Information log page. 
Volatile Memory Backup Failed (VMBF): This bit shall indicate the same value as the Volatile Memory Backup Failed (VMBF) bit (i.e., bit 4) in the Critical Warning field in the Controller’s SMART / Health Information log page.