vSAN -- NVMe -- Physical Disk Failure but Disk is still online

Article ID: 433789

Updated On:

Products

VMware vSAN

Issue/Introduction

Your vSAN disks are NVMe devices.

You observe the following situation:

 

A physical disk failure has been detected on a vSAN node, but the affected disk group is still reported as healthy.
 
 
Example: The physical disk failure is reported by the vendor hardware management console, e.g. iDRAC from Dell.

 

 

However, vSAN Disk Management in the vSphere Client does not show any unhealthy disks. All disks are reported as healthy.

 

Nor is a disk issue reported in vSAN Health (Physical Disk Health - Operation Health).

 

However, the ESXi host logs contain the following error:

/var/log/vobd.log
/var/log/hostd.log

YYYY-MM-DDTHH:MM:SS.ZZ In(14) vobd[2098916]:  [vSANCorrelator] 16634690082948us: [vob.vsan.lsom.backupfailednvmediskhealthcriticalwarning] NVMe critical health warning for disk ###### is: The disk's backup device has failed.
YYYY-MM-DDTHH:MM:SS.ZZ In(14) vobd[2098916]:  [vSANCorrelator] 16634682466368us: [esx.problem.vob.vsan.lsom.backupfailednvmediskhealthcriticalwarning] NVMe critical health warning for disk ###### is: The disk's backup device has failed.
YYYY-MM-DDTHH:MM:SS.ZZ In(166) Hostd[2103842]: [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 136717 : NVMe critical health warning for disk ######.  The disk's backup device has failed.
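
Because the vSphere Client shows nothing unusual in this situation, the log message is the main indicator on the ESXi side. The following is a minimal sketch of how the vobd.log lines above could be searched for this event. It assumes the message format is exactly as shown in the excerpt; the function name `find_affected_disks` is hypothetical and not part of any VMware tool. The hostd.log line uses a slightly different wording and is not matched by this pattern.

```python
import re

# Event ID copied from the vobd.log excerpt above. The optional prefix
# covers the "esx.problem." variant of the same event.
# Assumption: the disk identifier contains no whitespace and is
# followed by " is:" as in the excerpt.
PATTERN = re.compile(
    r"\[(?:esx\.problem\.)?vob\.vsan\.lsom\."
    r"backupfailednvmediskhealthcriticalwarning\]"
    r" NVMe critical health warning for disk (\S+) is:"
)


def find_affected_disks(log_text):
    """Return the sorted, de-duplicated disk identifiers named in matching lines."""
    return sorted({match.group(1) for match in PATTERN.finditer(log_text)})
```

To use it, read a copy of /var/log/vobd.log into a string and pass it to `find_affected_disks`. Each returned identifier is a disk that reported the backup device failure at least once.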

 

Environment

vSAN (All Versions)

Cause

Hardware fault on one of the vSAN NVMe disks: Failed Volatile Memory Backup Device

The fault lies in a hardware component on the NVMe device itself.
It does not cause a full failure of the NVMe device, so the device might still show as online and healthy within vSAN.
 
 
Reference (Wikipedia, S.M.A.R.T.): Critical Warning: Volatile memory backup device failed. This usually points to the power-loss protection capacitor.
 

Resolution

Contact your hardware vendor to investigate the health of the affected NVMe device further.

Additional Information


  • Critical Warning (CWARN): This field indicates critical warnings for the Controller.
    The value of this field shall indicate the value of the Critical Warning field in the Controller’s SMART / Health Information log page.
  • Volatile Memory Backup Failed (VMBF): This bit shall indicate the same value as the Volatile Memory Backup Failed (VMBF) bit (i.e., bit 4) in the Critical Warning field in the Controller’s SMART / Health Information log page.
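
As an illustration of the bit layout described above, the following sketch decodes a Critical Warning byte once its raw value has been obtained from the device's SMART / Health Information log page. Bit 4 (VMBF) comes from the text above; the remaining bit names follow the NVMe Base Specification as an assumption about the device's spec revision. The function names are hypothetical.

```python
# Bit positions within the Critical Warning byte.
# Bit 4 is stated in the text above; the other entries are taken from
# the NVMe Base Specification and labeled here for context only.
CRITICAL_WARNING_BITS = {
    0: "Available spare capacity below threshold",
    1: "Temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "Media placed in read-only mode",
    4: "Volatile memory backup device failed (VMBF)",
    5: "Persistent Memory Region read-only or unreliable",
}


def decode_critical_warning(value):
    """Return the names of all warning bits set in a Critical Warning byte."""
    if not 0 <= value <= 0xFF:
        raise ValueError("Critical Warning is a single byte (0..255)")
    return [
        name
        for bit, name in sorted(CRITICAL_WARNING_BITS.items())
        if value & (1 << bit)
    ]


def vmbf_set(value):
    """Return True if bit 4 (Volatile Memory Backup Failed) is set."""
    return bool(value & (1 << 4))
```

A raw value of 0x10 has only bit 4 set, which corresponds to the failure described in this article. Because none of the other warning bits needs to be set, the device can otherwise appear healthy.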