There are different scenarios for how a vSAN Disk Group (DG) responds to a given failure:
| Deduplication and Compression Enabled? | Failure Type | Symptom |
| --- | --- | --- |
| Yes | Cache Disk Failure | The whole Disk Group will be down |
| Yes | Capacity Disk Failure | The whole Disk Group will be down |
| No | Cache Disk Failure | The whole Disk Group will be down |
| No | Capacity Disk Failure | Only the failed Disk will be down |
To check whether Deduplication & Compression is enabled using esxcli:
2.1 SSH to the affected ESXi host
2.2 Run the following command:
esxcli vsan storage list | grep -i dedup
If Deduplication is enabled, you will see the following output for a disk: Deduplication: true
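To see which disk group a given disk belongs to while checking this, the same command lists the grouping fields as well. A minimal sketch, assuming a standard ESXi shell; field names can vary slightly between ESXi releases:
# Show each vSAN-claimed device with its disk group membership and dedup state
esxcli vsan storage list | grep -i -E "Device|Disk Group|Capacity Tier|Deduplication"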
If Deduplication & Compression is not enabled:
vCenter Web Client: Cluster --> Monitor --> vSAN --> Skyline Health --> Physical disk --> Operation health
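The same health check can also be read from the host command line. A minimal sketch, assuming vSAN 6.6 or later where the esxcli vsan health namespace is available; use the exact test name reported by the list command, as it can differ by release:
# List the Skyline Health checks known to this host
esxcli vsan health cluster list
# Show the detail of a specific check, for example the physical disk operation health test
esxcli vsan health cluster get -t "Operation health"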
If Deduplication & Compression is enabled:
Any disk failure will cause the whole Disk Group to go offline. Please refer to: vSAN Deduplication enabled -- Identifying Failed Disk
The physical disk can be located in different ways. Please refer to the articles below.
Verify the data health on the vSAN cluster. Refer to: vSAN Health Service - Data Health – vSAN Object Health
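Object health can also be summarized from any host in the cluster. A minimal sketch, assuming vSAN 6.0 U3 or later; output columns vary by release:
# Summarize vSAN object health (healthy, reduced availability, inaccessible, etc.)
esxcli vsan debug object health summary get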
vSAN is expected to start rebuilding the data on the next available disk or host, depending on the storage policy, to bring objects back into compliance. The default repair delay timer is 60 minutes, so the rebuild starts 60 minutes after the failure. If there is a need to change the default repair timer, please refer to the article Changing the default repair delay time for a host failure in vSAN.
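As a quick check, the current repair delay on a host is exposed through the VSAN.ClomRepairDelay advanced setting. A minimal sketch for reading the value only; follow the KB above for the supported change procedure:
# Display the current repair delay (in minutes) on this host
esxcfg-advcfg -g /VSAN/ClomRepairDelay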
Verify the requirements for disk replacement on the vSAN cluster: Requirements when replacing disks in a vSAN cluster
The /var/log/vmkernel.log reports READ (0x28), WRITE (0x2a) and Medium (0x3 0x11 0x0) errors:
2022-10-12T20:03:14.424Z cpu5:2098330)ScsiDeviceIO: 4325: Cmd(0x45be74dae040) 0x28, CmdSN 0xd65263a8 from world 0 to dev "naa.#####" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0
2022-10-12T20:46:00.107Z cpu86:2098331)ScsiDeviceIO: 4277: Cmd(0x45de6e7527c0) 0x28, CmdSN 0xcebe from world 0 to dev "naa.#####" failed H:0xc D:0x0 P:0x0
vSAN Disk Or Diskgroup Fails With Medium Errors (0x3 0x11)
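To confirm this pattern quickly, the vmkernel log can be searched for the medium-error sense data against the suspect device. A minimal sketch, with naa.##### standing in for the real device identifier:
# Look for medium errors (sense key 0x3, ASC 0x11) reported against the device
grep "naa.#####" /var/log/vmkernel.log | grep "0x3 0x11"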
egrep -i "perm|offline|unhealthy" /var/log/vobd.log (You can also search on the disk UUID)2022-10-12T20:03:18.694Z: [vSANCorrelator] 27997683071354us: [esx.problem.vob.vsan.lsom.devicerepair] Device ####### is in offline state and is getting repaired2022-10-12T20:46:00.111Z: [vSANCorrelator] 28000195517670us: [vob.vsan.lsom.diskerror] vSAN device ###### is under permanent error.The /var/run/log/vmkernel.log would show the events as below with sense data H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.
/var/run/log/vmkernel.log:2021-01-05T08:37:16.337Z cpu26:2098033)ScsiDeviceIO: 3047: Cmd(0x45a3e27a1700) 0x2a, CmdSN 0x2238d from world 2960707 to dev "naa.#####" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.
Please refer to the KB for more information on the issue and the fix: vSAN Disk or group may report failure on vSAN
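The vobd.log entries above reference the vSAN device UUID rather than the physical device name. A minimal sketch for mapping that UUID back to a device, assuming the UUID from the log is substituted for the placeholder; widen -B if the Device line sits further up in your release's output:
# Find the physical device (naa.*) that owns the vSAN UUID reported in vobd.log
esxcli vsan storage list | grep -B 5 "<vSAN device UUID>"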
Dying Disk Handling (DDH) unmounts the disk or reports it unhealthy in the following situations:
localcli storage core device smart get -d naa.#####
SMART Data for Disk : naa.#####
Parameter                     Value              Threshold  Worst
-----------------------------------------------------
Health Status                 IMPENDING FAILURE  N/A        N/A
Media Wearout Indicator       N/A                N/A        N/A
Write Error Count             0                  N/A        N/A
Read Error Count              369                N/A        N/A
Power-on Hours                N/A                N/A        N/A
Power Cycle Count             47                 N/A        N/A
Reallocated Sector Count      N/A                N/A        N/A
Raw Read Error Rate           N/A                N/A        N/A
Drive Temperature             30                 N/A        N/A
Driver Rated Max Temperature  N/A                N/A        N/A
Write Sectors TOT Count       N/A                N/A        N/A
Read Sectors TOT Count        N/A                N/A        N/A
Initial Bad Block Count       N/A                N/A        N/A
-----------------------------------------------------
/var/run/log/vsandevicemonitord.log:
WARNING - WRITE Average Latency on VSAN device <NAA disk name> has exceeded threshold value <IO latency threshold for disk> us <# of intervals with excessive IO latency> times.
WARNING - Maximum log congestion on VSAN device <NAA disk name> <current intervals with excessive log congestion>/<intervals required to be unhealthy>
WARNING - SMART health status for disk naa.##### is IMPENDING FAILURE.
Refer to the articles below for detailed symptoms and resolution.
vSAN -- DDH -- Disk Groups show as unmounted in the vSphere Web Client
Dying Disk Handling (DDH) in vSAN
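As a quick check before working through those articles, the device monitor log records the DDH warnings that triggered the unmount or unhealthy state. A minimal sketch using the log file shown above:
# List DDH warnings (latency, log congestion, SMART) recorded for vSAN devices
grep -i "WARNING" /var/run/log/vsandevicemonitord.log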
vSAN hard disk health status shows as Evacuated.
Inability to remove evacuated disk from Disk Group
VCF upgrade pre-check fails for the VSAN disk group due to disk failure
/var/run/log/vmkernel.log:
2022-10-21T02:50:51.069Z cpu0:2098435)ScsiDeviceIO: 3501: Cmd(0x45a203564900) 0x28, cmdId.initiator=0x45223c91a7f0 CmdSN 0xaa97f from world 0 to dev "naa.#####" failed H:0x5 D:0x0 P:0x0 Aborted at driver layer. Cmd count Active:2 Queued:0
2022-10-21T04:41:13.494Z cpu0:2098435)ScsiDeviceIO: 3463: Cmd(0x45aa8ffdedc0) 0x28, CmdSN 0x2 from world 2102512 to dev "naa.#####" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
The warning 'Errors occurred on the disk(s) of a vSAN host' is raised on the vSAN cluster and host.
When this is encountered, the vmkernel log will report a PDL (permanent device loss) or APD (all paths down) condition associated with a device.
The most common scenario is a disk going into PDL. vSAN interprets this as a permanent condition and marks the disk as permanently unavailable, as I/O fails with a "not supported" status.
WARNING: NMP: nmp_PathDetermineFailure:2961: Cmd (0x2a) PDL error (0x5/0x25/0x0) - path vmhba2:C2:T2:L0 device naa.########- triggering path evaluation
NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x2a (0x439ee894cc00, 0) to dev "naa.########" on path "vmhba2:C2:T2:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:EVAL
LSOMCommon: IORETRYCompleteIO:495: Throttled: 0x439ee7ea0c00 IO type 304 (WRITE) isOdered:NO since 20392 msec status Not supported
WARNING: LSOM: LSOMEventNotify:6126: Virtual SAN device #####-#####-#####-#####-######### is under permanent error.
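To confirm whether the device itself is in PDL, its state can be queried directly. A minimal sketch, assuming the NAA identifier of the suspect disk is known; a device in PDL typically no longer reports a status of "on":
# Check the current state of the device
esxcli storage core device list -d naa.######## | grep -i "Status"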
When an NVMe device hits a controller failure event, it is marked in vmkernel.log as "(state: 9 CONTROLLER_STATE_FAILED)":
vmkernel: cpu44:2097718)NVMEPSA:1345 taskMgmt:abort cmdId.initiator=0x4309ec4bc1c0 CmdSN 0x3b65426 world:0 controller 265 state:9 nsid:1 <== controller state is CONTROLLER_STATE_FAILED
NVMe drive on vSAN cluster disk management may show unhealthy/failed
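To confirm this condition from the logs, search vmkernel.log for NVMe entries reporting the failed controller state. A minimal sketch; the exact message format varies between NVMe driver versions:
# Search for NVMe events that report controller state 9 (CONTROLLER_STATE_FAILED)
grep -i "nvme" /var/log/vmkernel.log | grep -E "state: ?9"
# List the NVMe devices the host currently recognizes
esxcli nvme device list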