In this article, we will be addressing the following:
How to identify whether your Disk Group is using Deduplication & Compression?
Via vCenter Web Client: Cluster --> Configure --> vSAN --> Services --> Section "Data Services" --> Space efficiency
Via ESXi host CLI:
esxcli vsan storage list | grep -i dedup
Deduplication: true
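To map the dedup setting to its Disk Group from the same command, you can widen the filter slightly (a sketch; field labels may vary by ESXi release):
esxcli vsan storage list | grep -iE "dedup|disk group"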
How to identify the failed disk?
In case Deduplication & Compression is not enabled:
vCenter Web Client: Cluster --> Monitor --> vSAN --> Skyline Health --> Physical disk --> Operation health
In case Deduplication & Compression is enabled:
Any disk failure will cause the whole Disk Group to go offline.
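As a CLI cross-check (a minimal sketch; field names can vary slightly by ESXi release), list the vSAN-claimed devices on the affected host and look for one that is no longer published in CMMDS:
esxcli vsan storage list | grep -E "^(naa|eui|mpx)|In CMMDS"
# a healthy device reports "In CMMDS: true"; a failed one typically shows "false"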
How to verify the disk status?
To confirm whether the disk or Disk Group is currently mounted or still down: vCenter Web Client: Cluster --> Configure --> vSAN --> Disk Management
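Alternatively, from the ESXi shell, the vdq utility can summarize disk-group membership and state per vSAN disk (a sketch, assuming the standard vdq shipped with ESXi):
vdq -iH
# -i dumps the vSAN disk mappings; -H prints them in human-readable form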
How to identify the physical location of the failed disks?
In case the failed disk is not in "Absent" state:
Run the following command using the device identifier of the failed disk (typically: naa.xxxx, eui.xxx, or mpx.xxx):
esxcli storage core device physical get -d naa.xxxxx
Physical Location: enclosure 1, slot 0
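If several disks are involved, a small loop over all vSAN-claimed devices saves time (a sketch; the device-name pattern and output fields depend on your hardware and driver):
for dev in $(esxcli vsan storage list | grep -E "^(naa|eui|mpx)"); do
  echo "== $dev =="
  esxcli storage core device physical get -d "$dev"
done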
In case the failed disk is in "Absent" state:
Follow KB vSAN Deduplication enabled -- Identifying Failed Disk
Remark: esxcli storage core device physical get -d naa.xxxxx cannot be used in this case, because a disk in "Absent" state no longer presents a device identifier on the host.
How to identify the time when the disk failed and if there are any SCSI errors reported on the failed disk?
1.) Open an SSH (PuTTY) session to the ESXi host and run the following command to identify the timestamp when the disk failed:
egrep -i "perm|offline|unhealthy" /var/log/vobd.log
If the failure is older, check the rotated logs as well:
zcat /var/run/log/vobd.*.gz | egrep -i "perm|offline|unhealthy"
2.) Run the following command to identify any read/write commands failing at the timestamp collected in the previous step:
grep <disk device> /var/log/vmkernel.log
Note: If the disk didn't fail recently, or if vmkernel logging is verbose, you may need to look at older logs. To do this, run zcat /var/run/log/vmkernel.*.gz | grep <disk device>. A combined example follows below.
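Putting both steps together, a minimal sketch (the DEV value is a placeholder for the failed device identifier) that checks both the live and rotated logs in one pass:
DEV="naa.xxxxx"   # placeholder: replace with the failed device identifier
egrep -i "perm|offline|unhealthy" /var/log/vobd.log
zcat /var/run/log/vobd.*.gz | egrep -i "perm|offline|unhealthy"
grep "$DEV" /var/log/vmkernel.log
zcat /var/run/log/vmkernel.*.gz | grep "$DEV"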
Common disk failure reasons:
(1) Disk is soft failed:
Troubleshooting steps:
(2) Hardware issue: Valid sense data: 0x4 0x0 0x0
Example of SCSI error from log file /var/run/log/vmkernel.log:
2021-01-05T08:37:16.337Z cpu26:2098033)ScsiDeviceIO: 3047: Cmd(0x45a3e27a1700) 0x2a, CmdSN 0x2238d from world 2960707 to dev "naa.xxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.
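Sense key 0x4 is the SCSI HARDWARE ERROR key, so these failures point at the drive or its electronics. To gauge how many devices report it, a quick sketch using the busybox tools in the ESXi shell:
grep "Valid sense data: 0x4" /var/log/vmkernel.log | grep -oE "naa\.[0-9a-f]+" | sort | uniq -c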
(3) Medium error: Valid sense data: 0x3 0x11 0x0
Example of SCSI error from log file /var/run/log/vmkernel.log:
2022-10-12T19:36:55.253Z cpu11:2098330)ScsiDeviceIO: 4325: Cmd(0x45bea479ec40) 0x28, CmdSN 0xfaf from world 0 to dev "naa.xxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0
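Sense key 0x3 (MEDIUM ERROR) with ASC 0x11 indicates an unrecovered read error on the media. A similar per-device count (a sketch) shows whether one drive is accumulating such errors:
grep "Valid sense data: 0x3 0x11" /var/log/vmkernel.log | grep -oE "naa\.[0-9a-f]+" | sort | uniq -c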
Troubleshooting steps:
(4) The vSAN Dying Disk Handling (DDH) feature unmounts the bad disk or reports it as unhealthy
DDH unmounts the disk or reports it unhealthy in the following situations: sustained excessive read/write latency on the device, repeated log congestion, or a SMART health status indicating impending failure (see the log examples below).
To review the SMART data for a suspect disk, run:
localcli storage core device smart get -d naa.xxxxx
SMART Data for Disk: naa.xxxxx
Parameter                     Value              Threshold  Worst
-----------------------------------------------------------------
Health Status                 IMPENDING FAILURE  N/A        N/A
Media Wearout Indicator       N/A                N/A        N/A
Write Error Count             0                  N/A        N/A
Read Error Count              369                N/A        N/A
Power-on Hours                N/A                N/A        N/A
Power Cycle Count             47                 N/A        N/A
Reallocated Sector Count      N/A                N/A        N/A
Raw Read Error Rate           N/A                N/A        N/A
Drive Temperature             30                 N/A        N/A
Driver Rated Max Temperature  N/A                N/A        N/A
Write Sectors TOT Count       N/A                N/A        N/A
Read Sectors TOT Count        N/A                N/A        N/A
Initial Bad Block Count       N/A                N/A        N/A
-----------------------------------------------------------------
Examples from log file /var/run/log/vsandevicemonitord.log:
WARNING - WRITE Average Latency on VSAN device <NAA disk name> has exceeded threshold value <IO latency threshold for disk> us <# of intervals with excessive IO latency> times.
WARNING - Maximum log congestion on VSAN device <NAA disk name> <current intervals with excessive log congestion>/<intervals required to be unhealthy>
WARNING - SMART health status for disk naa.xxxxx is IMPENDING FAILURE.
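To pull all recent DDH warnings for the host in one pass, a sketch that also checks the rotated logs (assuming the default log locations):
grep -i "WARNING" /var/run/log/vsandevicemonitord.log
zcat /var/run/log/vsandevicemonitord.*.gz 2>/dev/null | grep -i "WARNING"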
Troubleshooting steps:
(5) Read/write commands failing with Aborts/RETRY: H:0x5 & H:0xc
Example from log file: /var/run/log/vmkernel.log:
2022-10-21T02:50:51.069Z cpu0:2098435)ScsiDeviceIO: 3501: Cmd(0x45a203564900) 0x28, cmdId.initiator=0x45223c91a7f0 CmdSN 0xaa97f from world 0 to dev "naa.xxxx" failed H:0x5 D:0x0 P:0x0 Aborted at driver layer. Cmd count Active:2 Queued:0
2022-10-21T04:41:13.494Z cpu0:2098435)ScsiDeviceIO: 3463: Cmd(0x45aa8ffdedc0) 0x28, CmdSN 0x2 from world 2102512 to dev "naa.xxxxx" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
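As the heading indicates, H:0x5 marks a command aborted at the driver layer and H:0xc a retry condition; both come from the storage stack rather than the disk's own sense data, so the HBA, firmware, and cabling are worth reviewing as well. A sketch to rank the affected devices:
grep -E "H:0x5|H:0xc" /var/log/vmkernel.log | grep -oE "naa\.[0-9a-f]+" | sort | uniq -c | sort -rn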
Troubleshooting steps:
(6) The disk was removed from the storage path.
When this is encountered, the vmkernel log will report a PDL (permanent device loss) or APD (all paths down) condition associated with a device.
The most common scenario is a disk going into PDL. vSAN interprets this as a permanent condition and marks the disk as permanently unavailable, because I/O to the device fails with "not supported", as in the following log excerpts:
WARNING: NMP: nmp_PathDetermineFailure:2961: Cmd (0x2a) PDL error (0x5/0x25/0x0) - path vmhba2:C2:T2:L0 device naa.600605b0099250d01da7c6a019312a53 - triggering path evaluation
NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x2a (0x439ee894cc00, 0) to dev "naa.600605b0099250d01da7c6a019312a53" on path "vmhba2:C2:T2:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:EVAL
LSOMCommon: IORETRYCompleteIO:495: Throttled: 0x439ee7ea0c00 IO type 304 (WRITE) isOdered:NO since 20392 msec status Not supported
WARNING: LSOM: LSOMEventNotify:6126: Virtual SAN device 52a89ef4-6c3f-d16b-5347-c5354470b465 is under permanent error.
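To confirm the PDL/APD timeline on the host, a sketch that searches the live and rotated vmkernel logs for the characteristic markers:
egrep -i "PDL|APD|permanent" /var/log/vmkernel.log
zcat /var/run/log/vmkernel.*.gz | egrep -i "PDL|APD|permanent"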