Troubleshooting vSAN OSA disk issues

Article ID: 326859

Products

VMware vSAN

Issue/Introduction

How a vSAN Disk Group (DG) responds to a failure depends on whether Deduplication and Compression is enabled and on which tier the failed disk belongs to:

Deduplication and Compression Enabled?   Failure Type            Symptom
Yes                                      Cache Disk Failure      The whole Disk Group will be down
Yes                                      Capacity Disk Failure   The whole Disk Group will be down
No                                       Cache Disk Failure      The whole Disk Group will be down
No                                       Capacity Disk Failure   Only the failed Disk will be down

 

Environment

VMware vSAN (OSA Cluster Model)

Resolution

Identifying whether your Disk Group is using Deduplication & Compression:

  1. Using the vCenter Web Client

  2. Using the ESXi command line (esxcli):

2.1 SSH to the affected ESXi host

2.2 Run the following command: 

esxcli vsan storage list | grep -i dedup

If Deduplication is enabled, you will see the following output for each disk in the disk group:
Deduplication: true
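
The same listing, without the grep filter, also shows which disk group each device belongs to and whether it is a cache or capacity device, which is useful when deciding which disks are impacted. The following is a hedged, abridged example; the device name and UUIDs are placeholders and the exact set of fields can vary between vSAN releases:

esxcli vsan storage list
naa.#####
   Device: naa.#####
   Is SSD: true
   VSAN UUID: 52######-####-####-####-############
   VSAN Disk Group UUID: 52######-####-####-####-############
   In CMMDS: true
   Deduplication: true
   Compression: true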

 

How to identify the failed disk?

If Deduplication & Compression is not enabled:

vCenter Web Client: Cluster --> Monitor --> vSAN --> Skyline Health --> Physical disk --> Operation health

If Deduplication & Compression is enabled:

Any disk failure will cause the whole Disk Group to go offline. Please refer to: vSAN Deduplication enabled -- Identifying Failed Disk
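
If the vSphere Client is not available, a similar check can be run from the shell of the affected host. This is a hedged sketch using esxcli vsan debug disk list; the exact field names and health strings can differ slightly between vSAN releases, and naa.##### is a placeholder. A healthy disk typically shows "In CMMDS: true" and an Operational Health of OK, while a problem disk shows a non-OK state:

esxcli vsan debug disk list | grep -iE "^naa|In CMMDS|Operational Health"
naa.#####
   In CMMDS: true
   Operational Health: OK
naa.#####
   In CMMDS: false
   Operational Health: <non-OK state>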

 

How to identify the physical location of the failed disks?

The physical disk can be located in several ways. Please refer to the articles below; a command-line example follows the list.

  1. Using Locator LEDs from vCenter and ESXi.

  2. Determining local storage device physical slot location
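
As a command-line alternative, on controllers and drivers that expose slot information the physical location can be queried directly from the affected host. This is a hedged sketch; naa.##### is a placeholder, the command is not supported by every controller, and the output shown is only illustrative:

esxcli storage core device physical get -d naa.#####
   Physical Location: enclosure 1, slot 4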

 

Considerations and requirements for replacing the drives on vSAN cluster hosts.

  1. Verify the data health on the vSAN cluster. Refer: vSAN Health Service - Data Health – vSAN Object Health

  2. vSAN is expected to start rebuilding the data on the next available disk or host, depending on the storage policy, to bring objects back into compliance. The default repair delay timer is 60 minutes, so the rebuild starts 60 minutes after the failure. If you need to change the default repair timer, please refer to the article Changing the default repair delay time for a host failure in vSAN (a command-line check is sketched after this list).

  3. Verify the requirements for disk replacement on the vSAN cluster: Requirements when replacing disks in a vSAN cluster
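
As referenced in items 1 and 2 above, the object health and the current repair delay timer can also be checked from the command line of any host in the cluster. This is a hedged sketch; the advanced option /VSAN/ClomRepairDelay and the debug command below exist on recent releases, but output formats vary:

# Summary of vSAN object health (healthy / reduced availability / inaccessible object counts)
esxcli vsan debug object health summary get

# Current repair delay timer in minutes (default 60)
esxcli system settings advanced list -o /VSAN/ClomRepairDelay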

 

Resolutions/workarounds for different vSAN OSA disk failure scenarios:

1. vSAN disk group may show as failed or report errors. The log '/var/run/log/vmkernel.log' reports READ (0x28), WRITE (0x2a) and Medium (0x3 0x11 0x0) errors.

  1. An Unrecovered Read Error (URE) is a type of medium error that occurs when the ESXi host tries to read from a bad block on the disk.
  2. A URE can occur in either the metadata region or the data region of the disk.
  3. If the URE occurs in the data region of the disk, open a case with VMware vSAN support for further assistance.
  4. If the URE occurs in the metadata region: as of ESXi/vSAN 6.7 P03 and 7.0 Update 1 and newer, a feature called autoDG Creation was introduced for All-Flash Disk Groups; vSAN Skyline Health reports the disk as unhealthy, and the bad blocks are reallocated and marked for non-use. See KB vSAN Disk Or Diskgroup Fails With Medium Errors for more details.

/var/log/vmkernel.log
2022-10-12T20:03:14.424Z cpu5:2098330)ScsiDeviceIO: 4325: Cmd(0x45be74dae040) 0x28, CmdSN 0xd65263a8 from world 0 to dev "naa.#####" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0
2022-10-12T20:46:00.107Z cpu86:2098331)ScsiDeviceIO: 4277: Cmd(0x45de6e7527c0) 0x28, CmdSN 0xcebe from world 0 to dev "naa.#####" failed H:0xc D:0x0 P:0x0

vSAN Disk Or Diskgroup Fails With Medium Errors (0x3 0x11)
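
To confirm whether medium errors are still being logged against a particular device, a simple search of the live vmkernel log can be used. A hedged sketch, with naa.##### as a placeholder:

grep "Valid sense data: 0x3 0x11" /var/run/log/vmkernel.log | grep "naa.#####"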

2. vSAN disk group may show errors or report failure. Upon checking the logs, you may see disk degradation and permanent errors, such as the events below.

egrep -i "perm|offline|unhealthy" /var/log/vobd.log
(You can also search on the disk UUID)
2022-10-12T20:03:18.694Z: [vSANCorrelator] 27997683071354us: [esx.problem.vob.vsan.lsom.devicerepair] Device ####### is in offline state and is getting repaired
2022-10-12T20:46:00.111Z: [vSANCorrelator] 28000195517670us: [vob.vsan.lsom.diskerror] vSAN device ###### is under permanent error.
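
Note that vobd.log reports the vSAN device UUID rather than the NAA device name. A hedged way to map the UUID back to the physical device is to pair up device names and vSAN UUIDs from the storage listing and match against the UUID reported in vobd.log (adjust the filter if your devices use eui. or t10. names instead of naa.):

esxcli vsan storage list | grep -iE "^(naa|eui|t10)|VSAN UUID"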
 
 

3. vSAN disk group may report failures/warnings on vSAN. Upon checking the logs, you may see read/write errors on the disks.

The /var/run/log/vmkernel.log shows events like the one below, with sense data H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.

/var/run/log/vmkernel.log:
2021-01-05T08:37:16.337Z cpu26:2098033)ScsiDeviceIO: 3047: Cmd(0x45a3e27a1700) 0x2a, CmdSN 0x2238d from world 2960707 to dev "naa.#####" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.

Please refer to the following KB for more information on the issue and the fix: vSAN Disk or group may report failure on vSAN
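
To gauge how often a particular device is failing IO with this sense data, a hedged count from the live log (naa.##### is a placeholder):

grep "Valid sense data: 0x4 0x0 0x0" /var/run/log/vmkernel.log | grep -c "naa.#####"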

4. The vSAN Dying Disk Handling (DDH) feature unmounts the bad disk or reports it as unhealthy and signals an impending failure.

The DDH feature in vSAN continuously monitors the health of disks and disk groups in order to detect an impending disk failure or a poorly performing disk group (for more information about DDH, see the articles listed at the end of this section).

DDH unmounts the disk or reports it unhealthy in the following situations:

    • High write IO latency on one of the vSAN disks.
    • Maximum log congestion threshold reached on one of the Disk Groups.
    • IMPENDING FAILURE reported on one of the vSAN disks (you can see the health status of the disk using the following command: localcli storage core device smart get -d #####).
Example:
localcli storage core device smart get -d naa.#####
 
SMART Data for Disk : naa.#####
Parameter                      Value              Threshold  Worst
-------------------------------------------------------------------
Health Status                  IMPENDING FAILURE  N/A        N/A
Media Wearout Indicator        N/A                N/A        N/A
Write Error Count              0                  N/A        N/A
Read Error Count               369                N/A        N/A
Power-on Hours                 N/A                N/A        N/A
Power Cycle Count              47                 N/A        N/A
Reallocated Sector Count       N/A                N/A        N/A
Raw Read Error Rate            N/A                N/A        N/A
Drive Temperature              30                 N/A        N/A
Driver Rated Max Temperature   N/A                N/A        N/A
Write Sectors TOT Count        N/A                N/A        N/A
Read Sectors TOT Count         N/A                N/A        N/A
Initial Bad Block Count        N/A                N/A        N/A
-------------------------------------------------------------------
 
Examples from log file: /var/run/log/vsandevicemonitord.log:
WARNING - WRITE Average Latency on VSAN device <NAA disk name> has exceeded threshold value <IO latency threshold for disk> us <# of intervals with excessive IO latency> times.
WARNING - Maximum log congestion on VSAN device <NAA disk name> <current intervals with excessive log congestion>/<intervals required to be unhealthy>
WARNING - SMART health status for disk naa.##### is IMPENDING FAILURE.
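
To check whether DDH has flagged any device on a host, the device monitor log can be searched directly. A hedged sketch:

grep -i "warning" /var/run/log/vsandevicemonitord.log | tail -20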

Refer to the articles below for detailed symptoms and resolutions.

vSAN -- DDH -- Disk Groups show as unmounted in the vSphere Web Client

Dying Disk Handling (DDH) in vSAN

vSAN hard disk health status show as Evacuated.

Inability to remove evacuated disk from Disk Group

VCF upgrade pre-check fails for the VSAN disk group due to disk failure

5. Read/write commands failing with Aborts/RETRY: H:0x5 & H:0xc

Example from log file: /var/run/log/vmkernel.log:
2022-10-21T02:50:51.069Z cpu0:2098435)ScsiDeviceIO: 3501: Cmd(0x45a203564900) 0x28, cmdId.initiator=0x45223c91a7f0 CmdSN 0xaa97f from world 0 to dev "naa.#####" failed H:0x5 D:0x0 P:0x0 Aborted at driver layer. Cmd count Active:2 Queued:0
2022-10-21T04:41:13.494Z cpu0:2098435)ScsiDeviceIO: 3463: Cmd(0x45aa8ffdedc0) 0x28, CmdSN 0x2 from world 2102512 to dev "naa.#####" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
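
A hedged way to check how frequently commands to a suspect device are being aborted or retried (naa.##### is a placeholder):

grep -E "failed H:0x5|failed H:0xc" /var/run/log/vmkernel.log | grep -c "naa.#####"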
 
Please refer to the article below for more information on these aborts.

Warning: 'Errors occurred on the disk(s) of a vSAN host' on vSAN cluster and host.

6. vSAN disk group may show a warning, and the disk is removed from the storage path.

When this is encountered, the vmkernel log will report a PDL (permanent device loss) or APD (all paths down) condition associated with a device.
The most common scenario is a disk going into PDL; Virtual SAN interprets this as a permanent condition and marks the disk as permanently unavailable, because IO fails with "not supported".

WARNING: NMP: nmp_PathDetermineFailure:2961: Cmd (0x2a) PDL error (0x5/0x25/0x0) - path vmhba2:C2:T2:L0 device naa.########- triggering path evaluation
NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x2a (0x439ee894cc00, 0) to dev "naa.########" on path "vmhba2:C2:T2:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:EVAL
LSOMCommon: IORETRYCompleteIO:495: Throttled: 0x439ee7ea0c00 IO type 304 (WRITE) isOdered:NO since 20392 msec status Not supported
WARNING: LSOM: LSOMEventNotify:6126: Virtual SAN device #####-#####-#####-#####-######### is under permanent error.
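
When PDL is suspected, the state of the device as seen by the storage stack can also be checked directly. This is a hedged sketch; naa.##### is a placeholder and the exact status strings vary, but a healthy device typically reports "on" while a PDL device typically reports a dead state:

esxcli storage core device list -d naa.##### | grep -i "Status:"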

7. NVMe Controller device failure

When an NVMe device has a controller failure event, it is marked in vmkernel.log with the controller in state 9, i.e. CONTROLLER_STATE_FAILED:

vmkernel: cpu44:2097718)NVMEPSA:1345 taskMgmt:abort cmdId.initiator=0x4309ec4bc1c0 CmdSN 0x3b65426 world:0 controller 265 state:9 nsid:1 <== controller state is CONTROLLER_STATE_FAILED

NVMe drive on vSAN cluster disk management may show unhealthy/failed
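
To see which NVMe devices the host currently recognizes (a controller in a failed state may be missing from the list or flagged), the following is a hedged sketch; the esxcli nvme command namespace differs between ESXi releases, so verify the command set on your build:

esxcli nvme device list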

8. Instructions for replacing a drive on a vSAN cluster.

Identifying and replacing a failed cache or capacity disk in vSAN OSA disk group when vSAN deduplication is enabled
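
Once the failed drive has been physically identified and the data health checks described earlier have passed, the faulty device is typically removed from vSAN before the hardware swap. The following is a heavily hedged sketch for a cluster without deduplication (with deduplication enabled the entire disk group must be removed and recreated, as described in the KB above); the device names are placeholders and the evacuation mode should be chosen based on the state of the data:

# Remove the failed capacity disk from its disk group
esxcli vsan storage remove -d naa.##### -m ensureObjectAccessibility

# After installing the replacement drive, add it back to the disk group
# (-s is the cache-tier device of the target disk group, -d the new capacity device)
esxcli vsan storage add -s naa.<cache-disk> -d naa.<new-capacity-disk>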