How to troubleshoot vSAN OSA disk issues

Article ID: 326859


Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

How a vSAN Disk Group (DG) responds to a disk failure depends on whether the Deduplication & Compression feature is enabled:

When Deduplication & Compression is enabled on the affected Disk Group:

  • Cache Disk failure --> The whole Disk Group will be down
  • Capacity Disk failure --> The whole Disk Group will be down

When Deduplication & Compression is not enabled:

  • Cache Disk failure --> The whole Disk Group will be down
  • Capacity Disk failure --> Only the failed Disk will be down

Environment

VMware vSAN (OSA Cluster Model)

 

Cause

 

Resolution

This article addresses the following:

 

How to identify whether your Disk Group is using Deduplication & Compression?

Via vCenter Web Client: Cluster --> Configure --> vSAN --> Services --> Section "Data Services" --> Space efficiency

 
Via ESXi SSH/Putty Session: Run the following command: 
esxcli vsan storage list | grep -i dedup
 
If Deduplication is enabled you will see the following output for a Disk:  
Deduplication: true
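
For a broader view, the same command without the grep filter lists every disk claimed by vSAN on the host together with its Disk Group membership and data-service settings. A trimmed sketch of typical per-disk output is shown below (values are placeholders and the exact field list varies by vSAN version):

esxcli vsan storage list
naa.xxxxx
   Device: naa.xxxxx
   Display Name: naa.xxxxx
   Is SSD: true
   VSAN UUID: 52xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
   VSAN Disk Group UUID: 52xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
   Used by this host: true
   In CMMDS: true
   Deduplication: true
   Compression: true
   Is Capacity Tier: true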

 

How to identify the failed disk?

In case Deduplication & Compression is not enabled: 

vCenter Web Client: Cluster --> Monitor --> vSAN --> Skyline Health --> Physical disk --> Operation health

In case Deduplication & Compression is enabled: 

Any disk failure will cause the whole Disk Group to go offline.
In this situation, scroll to the right to view the "Operational State Description" column to see which disk has failed. The disk marked as "Permanent disk failure" is the failed disk.
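
If the Web Client is unavailable, a rough CLI alternative is to list the vSAN disks on the affected host and look for any disk that is no longer reported healthy or published in CMMDS (a minimal sketch; the esxcli vsan debug namespace is available on recent vSAN releases and field names can vary slightly between versions):

esxcli vsan debug disk list | grep -iE "Name:|Health|In CMMDS"
esxcli vsan storage list | grep -iE "Device:|In CMMDS"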
 

 

How to verify the disk status?

To confirm whether the Disk or Disk Group is currently mounted or still down, go to: vCenter Web Client: Cluster --> Configure --> vSAN --> Disk Management
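
The same check can be approximated from the ESXi shell: a mounted, healthy vSAN disk is normally reported as used by the host and published in CMMDS (a sketch to be read alongside Disk Management / Skyline Health rather than on its own):

esxcli vsan storage list | grep -iE "Device:|Used by this host|In CMMDS"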

 


How to identify the physical location of the failed disks?

In case the failed disk is not in "Absent" state:

Run the following command using the Device identifier of the failed disk (typically: naa.xxxx, eui.xxx, or mpx.xxx) 
Example:
esxcli storage core device physical get -d naa.xxxxx
   Physical Location: enclosure 1, slot 0 
 
Alternatively: vCenter Web Client: Host with failed disk --> Configure --> Storage devices --> Select the failed disk --> "TURN ON LED"
Note: This doesn't always work, as it depends on whether the vSAN controller firmware supports this feature. In such a case, consider engaging your HW Vendor for assistance.
 



In case the failed disk is in "Absent" state:

Follow KB vSAN Deduplication enabled -- Identifying Failed Disk

Remark:
If the above KB doesn't help with identifying the failed disk, you can identify the physical slot in this state by clicking "TURN ON LED" for all the working disks,
or by identifying the physical slots of all the working disks (by running esxcli storage core device physical get -d xxxxx, as sketched below),
so that the physical locations of all the working disks are eliminated and the physical location of the failed disk can be concluded.
As previously mentioned, "TURN ON LED" doesn't always work, as it depends on whether the vSAN controller firmware supports this feature.
In such a case, consider engaging your HW Vendor for assistance.
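
A small loop like the one below prints the reported physical location of every disk vSAN has claimed on the host, which makes the elimination exercise quicker. This is a minimal sketch that assumes the usual "Device:" field in the esxcli vsan storage list output:

# Print the enclosure/slot for each vSAN-claimed disk on this host
for dev in $(esxcli vsan storage list | awk '/ Device: /{print $2}'); do
   echo "=== $dev ==="
   esxcli storage core device physical get -d "$dev"
done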
 

How to respond?

1.) Make sure there are no inaccessible objects. Check via: vCenter Web Client: Cluster --> Monitor --> vSAN --> Virtual Objects
 
 
 
 
Alternatively log into any of the vSAN Hosts via SSH/Putty and run the following command:
esxcli vsan debug object health summary get
In case there are inaccessible objects, open a case with VMware by Broadcom Support 
 
 
2.) Based on the cause of the disk failure, an activity might need to be performed on the Host.
Examples: reboot of the Host, replace/re-insert the failed disk, or recreate the Disk Group.
As a best practice, place the Host in Maintenance Mode with "Ensure accessibility" before proceeding with any activity
(Note: Run the Data Evacuation pre-check to understand whether data will need to be migrated).
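
If you prefer to do this from the ESXi shell, the host can be placed into (and taken out of) Maintenance Mode with "Ensure accessibility" roughly as follows; note that the CLI does not run the Data Evacuation pre-check for you, so running the pre-check from the vSphere Client first is still recommended:

# Enter Maintenance Mode using the vSAN "ensure accessibility" data handling mode
esxcli system maintenanceMode set -e true -m ensureObjectAccessibility
# Exit Maintenance Mode once the activity is complete
esxcli system maintenanceMode set -e false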

 

Important note:
If you place the Host in Maintenance Mode with "Ensure accessibility" for more than the default Repair time of 60 minutes,
vSAN will start a re-sync operation to rebuild the data residing on all Disk Groups on that host.
If the activity will take longer than the configured Repair time, you can temporarily increase the Repair time to avoid an unneeded re-sync operation.
(To change the Repair time, please see KB Changing the default repair delay time for a host failure in vSAN.)
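
On releases where the repair delay is still controlled by the /VSAN/ClomRepairDelay advanced option (as described in the KB above), it can be checked and temporarily raised per host roughly as sketched below; newer versions expose this as the "Object repair timer" in the cluster's vSAN services settings instead, so confirm which method applies to your build and revert the change after the activity:

# Check the current repair delay (in minutes)
esxcli system settings advanced list -o /VSAN/ClomRepairDelay
# Temporarily raise it, for example to 120 minutes (apply consistently on all hosts in the cluster)
esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 120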

 

How to identify the time when the disk failed and if there are any SCSI errors reported on the failed disk?

1.) Open SSH/Putty session to the ESXi host and run the following command to identify the time-stamp when the disk failed:

egrep -i "perm|offline|unhealthy" /var/log/vobd.log
(You can also search on the disk UUID)
2022-10-12T20:03:18.694Z: [vSANCorrelator] 27997683071354us: [esx.problem.vob.vsan.lsom.devicerepair] Device xxxxxxxx is in offline state and is getting repaired
2022-10-12T20:46:00.111Z: [vSANCorrelator] 28000195517670us: [vob.vsan.lsom.diskerror] vSAN device xxxxxxxx is under permanent error.

Note: If the disk hasn't failed recently, or if vobd is chatty, you may need to look at older (rotated) logs. To do this, run: zcat /var/run/log/vobd.*.gz | egrep -i "perm|offline|unhealthy"

2.) Run the following command to identify any read/write commands failing at the same time stamp collected from the previous step: 

grep <disk device> /var/log/vmkernel.log
2022-10-12T20:03:14.424Z cpu5:2098330)ScsiDeviceIO: 4325: Cmd(0x45be74dae040) 0x28, CmdSN 0xd65263a8 from world 0 to dev "naa.xxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0
2022-10-12T20:46:00.107Z cpu86:2098331)ScsiDeviceIO: 4277: Cmd(0x45de6e7527c0) 0x28, CmdSN 0xcebe from world 0 to dev "naa.xxxxxxxx" failed H:0xc D:0x0 P:0x0

Note: If the disk hasn't failed recently, or if vmkernel is chatty, you may need to look at older (rotated) logs. To do this, run: zcat /var/run/log/vmkernel.*.gz | grep <disk device>

 

Common disk failure reasons:

(1) Disk is soft failed:
 
Troubleshooting steps:

  • Check KVM (iDRAC, iLO) for any issues with disks/controller
  • Check the logs for any SCSI error codes
  • Check the controller driver/firmware to ensure they are not down-rev or in an unsupported combination: run esxcli vsan debug controller list -v=true, then check the vSAN HCL; if the driver/firmware are down-rev or not in a supported combination, upgrade them (see the example after this list).
  • If no issues are found in the KVM and there are no SCSI error codes, then this is a soft fail of the disk, and a reboot of the host may bring the disk(s)/DG back online: place the host into Maintenance Mode with "Ensure accessibility", then reboot the host. If the disk(s)/DG comes back, no further action is needed; otherwise, engage the hardware vendor for replacement/further assistance.
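
For reference, the commands below show where the controller, driver, and firmware details can be pulled from the host to compare against the vSAN HCL (output fields vary by controller and release; <driver_name> is a placeholder for whatever driver the adapter list reports):

# vSAN's view of the controllers it is using, including driver details
esxcli vsan debug controller list -v=true
# Cross-check the adapter and its driver name
esxcli storage core adapter list
# Version of the loaded driver module
vmkload_mod -s <driver_name> | grep -i version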

 

(2) Hardware issue: Valid sense data: 0x4 0x0 0x0

Example of SCSI error from log file: /var/run/log/vmkernel.log:
2021-01-05T08:37:16.337Z cpu26:2098033)ScsiDeviceIO: 3047: Cmd(0x45a3e27a1700) 0x2a, CmdSN 0x2238d from world 2960707 to dev "naa.xxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x0 0x0.

Troubleshooting steps:
  • Disk needs to be replaced by the Hardware vendor

 

(3) Medium error: Valid sense data: 0x3 0x11 0x0

  • Unrecovered Read Error (URE) is a type of medium error that occurs when the ESXi host tries to read from a bad block on the disk.
  • A URE can occur in the metadata region or the data region of the disk.
  • If a URE occurs in the data region of the disk, open a case with VMware vSAN support for further assistance.
  • If a URE occurs in the metadata region: as of ESXi/vSAN 6.7 P03 and 7.0 Update 1 and newer, a feature called autoDG Creation was introduced for All-Flash DGs; vSAN Skyline Health reports the disk as unhealthy and the bad blocks are marked for non-use. See KB vSAN Disk Or Diskgroup Fails With Medium Errors for more details.

Example of SCSI error from log file: /var/run/log/vmkernel.log:
2022-10-12T19:36:55.253Z cpu11:2098330)ScsiDeviceIO: 4325: Cmd(0x45bea479ec40) 0x28, CmdSN 0xfaf from world 0 to dev "naa.xxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0
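
To confirm how often this medium-error sense key is being reported against the suspect device, the current and rotated vmkernel logs can be searched as below (a minimal example; substitute your device identifier):

grep "0x3 0x11" /var/log/vmkernel.log | grep naa.xxxx
zcat /var/run/log/vmkernel.*.gz | grep "0x3 0x11" | grep naa.xxxx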

Troubleshooting steps:

  • In case "Hybrid vSAN" is used: Disks are HDDs, then the bad disk will need to be replaced by the Hardware vendor.

 

(4) vSAN Dying Disk Handling (= DDH) feature unmounts the bad disk or reports it unhealthy

The DDH feature in vSAN continuously monitors the health of disks and disk groups in order to detect an impending disk failure or a poorly performing disk group.

DDH unmounts the disk or reports it unhealthy in the following situations:

  • High write IO latency on one of the vSAN disks.
  • Maximum Log congestion threshold reached on one of the Disk Groups.
  • IMPENDING FAILURE reported on one of the vSAN disks (the health status of the disk can be checked using the following command: localcli storage core device smart get -d xxxx)
Example:
localcli storage core device smart get -d naa.xxxxx
 
SMART Data for Disk : naa.xxxxx
Parameter                       Value              Threshold  Worst
--------------------------------------------------------------------
Health Status                   IMPENDING FAILURE  N/A        N/A
Media Wearout Indicator         N/A                N/A        N/A
Write Error Count               0                  N/A        N/A
Read Error Count                369                N/A        N/A
Power-on Hours                  N/A                N/A        N/A
Power Cycle Count               47                 N/A        N/A
Reallocated Sector Count        N/A                N/A        N/A
Raw Read Error Rate             N/A                N/A        N/A
Drive Temperature               30                 N/A        N/A
Driver Rated Max Temperature    N/A                N/A        N/A
Write Sectors TOT Count         N/A                N/A        N/A
Read Sectors TOT Count          N/A                N/A        N/A
Initial Bad Block Count         N/A                N/A        N/A
--------------------------------------------------------------------

 
Examples from log file: /var/run/log/vsandevicemonitord.log:
WARNING - WRITE Average Latency on VSAN device <NAA disk name> has exceeded threshold value <IO latency threshold for disk> us <# of intervals with excessive IO latency> times.
WARNING - Maximum log congestion on VSAN device <NAA disk name> <current intervals with excessive log congestion>/<intervals required to be unhealthy>
WARNING - SMART health status for disk naa.xxxxx is IMPENDING FAILURE.
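
The warnings above can be pulled directly from the device monitor log on the affected host, including any rotated copies (a minimal example):

grep -i warning /var/run/log/vsandevicemonitord.log
zcat /var/run/log/vsandevicemonitord.*.gz | grep -i warning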
 
Troubleshooting steps:

  • Check if the failed disk is facing any hardware or medium errors (reference the above steps).
  • Run esxtop on the host with the failed disk, press "u" (disk device view), and check the "DAVG" value for the failed disk to see if any high latency is reported on that disk. If high latency is seen, engage the hardware vendor. For more information on how to check DAVG using esxtop, reference KB Using esxtop to identify storage performance issues for ESX / ESXi (multiple versions).
  • Check the compatibility of the controller driver and firmware, and also check if the disk is a vSAN supported device and if its firmware version is supported (Reference vSAN HCL Link).
  • If there are no compatibility issues, engage your Hardware vendor to check for any firmware issues on the controller or disks.


(5) Read/write commands failing with Aborts/RETRY: H:0x5 & H:0xc

Example from log file: /var/run/log/vmkernel.log:
2022-10-21T02:50:51.069Z cpu0:2098435)ScsiDeviceIO: 3501: Cmd(0x45a203564900) 0x28, cmdId.initiator=0x45223c91a7f0 CmdSN 0xaa97f from world 0 to dev "naa.xxxx" failed H:0x5 D:0x0 P:0x0 Aborted at driver layer. Cmd count Active:2 Queued:0
2022-10-21T04:41:13.494Z cpu0:2098435)ScsiDeviceIO: 3463: Cmd(0x45aa8ffdedc0) 0x28, CmdSN 0x2 from world 2102512 to dev "naa.xxxxx" failed H:0xc D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
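
A rough way to gauge how frequently the device is hitting these aborts (H:0x5) and retries (H:0xc) is to count the matching entries in the vmkernel log (a minimal example; substitute your device identifier):

grep "naa.xxxx" /var/log/vmkernel.log | grep -c "H:0x5"
grep "naa.xxxx" /var/log/vmkernel.log | grep -c "H:0xc"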

Troubleshooting steps:

  • Check the compatibility of the controller driver and firmware, and also check if the disk is a vSAN supported device and if its firmware version is supported (Reference vSAN HCL Link).
  • If there are no compatibility issues, engage the hardware vendor to check for any firmware issues on the controller or disks.

(6) The disk was removed from the storage path.

When this is encountered, the vmkernel log will report a PDL (permanent device loss) or APD (all paths down) condition associated with the device.
The most common scenario is a disk going into PDL; vSAN interprets this as a permanent condition and marks the disk as permanently unavailable, as I/O fails with "not supported":
WARNING: NMP: nmp_PathDetermineFailure:2961: Cmd (0x2a) PDL error (0x5/0x25/0x0) - path vmhba2:C2:T2:L0 device naa.600605b0099250d01da7c6a019312a53 - triggering path evaluation
NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x2a (0x439ee894cc00, 0) to dev "naa.600605b0099250d01da7c6a019312a53" on path "vmhba2:C2:T2:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:EVAL
LSOMCommon: IORETRYCompleteIO:495: Throttled: 0x439ee7ea0c00 IO type 304 (WRITE) isOdered:NO since 20392 msec status Not supported
WARNING: LSOM: LSOMEventNotify:6126: Virtual SAN device 52a89ef4-6c3f-d16b-5347-c5354470b465 is under permanent error.
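
To see how the storage stack currently views the device, the device list output can be inspected; a device in PDL is typically reported with a dead status (a sketch; substitute your device identifier):

esxcli storage core device list -d naa.xxxx | grep -iE "Display Name|Status|Is Offline"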



 

Additional Information