VMware vSAN disk encounters medium errors but is not failed out by vSAN

Products

VMware vSAN

Issue/Introduction

In the vmkernel log, messages similar to the following appear:

2022-12-24T10:46:46.161Z cpu1:2098055)ScsiDeviceIO: 4176: Cmd(0x45be86a85348) 0x28, CmdSN 0x6cc49ca2 from world 0 to dev "naa.xxxxxxxxxxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 Medium Error, LBA: 1318295484

2022-12-24T10:46:46.663Z cpu8:2098047)ScsiDeviceIO: 4176: Cmd(0x45bec4adb1c8) 0x28, CmdSN 0x6cc4a979 from world 0 to dev "naa.xxxxxxxxxxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 Medium Error, LBA: 1318296508

2022-12-24T10:46:46.679Z cpu15:2098047)HPP: HppThrottleLogForDevice:1109: Error status H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. from device naa.xxxxxxxxxxxxxxxx repeated 1 times,  hppAction = 1

2022-12-24T10:46:46.679Z cpu15:2098047)WARNING: HPP: HppThrottleLogForDevice:1144: Error status H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. hppAction = 1

2022-12-24T10:46:46.679Z cpu15:2098047)ScsiDeviceIO: 4176: Cmd(0x45be57ecd648) 0x28, CmdSN 0x6cc4a985 from world 0 to dev "naa.xxxxxxxxxxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 Medium Error, LBA: 1318295484

2022-12-24T10:46:47.180Z cpu14:2098051)ScsiDeviceIO: 4176: Cmd(0x45be6e1c4488) 0x28, CmdSN 0x6cc4b59e from world 0 to dev "naa.xxxxxxxxxxxxxxxx" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 Medium Error, LBA: 1318296508
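The fields in the ScsiDeviceIO lines above can be decoded programmatically. The following is a minimal Python sketch, not a VMware tool: the function name `parse_medium_error` and the lookup tables are illustrative assumptions, and the regular expression is written only against the sample lines in this article. The decoded names follow the standard SCSI definitions (opcode 0x28 is READ(10), device status 0x2 is CHECK CONDITION, sense key 0x3 is MEDIUM ERROR, and ASC/ASCQ 0x11/0x00 is UNRECOVERED READ ERROR).

```python
import re

# Regex written against the sample ScsiDeviceIO lines in this article (assumption:
# other ESXi builds may format the line differently).
SCSI_FAILURE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<opcode>0x[0-9a-f]+), .*?'
    r'to dev "(?P<device>[^"]+)" failed '
    r'H:(?P<host>0x[0-9a-f]+) D:(?P<dev_status>0x[0-9a-f]+) P:(?P<plugin>0x[0-9a-f]+) '
    r'Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)'
    r'(?:.*?LBA: (?P<lba>\d+))?'
)

# Small lookup tables covering only the codes seen in this article.
OPCODES = {0x28: "READ(10)"}
DEVICE_STATUS = {0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}


def parse_medium_error(line):
    """Return a dict describing a medium error line, or None if the line is not one."""
    m = SCSI_FAILURE.search(line)
    if m is None:
        return None
    key = int(m.group("key"), 16)
    if key != 0x3:  # only sense key 0x3 (MEDIUM ERROR) is of interest here
        return None
    opcode = int(m.group("opcode"), 16)
    status = int(m.group("dev_status"), 16)
    asc, ascq = int(m.group("asc"), 16), int(m.group("ascq"), 16)
    return {
        "device": m.group("device"),
        "opcode": OPCODES.get(opcode, hex(opcode)),
        "device_status": DEVICE_STATUS.get(status, hex(status)),
        "sense_key": SENSE_KEYS[key],
        "additional_sense": ASC_ASCQ.get((asc, ascq), f"{asc:#x}/{ascq:#x}"),
        "lba": int(m.group("lba")) if m.group("lba") else None,
    }
```

Passing one of the sample lines to `parse_medium_error` yields the device identifier and the failing LBA, which can be compared across lines to see whether the same blocks fail repeatedly.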

Medium errors can also lead to checksum errors for objects if they are not handled immediately:

2023-01-08T05:31:52.022Z cpu27:2099333)WARNING: LSOM: LSOMReadVerifyChecksum:4397: Throttled: Checksum error detected on component 3134ba63-####-####-####-########910, comp offset 210520969216 (computed CRC 0x0 != saved CRC 0x81bf8868 (faked: Y)

2023-01-08T05:31:56.297Z cpu19:2099333)WARNING: LSOM: LSOMScrubReadComplete:2827: Throttled: Checksum error detected on component 3134ba63-####-####-####-########910, comp offset 210520969216 (computed CRC 0x0 != saved CRC 0x81bf8868 (faked: Y))

2023-01-08T05:32:02.042Z cpu7:2099333)WARNING: LSOM: LSOMReadVerifyChecksum:4397: Throttled: Checksum error detected on component 3134ba63-####-####-####-########910, comp offset 210520969216 (computed CRC 0x0 != saved CRC 0x81bf8868 (faked: Y))

2023-01-05T19:03:27.576Z: [vSANCorrelator] 4820779458341us: [vob.vsan.dom.unrecoverableerror] vSAN detected an unrecoverable medium or checksum error for component 9c0cb35e-####-####-####-########610 on disk group 523a74b9-####-####-####-########f09.

2022-10-23T03:15:13.877Z: [vSANCorrelator] 22532796272us: [vob.vsan.dom.errorfixed] vSAN detected and fixed a medium or checksum error for component 938c5363-####-####-####-########610 on disk group 52106776-####-####-####-########862.
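To separate errors that vSAN repaired from those it could not repair, the two vSANCorrelator event types above can be grouped by component. This is a rough Python sketch under the assumption that the events keep the exact wording shown in these samples; the function name `classify_vob_events` is hypothetical.

```python
import re

# Matches only the two event IDs quoted in this article.
VOB_EVENT = re.compile(
    r"\[vob\.vsan\.dom\.(?P<kind>unrecoverableerror|errorfixed)\]"
    r".*? for component (?P<component>\S+) on disk group (?P<disk_group>\S+?)\.?$"
)


def classify_vob_events(lines):
    """Group (component, disk group) pairs by event kind."""
    result = {"unrecoverableerror": [], "errorfixed": []}
    for line in lines:
        m = VOB_EVENT.search(line.strip())
        if m:
            result[m.group("kind")].append(
                (m.group("component"), m.group("disk_group"))
            )
    return result
```

Components listed under `unrecoverableerror` are the ones that need attention first, since vSAN was unable to repair them from another replica.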

This results in the guest operating system experiencing I/O errors and in backup jobs, snapshot operations, and cloning operations failing.

Note: If the medium errors are detected in the metadata region, vSAN fails out the disk as described in KB "vSAN Disk Or Disk Group Fails With Medium Errors" (81121). When the medium errors are in the data region of the disk, vSAN does not fail out the disk.

Environment

VMware vSAN 8.0.x
VMware vSAN 7.0.x
VMware vSAN 6.x

Resolution

To be alerted to potential medium errors in your environment, create an Alert Query in Log Insight that sends you email notifications, or configure an equivalent alert in any third-party log monitoring application.

Note: There is no method to prevent the logical or physical failure of disk blocks as disks degrade over time. It is therefore best to be proactive: have log monitoring tools watch the logs for these types of events, and take immediate action when they occur.
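Where no log monitoring product is available, the same idea can be approximated with a small script. The sketch below is an illustration only, not a Log Insight query: it counts distinct failing LBAs per device in vmkernel log lines and reports devices at or above a threshold. The function name `devices_to_alert` and the `min_distinct_lbas` threshold are assumptions made for this example, and the pattern matches only the sense data shown in this article (0x3 0x11 0x0).

```python
import re
from collections import defaultdict

# Matches the medium error signature from the sample vmkernel lines.
MEDIUM_ERROR = re.compile(
    r'to dev "(?P<device>[^"]+)" failed .*?'
    r"Valid sense data: 0x3 0x11 0x0.*?LBA: (?P<lba>\d+)"
)


def devices_to_alert(lines, min_distinct_lbas=1):
    """Map each device to its sorted failing LBAs if it meets the threshold.

    min_distinct_lbas is a hypothetical tuning knob, not a VMware setting.
    """
    lbas = defaultdict(set)
    for line in lines:
        m = MEDIUM_ERROR.search(line)
        if m:
            lbas[m.group("device")].add(int(m.group("lba")))
    return {
        device: sorted(found)
        for device, found in lbas.items()
        if len(found) >= min_distinct_lbas
    }
```

A wrapper could read the vmkernel log, call `devices_to_alert` on its lines, and send a notification for every device returned. Counting distinct LBAs rather than raw lines avoids over-weighting retries against the same block.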

Set the scrubber to check more frequently by running esxcfg-advcfg -s <value> /VSAN/ObjectScrubsPerYear. The default value is 1, which scrubs once a year.

For example, you can set the value to 2 for twice a year or 4 for quarterly. VMware recommends not going any higher than 12, which would scrub once a month, to avoid any potential performance issues in the environment. If performance issues are encountered, decrease the value until the optimal balance between performance and scrubbing frequency is found for the environment.

Note: Increasing the scrubber value consumes more CPU and memory, which could introduce latency in the environment, so do not set the scrubber to run too aggressively.
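The relationship between the ObjectScrubsPerYear value and the resulting scrub interval is simple arithmetic. The Python sketch below illustrates it and builds the command string from this article while rejecting values above the recommended maximum of 12. The helper names are hypothetical, and the script only produces the command text; it does not run anything on a host.

```python
ADVANCED_OPTION = "/VSAN/ObjectScrubsPerYear"  # option path taken from this article
MAX_RECOMMENDED = 12  # highest value recommended in this article (monthly scrub)


def scrub_interval_days(scrubs_per_year):
    """Approximate number of days between scrubs for a given setting."""
    return 365 / scrubs_per_year


def build_set_command(scrubs_per_year):
    """Return the esxcfg-advcfg command line for the requested value."""
    if not isinstance(scrubs_per_year, int) or not 1 <= scrubs_per_year <= MAX_RECOMMENDED:
        raise ValueError(
            f"value must be an integer between 1 and {MAX_RECOMMENDED}"
        )
    return f"esxcfg-advcfg -s {scrubs_per_year} {ADVANCED_OPTION}"
```

For example, `build_set_command(4)` returns the command for a quarterly scrub, and `scrub_interval_days(4)` gives roughly 91 days between scrubs.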

If medium errors are encountered in the data region of the disk and vSAN did not fail out the disk, do one of the following:

The preferred method is to contact the hardware vendor and have the disk replaced.
Otherwise:
1. If Deduplication is not enabled:
  - Remove the disk with medium errors from the disk group and then add it back in. When the disk is added back to the disk group, the disk automatically reallocates the bad blocks so that they are not used again.
2. If Deduplication is enabled:
  - Remove the disk group that contains the disk with medium errors from the host and recreate the disk group. When the disk group is recreated, the disk automatically reallocates the bad blocks so that they are not used again.
Note: Either option results in a resync to rebuild data, because the disk or disk group is removed and then added back or recreated.