VMware vSAN disk encounters medium errors but is not failed out by vSAN

Products

VMware vSAN

Issue/Introduction

Symptoms:

Unable to take a snapshot. The following error is seen in vCenter > VM > Tasks:
"An error occurred while saving the snapshot: msg.changetracker.MIRRORCOPYSTATUS. An error occurred while taking a snapshot: msg.changetracker.MIRRORCOPYSTATUS"
Backup jobs for one or more VMs (that is, one or more VMDKs) fail with one or more of the following messages:

Virtual disk for querying changed areas cannot be accessed. SOAP 1.1 fault "":ServerFaultCode[no subcode]

"Error caused by file /vmfs/volumes/vsan:########-#######/########-####-####-#########/####.vmdk" Detail:

Backup task failed with error: type: kVixError error_msg: "[1-4-212] [Code 1] Unknown error" detailed_error_msg

{ message_guid: "1-4-212" short_message_string: "VirtualDiskError" }

Query changed areas for disk #### failed with error [[kVSphereError]: Virtual disk for querying changed areas cannot be accessed.

SOAP 1.1 fault "":ServerFaultCode[no subcode] "Error caused by file /vmfs/volumes/vsan:########-#######/########-####-####-#########/####.vmdk"

Detail:], previous_change_id [## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##/4073], backing up entire disk

Query changed areas for disk #### (filePath: [vSANDatastore] ########-#######/####.vmdk) with capacity: 107374182400

and previous_change_id [## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##/4073] returned total number of disk areas: 1 total disk area size: 107374182400

Querying VM disk [vSANDatastore] ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ##/####.vmdk for allocated blocks

Resync of an object does not complete because the resync is stuck in a loop trying to read from bad disk sectors.

Validation:

In the /var/log/vmkernel.log, one or more of the following messages are observed:
YYYY-MM-DDTHH:MM:SS cpu1:2098055)ScsiDeviceIO: 4176: Cmd(0x45be86a85348) 0x28, CmdSN 0x6cc49ca2 from world 0 to dev "################" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 Medium Error, LBA: 1318295484

YYYY-MM-DDTHH:MM:SS cpu15:2098047)HPP: HppThrottleLogForDevice:1109: Error status H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. from device ################ repeated 1 times, hppAction = 1

YYYY-MM-DDTHH:MM:SS cpu15:2098047)WARNING: HPP: HppThrottleLogForDevice:1144: Error status H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. hppAction = 1

YYYY-MM-DDTHH:MM:SS cpu4:2106057 opID=bc9005a)Partition: 433: Failed read for "################": I/O error

YYYY-MM-DDTHH:MM:SS cpu4:2106057 opID=bc9005a)Partition: 1109: Failed to read protective mbr on "################" : I/O error

YYYY-MM-DDTHH:MM:SS cpu4:2106057 opID=bc9005a)WARNING: Partition: 1262: Partition table read from device ################ failed: I/O error

YYYY-MM-DDTHH:MM:SS cpu47:2099121)ScsiDeviceIO: 4062: Cmd(0x45a1d8a06580) 0x28, CmdSN 0x1 from world 2106019 to dev "################" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x31 0x0.
These medium errors can also lead to checksum errors for objects, which are logged with messages such as the following:
YYYY-MM-DDTHH:MM:SS cpu27:2099333)WARNING: LSOM: LSOMReadVerifyChecksum:4397: Throttled: Checksum error detected on component ####-####-####-####-########, comp offset ######## (computed CRC 0x0 != saved CRC ###### (faked: Y)
YYYY-MM-DDTHH:MM:SS cpu19:2099333)WARNING: LSOM: LSOMScrubReadComplete:2827: Throttled: Checksum error detected on component ####-####-####-####-########, comp offset ######## (computed CRC 0x0 != saved CRC ###### (faked: Y))
YYYY-MM-DDTHH:MM:SS [vSANCorrelator] 4820779458341us: [vob.vsan.dom.unrecoverableerror] vSAN detected an unrecoverable medium or checksum error for component ####-####-####-####-######## on disk group ######-####-####-####-########.
YYYY-MM-DDTHH:MM:SS [vSANCorrelator] 22532796272us: [vob.vsan.dom.errorfixed] vSAN detected and fixed a medium or checksum error for component ####-####-####-####-######## on disk group ######-####-####-####-########

This results in the guest operating system experiencing I/O errors and in backup, snapshot, resync, and cloning operations failing.
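As a rough illustration of how these events could be picked out of a collected log, the following Python sketch extracts the component and disk group from the two vSANCorrelator messages shown above. The function name parse_vob_event and the regular expression are assumptions made for this example and are not part of any VMware tool. The pattern is based only on the sample lines in this article.

```python
import re

# Pattern derived from the two vSANCorrelator sample lines in this article.
# The identifiers are matched loosely because the samples are masked with '#'.
VOB_RE = re.compile(
    r"\[vob\.vsan\.dom\.(?P<event>unrecoverableerror|errorfixed)\]"
    r".* for component (?P<component>\S+)"
    r" on disk group (?P<disk_group>[^\s.]+)"
)


def parse_vob_event(line):
    """Return event type, component and disk group, or None if no match.

    Hypothetical helper for illustration only.
    """
    match = VOB_RE.search(line)
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    sample = (
        "YYYY-MM-DDTHH:MM:SS [vSANCorrelator] 4820779458341us: "
        "[vob.vsan.dom.unrecoverableerror] vSAN detected an unrecoverable "
        "medium or checksum error for component ####-####-####-####-######## "
        "on disk group ######-####-####-####-########."
    )
    print(parse_vob_event(sample))
```

Components reported with the unrecoverableerror event are the ones to follow up on first, because the errorfixed event indicates that vSAN already repaired the data.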

Note: If the medium errors are detected in the metadata region, vSAN fails out the disk as described in the KB vSAN Disk Or Disk Group Fails With Medium Errors (81121). When the medium errors are in the data region of the disk, this does not happen.

The following types of medium errors may be seen in the logs:

0x3 0x3 0x0 - PERIPHERAL DEVICE WRITE FAULT

0x3 0x10 0x0 - ID CRC OR ECC ERROR

0x3 0x11 0x0 - UNRECOVERED READ ERROR

0x3 0x31 0x0 - MEDIUM FORMAT CORRUPTED
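The sense codes above can be matched against vmkernel.log lines programmatically. The following Python sketch is one possible way to do that, assuming log lines shaped like the samples in the Validation section. The function classify_medium_error, the regular expressions, and the device name in the example are hypothetical and only illustrate the idea.

```python
import re

# Sense key / ASC / ASCQ triples listed in this article.
MEDIUM_ERRORS = {
    (0x3, 0x03, 0x0): "Peripheral device write fault",
    (0x3, 0x10, 0x0): "ID CRC or ECC error",
    (0x3, 0x11, 0x0): "Unrecovered read error",
    (0x3, 0x31, 0x0): "Medium format corruption",
}

# Patterns based on the sample vmkernel.log lines in this article.
SENSE_RE = re.compile(
    r"Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)
DEVICE_RE = re.compile(r'(?:to dev|from device) "?([^"\s]+)"?')
LBA_RE = re.compile(r"LBA: (\d+)")


def classify_medium_error(line):
    """Return details for a medium error (sense key 0x3), else None."""
    sense_match = SENSE_RE.search(line)
    if sense_match is None:
        return None
    sense = tuple(int(value, 16) for value in sense_match.groups())
    if sense[0] != 0x3:  # only sense key 0x3 is a medium error
        return None
    device = DEVICE_RE.search(line)
    lba = LBA_RE.search(line)
    return {
        "device": device.group(1) if device else None,
        "sense": sense,
        "description": MEDIUM_ERRORS.get(sense, "Unlisted medium error"),
        "lba": int(lba.group(1)) if lba else None,
    }


if __name__ == "__main__":
    # "naa.example" is a placeholder device name, not a real identifier.
    sample = (
        'ScsiDeviceIO: 4176: Cmd(0x45be86a85348) 0x28, CmdSN 0x6cc49ca2 '
        'from world 0 to dev "naa.example" failed H:0x0 D:0x2 P:0x0 '
        "Valid sense data: 0x3 0x11 0x0 Medium Error, LBA: 1318295484"
    )
    print(classify_medium_error(sample))
```

Grouping the results by device makes it easier to see whether the errors are concentrated on a single disk, which is the disk to raise with the hardware vendor.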

Environment

VMware vSAN (All Versions)

Resolution

To be alerted to potential medium errors in your environment, create an alert in Log Insight, or in any third-party log monitoring application, that sends an email when these messages are logged.

Note: There is no method to prevent the logical or physical failure of disk blocks, because disks degrade over time. It is therefore best to be proactive: have log monitoring tools watch the logs for these types of events and take immediate action when they occur.

Set the scrubber to check more frequently by running esxcfg-advcfg -s <value> /VSAN/ObjectScrubsPerYear. The default value is 1, which scrubs once a year.

For example, you can set the value to 2 for twice a year or 4 for quarterly. VMware recommends not going any higher than 12, which would scrub once a month, to avoid any potential performance issues in the environment. If performance issues are encountered in the environment decrease the value until the optimal balance of performance/scrubbing is found for the environment.

Note: Increasing the scrubber value consumes more CPU and memory, which can introduce latency in the environment, so do not set the scrubber to run too aggressively.
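To make the trade-off concrete, the following Python sketch converts an ObjectScrubsPerYear value into the approximate number of days between scrubs and builds the matching esxcfg-advcfg command line. The helper names are hypothetical, the 365-day year is a simplification, and rejecting values below 1 is an assumption of this sketch. The upper bound of 12 is the recommendation given above.

```python
SCRUB_SETTING = "/VSAN/ObjectScrubsPerYear"
MAX_RECOMMENDED = 12  # upper bound recommended in this article


def scrub_interval_days(scrubs_per_year):
    """Approximate days between scrubs, assuming a 365-day year."""
    return 365 / scrubs_per_year


def scrub_command(scrubs_per_year):
    """Build the esxcfg-advcfg command for the given value.

    Values below 1 are rejected as an assumption of this sketch.
    """
    if not 1 <= scrubs_per_year <= MAX_RECOMMENDED:
        raise ValueError(
            f"value must be between 1 and {MAX_RECOMMENDED}"
        )
    return f"esxcfg-advcfg -s {scrubs_per_year} {SCRUB_SETTING}"


if __name__ == "__main__":
    for value in (1, 2, 4, 12):
        days = scrub_interval_days(value)
        print(f"{scrub_command(value)}  # about every {days:.0f} days")
```

The current value can be read back on the host with esxcfg-advcfg -g /VSAN/ObjectScrubsPerYear before and after the change.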

If medium errors are encountered in the data region of the disk and the disk wasn't failed out by vSAN you can do one of the following:

Preferred method: contact the hardware vendor and have the disk replaced.
Otherwise:
1. If Deduplication is not enabled:
  - Perform a pre-check by navigating to vSAN Cluster > Monitor > Data Migration Pre-check. Under Pre-check Data Migration For, select OBJECT, then go to the problematic host and select the problematic disk (as seen in the logs) that needs to be removed from the disk group and re-added.
  - Remove the disk with medium errors from the disk group and then add it back. When the disk is added back to the disk group, the disk automatically reallocates the bad blocks so that they are not used again.
2. If Deduplication is enabled:
  - Remove the disk group that contains the disk with medium errors from the host and recreate the disk group. When the disk group is recreated, the disk automatically reallocates the bad blocks so that they are not used again.
    Note: This results in a resync to rebuild data due to the disk/disk group being removed and then recreated/added back.
3. Backup, clone, or consolidation operations may fail when the error "vSAN detected an unrecoverable medium or checksum error for component" is logged. Check which object is affected, create a new storage policy with the checksum option disabled, and apply that policy to the affected object only, not to the entire VM.

Additional Information

Enable alert in vCenter for vSAN checksum errors detected in the host logs