VMware vSAN disk encounters medium errors but not failed out by vSAN

Article ID: 326850

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • Taking a snapshot fails. The following error appears in vCenter > VM > Tasks:

    "An error occurred while saving the snapshot: msg.changetracker.MIRRORCOPYSTATUS. An error occurred while taking a snapshot: msg.changetracker.MIRRORCOPYSTATUS"

  • Backup jobs fail.

Validation:

  • Messages similar to the following appear in the vmkernel log:

    2022-12-24T10:46:46.161Z cpu1:2098055)ScsiDeviceIO: 4176: Cmd(0x45be86a85348) 0x28, CmdSN 0x6cc49ca2 from world 0 to dev "naa.################" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 Medium Error, LBA: 1318295484
    2022-12-24T10:46:46.663Z cpu8:2098047)ScsiDeviceIO: 4176: Cmd(0x45bec4adb1c8) 0x28, CmdSN 0x6cc4a979 from world 0 to dev "naa.################" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 Medium Error, LBA: 1318296508
    2022-12-24T10:46:46.679Z cpu15:2098047)HPP: HppThrottleLogForDevice:1109: Error status H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. from device naa.xxxxxxxxxxxxxxxx repeated 1 times,  hppAction = 1
    2022-12-24T10:46:46.679Z cpu15:2098047)WARNING: HPP: HppThrottleLogForDevice:1144: Error status H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. hppAction = 1
    2022-12-24T10:46:46.679Z cpu15:2098047)ScsiDeviceIO: 4176: Cmd(0x45be57ecd648) 0x28, CmdSN 0x6cc4a985 from world 0 to dev "naa.################" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 Medium Error, LBA: 1318295484
    2022-12-24T10:46:47.180Z cpu14:2098051)ScsiDeviceIO: 4176: Cmd(0x45be6e1c4488) 0x28, CmdSN 0x6cc4b59e from world 0 to dev "naa.################" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 Medium Error, LBA: 1318296508

    2025-04-19T16:42:45.526Z cpu4:2106057 opID=bc9005a)Partition: 433: Failed read for "naa.################": I/O error
    2025-04-19T16:42:45.526Z cpu4:2106057 opID=bc9005a)Partition: 1109: Failed to read protective mbr on "naa.################" : I/O error
    2025-04-19T16:42:45.526Z cpu4:2106057 opID=bc9005a)WARNING: Partition: 1262: Partition table read from device naa.################ failed: I/O error
    2025-04-19T16:42:45.526Z cpu47:2099121)ScsiDeviceIO: 4062: Cmd(0x45a1d8a06580) 0x28, CmdSN 0x1 from world 2106019 to dev "naa.################" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x31 0x0.

  • If not addressed promptly, this can also lead to checksum errors on objects:

    2023-01-08T05:31:52.022Z cpu27:2099333)WARNING: LSOM: LSOMReadVerifyChecksum:4397: Throttled: Checksum error detected on component 3134ba63-####-####-####-########910, comp offset 210520969216 (computed CRC 0x0 != saved CRC 0x81bf8868 (faked: Y)
    2023-01-08T05:31:56.297Z cpu19:2099333)WARNING: LSOM: LSOMScrubReadComplete:2827: Throttled: Checksum error detected on component 3134ba63-####-####-####-########910, comp offset 210520969216 (computed CRC 0x0 != saved CRC 0x81bf8868 (faked: Y))
    2023-01-08T05:32:02.042Z cpu7:2099333)WARNING: LSOM: LSOMReadVerifyChecksum:4397: Throttled: Checksum error detected on component 3134ba63-####-####-####-########910, comp offset 210520969216 (computed CRC 0x0 != saved CRC 0x81bf8868 (faked: Y))
    2023-01-05T19:03:27.576Z: [vSANCorrelator] 4820779458341us: [vob.vsan.dom.unrecoverableerror] vSAN detected an unrecoverable medium or checksum error for component 9c0cb35e-####-####-####-########610 on disk group 523a74b9-####-####-####-########f09.
    2022-10-23T03:15:13.877Z: [vSANCorrelator] 22532796272us: [vob.vsan.dom.errorfixed] vSAN detected and fixed a medium or checksum error for component 938c5363-####-####-####-########610 on disk group 52106776-####-####-####-########862.

    As a result, the guest operating system experiences I/O errors, and backup jobs, snapshots, and cloning operations fail.

    Note: If the medium errors are detected in the metadata region, vSAN fails out the disk, as described in the KB article vSAN Disk Or Disk Group Fails With Medium Errors (81121). When the medium errors are in the data region of the disk, vSAN does not fail out the disk.
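To see which components are affected, the [vSANCorrelator] events shown above can be separated into unrecoverable and fixed errors. The following Python snippet is a minimal sketch, assuming the log lines follow the same format as the samples above; the function name summarize_vob_events is hypothetical and not part of any VMware tool.

```python
import re

# Matches the two vSANCorrelator event types shown in the sample log lines.
VOB_RE = re.compile(
    r"\[vob\.vsan\.dom\.(?P<kind>unrecoverableerror|errorfixed)\].*?"
    r"for component (?P<component>\S+) on disk group (?P<disk_group>[^\s.]+)"
)


def summarize_vob_events(lines):
    """Group (component, disk group) pairs by event type.

    Hypothetical helper: assumes each event is logged on a single line
    in the format shown in this article.
    """
    events = {"unrecoverableerror": set(), "errorfixed": set()}
    for line in lines:
        match = VOB_RE.search(line)
        if match:
            events[match["kind"]].add((match["component"], match["disk_group"]))
    return events
```

Components listed under "unrecoverableerror" are the ones to investigate first, because vSAN could not repair them from another replica.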

Some of the medium error types that may be seen in the logs (sense key 0x3, followed by the additional sense code and qualifier):
  • 0x3 0x3 0x0 - Peripheral device write fault
  • 0x3 0x10 0x0 - ID CRC or ECC error
  • 0x3 0x11 0x0 - Unrecovered read error
  • 0x3 0x31 0x0 - Medium format corrupted
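The sense codes above can be decoded from a ScsiDeviceIO log line programmatically. The following Python snippet is a minimal sketch, assuming the line format matches the vmkernel samples in this article; the function name decode_medium_error and the returned dictionary layout are hypothetical.

```python
import re

# Additional sense code / qualifier pairs listed in this article (sense key 0x3).
MEDIUM_ERROR_CODES = {
    (0x03, 0x00): "Peripheral device write fault",
    (0x10, 0x00): "ID CRC or ECC error",
    (0x11, 0x00): "Unrecovered read error",
    (0x31, 0x00): "Medium format corrupted",
}

SENSE_RE = re.compile(
    r'dev "(?P<dev>[^"]+)" failed .*?'
    r"Valid sense data: (?P<key>0x[0-9a-fA-F]+) (?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)"
    r"(?:.*?LBA: (?P<lba>\d+))?"
)


def decode_medium_error(line):
    """Return details for a medium error line, or None for any other line."""
    match = SENSE_RE.search(line)
    if match is None or int(match["key"], 16) != 0x3:
        return None  # not a SCSI failure with sense key 0x3 (medium error)
    asc, ascq = int(match["asc"], 16), int(match["ascq"], 16)
    return {
        "device": match["dev"],
        "sense": (0x3, asc, ascq),
        # Codes not listed in this article fall back to a generic label.
        "description": MEDIUM_ERROR_CODES.get((asc, ascq), "Other medium error"),
        "lba": int(match["lba"]) if match["lba"] else None,
    }
```

Lines that report an LBA make it possible to check whether the same block fails repeatedly; lines without one, such as the 0x3 0x31 0x0 sample, return None for "lba".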

Environment

VMware vSAN (All Versions)

Resolution

To be alerted to potential medium errors in your environment, create an alert in Log Insight, or in any third-party log monitoring application, that sends an email when these messages are logged.

Note: There is no way to prevent the logical or physical failure of disk blocks as disks degrade over time. It is therefore best to be proactive: have a log monitoring tool watch the logs for these events, and take immediate action when they occur.
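Where no log monitoring product is available, the same check can be approximated with a short script. The following Python snippet is a minimal sketch that counts medium error lines per device and returns the devices that reach a threshold. The function name devices_to_alert and the default threshold of 1 are assumptions for illustration, not VMware recommendations.

```python
import re
from collections import Counter

# A failed SCSI command whose sense key is 0x3 (medium error).
MEDIUM_RE = re.compile(
    r'dev "([^"]+)" failed .*Valid sense data: 0x3 0x[0-9a-fA-F]+ 0x[0-9a-fA-F]+'
)


def devices_to_alert(lines, threshold=1):
    """Return {device: count} for devices with at least `threshold` medium errors."""
    matches = (MEDIUM_RE.search(line) for line in lines)
    counts = Counter(m.group(1) for m in matches if m)
    return {dev: n for dev, n in counts.items() if n >= threshold}
```

The returned dictionary can be passed to whatever notification mechanism is already in use, for example an email script.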

Set the scrubber to run more frequently. The default value is 1 (once a year). To change it, run esxcfg-advcfg -s <value> /VSAN/ObjectScrubsPerYear

For example, set the value to 2 to scrub twice a year, or to 4 to scrub quarterly. VMware recommends not going higher than 12, which scrubs once a month, to avoid potential performance issues in the environment. If performance issues are encountered, decrease the value until the optimal balance between performance and scrubbing frequency is found.

Note: Increasing the scrubber value consumes additional CPU and memory resources, which can introduce latency in the environment, so do not schedule the scrubber too aggressively.
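The relationship between the ObjectScrubsPerYear value and the resulting scrub interval can be sketched as follows. This Python snippet is a minimal illustration that validates a value against the recommended range of 1 to 12 and builds the command shown above. The function name scrub_setting is hypothetical, and the interval assumes that scrubs are spread evenly across a 365-day year.

```python
def scrub_setting(scrubs_per_year):
    """Return (approximate days between scrubs, command to apply the value).

    The upper bound of 12 is the recommendation from this article,
    not a limit enforced by this sketch's target system.
    """
    if type(scrubs_per_year) is not int or not 1 <= scrubs_per_year <= 12:
        raise ValueError("use a whole number from 1 (default) to 12 (monthly)")
    interval_days = 365 / scrubs_per_year  # assumes evenly spaced scrubs
    command = f"esxcfg-advcfg -s {scrubs_per_year} /VSAN/ObjectScrubsPerYear"
    return interval_days, command
```

For example, scrub_setting(4) returns an interval of 91.25 days together with the command to run on each host.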

If medium errors are encountered in the data region of the disk and vSAN did not fail out the disk, do one of the following:
  1. Preferred method: contact the hardware vendor and have the disk replaced.
  2. Otherwise:
    1. If Deduplication is not enabled:
      • Run a pre-check by navigating to vSAN Cluster > Monitor > Data Migration Pre-check. Under Pre-check Data Migration For, select OBJECT, then select the problematic host and the problematic disk (as seen in the logs) that needs to be removed from the disk group and re-added.
      • Remove the disk with medium errors from the disk group and then add it back. When the disk is added back to the disk group, the disk automatically reallocates the bad blocks so that they are not used again.

    2. If Deduplication is enabled:
      • Remove the disk group that contains the disk with medium errors from the host, and then recreate the disk group. When the disk group is recreated, the disk automatically reallocates the bad blocks so that they are not used again.
        Note: This results in a resync to rebuild data, because the disk or disk group is removed and then recreated or added back.

    3. Backups may fail when the error "vSAN detected an unrecoverable medium or checksum error for component" is logged. In that case, identify the affected object, create a new storage policy with the checksum option disabled, and apply that policy to the affected object only, not to the entire VM.