This KB helps you identify whether a vSAN disk or disk group has failed because of medium errors detected in the metadata or dedupe metadata region of the disk, and helps you resolve future occurrences.
Impact/Risks: The process presented below has no additional impacts or risks. The underlying problem could result in a data unavailability (DU) or data loss (DL) situation if multiple disk groups experience the same issue, or if redundancy is lost for another reason after the disk group fails and before vSAN has rebuilt the data to another disk or disk group. There is no method to recover the disk group intact once the physical disk blocks have failed and led to the unrecovered read error.
Symptoms: You see the following messages in vmkernel.log:
2020-08-12T13:16:13.170Z cpu1:1000341424)ScsiDeviceIO: SCSICompleteDeviceCommand:4267: Cmd(0x45490152eb40) 0x28, CmdSN 0x11 from world 0 to dev "mpx.vmhba0:C0:T3:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x10 0x0
2020-08-12T13:16:13.170Z cpu1:1000341424)LSOMCommon: IORETRY_handleCompletionOnError:1723: Throttled: 0x454bc05ff900 IO type 264 (READ) isOrdered:NO isSplit:NO isEncr:NO since 0 msec status Read error
2020-08-21T06:57:24.333Z cpu0:1000341425)ScsiDeviceIO: SCSICompleteDeviceCommand:4267: Cmd(0x4549015c8840) 0x2a, CmdSN 0x6 from world 0 to dev "mpx.vmhba0:C0:T3:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x3 0x0
2020-08-21T06:57:24.333Z cpu0:1000341425)LSOMCommon: IORETRYCompleteIO:470: Throttled: 0x454beebff940 IO type 304 (WRITE) isOrdered:NO isSplit:YES isEncr:NO since 0 msec status Write error
2019-11-03T11:16:06.462Z cpu56:66446)NMP: nmp_ThrottleLogForDevice:3616: Cmd 0x28 (0x439dc176a8c0, 0) to dev "mpx.vmhba0:C2:T1:L0" on path "vmhba0:C2:T1:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0. Act:NONE
2019-11-03T11:16:06.462Z cpu56:66446)ScsiDeviceIO: 3015: Cmd(0x439dc176a8c0) 0x28, CmdSN 0x19b2 from world 0 to dev "mpx.vmhba0:C2:T1:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.
2019-11-03T11:16:06.462Z cpu56:66446)LSOMCommon: IORETRY_handleCompletionOnError:2461: Throttled: 0x439862b7e640 IO type 264 (READ) isOrdered:NO isSplit:YES isEncr:YES since 23 msec status I/O error
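In the SCSI failure lines above, the three values after "Valid sense data:" are the sense key, ASC, and ASCQ. A sense key of 0x3 is the SCSI MEDIUM ERROR key, which is what marks these failures as media problems rather than path or host problems. The following Python sketch shows one way to pick such lines out of a saved copy of vmkernel.log. The helper name find_medium_errors and the regular expression are illustrative assumptions, not part of any VMware tool, and the lookup table covers only the ASC/ASCQ pairs seen in the excerpts above.

```python
import re

# Hypothetical helper, not a VMware utility: extracts the device name and
# the three sense bytes from an ESXi SCSI failure line.
SENSE_RE = re.compile(
    r'to dev "(?P<dev>[^"]+)".*?'
    r'D:0x[0-9a-fA-F]+ P:0x[0-9a-fA-F]+ '
    r'Valid sense data: (?P<key>0x[0-9a-fA-F]+) '
    r'(?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)'
)

MEDIUM_ERROR = 0x3  # SCSI sense key 0x3

# Only the ASC/ASCQ pairs that appear in the excerpts above; not a full table.
ASC_ASCQ = {
    (0x10, 0x00): "ID CRC or ECC error",
    (0x11, 0x00): "Unrecovered read error",
    (0x03, 0x00): "Peripheral device write fault",
}


def find_medium_errors(lines):
    """Return (device, asc, ascq, description) for each medium-error line."""
    hits = []
    for line in lines:
        m = SENSE_RE.search(line)
        if m is None:
            continue  # not a SCSI failure line with valid sense data
        key, asc, ascq = (int(m.group(g), 16) for g in ("key", "asc", "ascq"))
        if key != MEDIUM_ERROR:
            continue  # some other sense key, not a medium error
        desc = ASC_ASCQ.get((asc, ascq), "other medium error")
        hits.append((m.group("dev"), asc, ascq, desc))
    return hits


if __name__ == "__main__":
    sample = [
        '2019-11-03T11:16:06.462Z cpu56:66446)ScsiDeviceIO: 3015: '
        'Cmd(0x439dc176a8c0) 0x28, CmdSN 0x19b2 from world 0 to dev '
        '"mpx.vmhba0:C2:T1:L0" failed H:0x0 D:0x2 P:0x0 '
        'Valid sense data: 0x3 0x11 0x0.',
    ]
    for hit in find_medium_errors(sample):
        print(hit)
```

The sketch expects one log entry per line, so run it against the raw log file rather than text copied from a web page where entries may have been joined together.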
These messages may be followed by a permanent error event:
2019-11-03T11:16:06.462Z cpu46:66973)WARNING: PLOG: PLOGPropagateErrorInt:2821: Permanent error event on ########-####-####-####-########5b18
If deduplication and compression are enabled, you also see messages similar to the following:
2019-11-03T11:16:06.462Z cpu3:67299)WARNING: PLOG: DDPCompleteDDPWrite:2992: Throttled: DDP write failed I/O error callback
[email protected]#0.0.0.1
2019-11-03T11:16:06.462Z cpu3:67299)WARNING: PLOG: PLOGDDPCallbackFn:234: Throttled: DDP write failed I/O error
2019-11-03T11:16:06.462Z cpu46:66973)WARNING: PLOG: PLOGPropagateError:2880: DDP: Propagating error state from original device mpx.vmhba0:C2:T2:L0:2
2019-11-03T11:16:06.462Z cpu46:66973)WARNING: PLOG: PLOGPropagateError:2921: DDP: Propagating error state to MDs in device naa.5000cca09b0136c4:2
2019-11-03T11:16:06.462Z cpu13:11600681)LSOM: LSOMEventNotify:6734: Throttled: Event 2: waiting for mount helper for disk ########-####-####-####-########5b18
2019-11-03T11:16:06.462Z cpu13:11600681)LSOM: LSOMLogDiskEvent:5759: Disk Event permanent error propagated for MD ########-####-####-####-########d291 (mpx.vmhba0:C2:T1:L0:2)
2019-11-03T11:16:06.462Z cpu13:11600681)WARNING: LSOM: LSOMEventNotify:6886: Virtual SAN device ########-####-####-####-########d291 is under propagated permanent error.
2019-11-03T11:16:06.462Z cpu13:11600681)LSOM: LSOMLogDiskEvent:5759: Disk Event permanent error propagated for SSD ########-####-####-####-########3e0e (naa.5000cca09b0136c4:2)
2019-11-03T11:16:06.462Z cpu13:11600681)WARNING: LSOM: LSOMEventNotify:6886: Virtual SAN device ########-####-####-####-########3e0e is under propagated permanent error.
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
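To see which devices a failure has marked, you can collect the "permanent error propagated" entries from the log. The following Python sketch does this for the LSOMLogDiskEvent lines shown above. The function name failed_devices and the regular expression are hypothetical, written only to match the format of these excerpts. It also assumes that the MD and SSD labels in those lines distinguish capacity devices from the cache device, which you should confirm against your own disk group layout.

```python
import re

# Hypothetical pattern based on the LSOMLogDiskEvent excerpts above.
PROPAGATED_RE = re.compile(
    r'Disk Event permanent error propagated for '
    r'(?P<role>MD|SSD) (?P<uuid>\S+) \((?P<device>[^)]+)\)'
)


def failed_devices(lines):
    """Return one dict (role, uuid, device) per propagated permanent error."""
    found = []
    for line in lines:
        m = PROPAGATED_RE.search(line)
        if m is not None:
            found.append(m.groupdict())
    return found


if __name__ == "__main__":
    sample = [
        '2019-11-03T11:16:06.462Z cpu13:11600681)LSOM: LSOMLogDiskEvent:5759: '
        'Disk Event permanent error propagated for MD '
        '########-####-####-####-########d291 (mpx.vmhba0:C2:T1:L0:2)',
    ]
    for entry in failed_devices(sample):
        print(entry["role"], entry["uuid"], entry["device"])
```

Because a permanent error on one device is propagated to the other devices in the disk group, the output can list several devices. The device that logged the original SCSI medium error is the one to investigate first.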