vSAN disks may be marked as permanently failed when using firmware HPG0/HPG1/HPG2 on specific SSD models

Article ID: 319916

Products

VMware vSAN

Issue/Introduction

Symptoms:

Disks or disk groups may be marked as being under permanent error, and vmkernel.log shows Partition module messages reporting that a partition table read failed with "Out of memory".

Environment

VMware vSAN 6.x
VMware vSAN 7.x
VMware vSAN 8.x

Cause

When queried, some models of HPE SSDs running firmware version HPG0, HPG1 or HPG2 may return an invalid response. As a result, the device is marked as format corrupt, and vSAN in turn marks it as being under permanent error.

From vmkernel.log:
2020-01-12T18:13:05.071Z cpu7:2102011 opID=10005d0c)WARNING: Partition: 1261: Partition table read from device mpx.vmhba0:C0:T66:L0 failed: Out of memory
2020-01-12T18:13:05.072Z cpu2:2100632)LSOMCommon: IORETRYCompleteIO:463: Throttled:  0x459bf168bcc0 IO type 296 (READ) isOrdered:NO isSplit:NO isEncr:NO since 0 msec status I/O error
2020-01-12T18:13:05.072Z cpu5:2099468)WARNING: PLOG: PLOGPropagateError:3061: DDP: Propagating error state from original device xxxxxxx-1234-9876-0fd1-xxxxxxxxxx
2020-01-12T18:13:05.072Z cpu5:2099468)WARNING: PLOG: PLOGPropagateError:3103: DDP: Propagating error state to MDs in device xxxxxxx-9876-f452-ee27-xxxxxxxxxx
2020-01-12T18:13:05.072Z cpu5:2099468)WARNING: PLOG: PLOGPropagateErrorInt:3018: Error/unhealthy propagate event on xxxxxxx-abc1-cf12-7209-xxxxxxxxxx
2020-01-12T18:13:05.072Z cpu21:2100270)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD xxxxxxx-abc1-cf12-7209-xxxxxxxxxx (mpx.vmhba1:C0:T66:L0:2)
2020-01-12T18:13:05.072Z cpu5:2099468)WARNING: PLOG: PLOGPropagateErrorInt:3002: Permanent error event on xxxxxxx-1234-9876-0fd1-xxxxxxxxxx
2020-01-12T18:13:05.072Z cpu21:2100270)WARNING: LSOM: LSOMEventNotify:7763: vSAN device xxxxxxx-abc1-cf12-7209-xxxxxxxxxx is under propagated permanent error.
2020-01-12T18:13:05.072Z cpu5:2099468)WARNING: PLOG: PLOGPropagateErrorInt:3018: Error/unhealthy propagate event on xxxxxxx-78ef-3ef6-6ef2-xxxxxxxxxx
2020-01-12T18:13:05.072Z cpu5:2099468)WARNING: PLOG: PLOGPropagateErrorInt:3018: Error/unhealthy propagate event on xxxxxxx-ce12-1245-e378-xxxxxxxxxx
2020-01-12T18:13:05.072Z cpu5:2099468)WARNING: PLOG: PLOGPropagateErrorInt:3018: Error/unhealthy propagate event on xxxxxxx-92ee-17e5-60c4-xxxxxxxxxx
2020-01-12T18:13:05.072Z cpu5:2099468)WARNING: PLOG: PLOGPropagateErrorInt:3018: Error/unhealthy propagate event on xxxxxxx-9876-f452-ee27-xxxxxxxxxx
2020-01-12T18:13:05.072Z cpu21:2100270)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error for MD xxxxxxx-1234-9876-0fd1-xxxxxxxxxx (mpx.vmhba0:C0:T66:L0:2)
2020-01-12T18:13:05.072Z cpu21:2100270)WARNING: LSOM: LSOMEventNotify:7752: vSAN device xxxxxxx-1234-9876-0fd1-xxxxxxxxxx is under permanent error.
2020-01-12T18:13:05.072Z cpu21:2100270)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD xxxxxxx-78ef-3ef6-6ef2-xxxxxxxxxx (mpx.vmhba1:C0:T65:L0:2)
2020-01-12T18:13:05.072Z cpu21:2100270)WARNING: LSOM: LSOMEventNotify:7763: vSAN device xxxxxxx-78ef-3ef6-6ef2-xxxxxxxxxx is under propagated permanent error.
2020-01-12T18:13:05.072Z cpu21:2100270)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD xxxxxxx-ce12-1245-e378-xxxxxxxxxx (mpx.vmhba0:C0:T65:L0:2)
2020-01-12T18:13:05.072Z cpu21:2100270)WARNING: LSOM: LSOMEventNotify:7763: vSAN device xxxxxxx-ce12-1245-e378-8d3c84d518e7 is under propagated permanent error.
2020-01-12T18:13:05.072Z cpu21:2100270)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for MD xxxxxxx-92ee-17e5-60c4-xxxxxxxxxx (mpx.vmhba1:C0:T64:L0:2)
2020-01-12T18:13:05.072Z cpu21:2100270)WARNING: LSOM: LSOMEventNotify:7763: vSAN device xxxxxxx-92ee-17e5-60c4-xxxxxxxxxx is under propagated permanent error.
2020-01-12T18:13:05.072Z cpu21:2100270)LSOM: LSOMLogDiskEvent:7509: Disk Event permanent error propagated for SSD xxxxxxx-9876-f452-ee27-xxxxxxxxxx (naa.xxxxxxxxxxxxx:2)
2020-01-12T18:13:05.072Z cpu21:2100270)WARNING: LSOM: LSOMEventNotify:7763: vSAN device xxxxxxx-9876-f452-ee27-xxxxxxxxxx is under propagated permanent error.
2020-01-12T18:13:05.072Z cpu7:2102011 opID=10005d0c)WARNING: ScsiDeviceIO: 10109: Mode Sense cmd reported block size 512, does not match the current logical block size 3221925504(with physical block size 3221925504) for device.
2020-01-12T18:13:05.072Z cpu7:2102011 opID=10005d0c)WARNING: ScsiDeviceIO: 10111: The device mpx.vmhba0:C0:T66:L0 is marked format corrupt.
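
The two most specific lines in the excerpt above are the "marked format corrupt" warning and the non-propagated "is under permanent error" warning. As a rough sketch, they can be pulled out of a log with sed. The function names are hypothetical helpers, not part of ESXi, and the patterns assume the message wording shown in this excerpt, which may differ between releases.

```shell
# Hypothetical helpers: both read vmkernel.log text on stdin.

# Print each device that the log reports as "marked format corrupt",
# one per line, de-duplicated.
format_corrupt_devices() {
    sed -n 's/.*The device \([^ ]*\) is marked format corrupt\..*/\1/p' | sort -u
}

# Print each vSAN device UUID reported as "under permanent error".
# Lines saying "under propagated permanent error" are deliberately not matched,
# so only the originating device is listed.
permanent_error_uuids() {
    sed -n 's/.*vSAN device \([^ ]*\) is under permanent error\..*/\1/p' | sort -u
}
```

On a host, this might be run as `format_corrupt_devices < /var/log/vmkernel.log`; the log path is an assumption for a default installation.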


If the impacted device is a cache-tier device, or a capacity-tier device in a cluster with deduplication enabled, the entire disk group is impacted and must be remediated. Otherwise, only the affected device needs to be remediated.
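
The scoping rule above can be written down as a small decision helper. This is only an illustration of the rule as stated in this article; the function name and its argument values ("cache", "capacity", "true", "false") are hypothetical.

```shell
# Hypothetical helper: print the remediation scope for an impacted device.
#   $1: tier of the impacted device ("cache" or "capacity")
#   $2: whether deduplication is enabled on the cluster ("true" or "false")
remediation_scope() {
    tier=$1
    dedup=$2
    if [ "$tier" = "cache" ] || { [ "$tier" = "capacity" ] && [ "$dedup" = "true" ]; }; then
        echo "disk-group"   # the whole disk group must be remediated
    else
        echo "device"       # only the affected device needs remediation
    fi
}
```

For example, `remediation_scope capacity false` prints `device`, while `remediation_scope cache false` prints `disk-group`.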

Resolution

Upgrade the firmware on all SSDs of affected models to firmware version HPG3 or later.
Select the firmware package that matches the HPE model number of each drive:

Drive firmware HPG3 for VK000240GWSRQ, VK000480GWSRR, VK000960GWSRT, VK001920GWSRU and VK003840GWSRV drives:
VMware: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_4dd924be589d454db44f73e6d7

Drive firmware HPG2 for MK000480GWSSC, MK000960GWSSD, MK001920GWSSE, MK003840GWSSF, VK000240GWSRQ, VK000960GWSRT, VK001920GWSRU, VK003840GWSRV, MK000480GWSSC, MK000960GWSSD and MK001920GWSSE drives:
VMware: https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_907cb657605141fba938906fed

Note: For detailed information on the affected drive models and firmware versions, refer to the HPE advisory.
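
To see which drives on a host report one of the firmware revisions named in this article, the Model and Revision fields of `esxcli storage core device list` can be filtered. The sketch below is a hypothetical helper that assumes the usual layout of that output (an unindented device name followed by indented "Model:" and "Revision:" lines). It only flags revisions HPG0, HPG1 and HPG2; whether a particular model and revision pair is actually affected should be confirmed against the HPE advisory.

```shell
# Hypothetical helper: reads `esxcli storage core device list` output on stdin
# and prints "<device> <model> <revision>" for every device whose firmware
# revision is HPG0, HPG1 or HPG2.
list_flagged_firmware() {
    awk '
        /^[^[:space:]]/ { dev = $1; model = ""; next }   # unindented line starts a device record
        /^ +Model:/     { model = $2; next }             # first token of the model string
        /^ +Revision:/  { if ($2 ~ /^HPG[0-2]$/) print dev, model, $2 }
    '
}
```

A possible invocation is `esxcli storage core device list | list_flagged_firmware`.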

Workaround:
If this specific issue has been encountered, the disk or disk group itself is likely healthy and can be restored to a functional state by unmounting and remounting it from the CLI:
# esxcli vsan storage diskgroup unmount -u <disk/diskGroup-UUID>
# esxcli vsan storage diskgroup mount -u <disk/diskGroup-UUID>

The UUIDs of the disk and its disk group can be found using the command below:
# esxcli vsan storage list
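
The output of this command is long, so it can help to reduce it to the fields needed for the unmount and mount commands. The following is a minimal sketch with a hypothetical function name. It assumes the indented "Device:", "VSAN UUID:" and "VSAN Disk Group UUID:" labels that `esxcli vsan storage list` normally prints, with each disk's record introduced by an unindented device name.

```shell
# Hypothetical helper: reads `esxcli vsan storage list` output on stdin and
# prints "<device> <VSAN UUID> <VSAN Disk Group UUID>" for each disk.
vsan_disk_uuids() {
    awk '
        function flush() {
            if (dev != "") print dev, uuid, dg
            dev = uuid = dg = ""
        }
        /^[^[:space:]]/            { flush() }     # unindented line starts a new disk record
        /^ +Device:/               { dev = $2 }
        /^ +VSAN UUID:/            { uuid = $3 }
        /^ +VSAN Disk Group UUID:/ { dg = $5 }
        END                        { flush() }     # emit the last record
    '
}
```

A possible invocation is `esxcli vsan storage list | vsan_disk_uuids`; the second or third column can then be supplied as the `-u` value.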

Additional Information

Further information relating to this issue can be found in the associated HPE advisory:
** CRITICAL ** Online HDD/SSD Flash Component for ESXi - VK000240GWSRQ, VK000480GWSRR, VK000960GWSRT, VK001920GWSRU, VK003840GWSRV Drives

Impact/Risks:
Data may have reduced redundancy or be inaccessible while the disk or disk group is unavailable.