For some I/O workloads in a hybrid vSAN environment, an out-of-memory condition can occur in the vSAN layers, resulting in failed I/O and vSAN marking the targeted disk group with a permanent error state
Article ID: 318117
Products
VMware vSAN
Issue/Introduction
Impact/Risks: vSAN marks disks as permanently failed, which can take the affected disk group offline.
Symptoms: The following vmkernel entries may indicate the issue:
2019-03-07T03:19:32.170Z cpu22:2105320)WARNING: LSOMCommon: IORETRYQIoInt:1698: Throttled: Cannot create qEntry. Disk ########-####-####-####-########caff ; numQ'd: 29, numOutOrdr: 0, numOut: 22731, numSlbAlloc: 22758, maxCnt: 23276, failCnt: 1.
2019-03-07T03:19:32.170Z cpu22:2105320)WARNING: LSOMCommon: IORETRYSplitAndQIOs:2059: Throttled: Failed enqueuing an IO with status Out of memory
2019-03-07T03:19:32.170Z cpu22:2105320)WARNING: LSOMCommon: IORETRYParentIODoneCB:1782: Throttled: split status Out of memory
2019-03-07T03:19:32.170Z cpu22:2105320)WARNING: PLOG: PLOGPropagateErrorInt:2978: Permanent error event on ########-####-####-####-########caff
2019-03-07T03:19:32.170Z cpu33:2105491)LSOM: LSOMLogDiskEvent:7472: Disk Event permanent error for SSD ########-####-####-####-########caff (naa.55cd2e404b4d035f:2)
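To check a host for these signatures, the log can be searched for the two most specific messages. The following is a minimal sketch, not a VMware-supplied tool: the function name count_lsom_oom_events is hypothetical, and the log path in the usage comment assumes the vmkernel log is at its usual location on the host.

```shell
#!/bin/sh
# Hypothetical helper: count the lines in a vmkernel log that match the
# out-of-memory enqueue failure or the PLOG permanent error event shown above.
count_lsom_oom_events() {
    log_file="$1"
    # -c prints the number of matching lines; -E enables the alternation.
    grep -c -E 'Failed enqueuing an IO with status Out of memory|PLOGPropagateErrorInt:[0-9]+: Permanent error event' "$log_file"
}

# Example usage on an ESXi host (log path is an assumption):
# count_lsom_oom_events /var/log/vmkernel.log
```

A non-zero count only suggests the issue; the surrounding entries should still be compared with the full sequence above.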
Environment
VMware vSAN 6.7.x
VMware vSAN 6.5.x
Resolution
The workaround below is included in the 6.7 U3 code. For the permanent fix, upgrade to 7.0 or later.
Workaround: Applying this change on its own may help mitigate the problem. In a production build, set the maxQueudIos config option to 100000. After the value is set, every disk group on the ESXi host must be unmounted and remounted for the change to take effect.
Steps to implement the workaround:
1. Put one ESXi host into Maintenance Mode using the Ensure Accessibility option.
2. Set the new value for maxQueudIos:
   # esxcfg-advcfg --set 100000 /LSOM/maxQueudIos
3. Unmount each vSAN disk group on the ESXi host:
   # esxcli vsan storage diskgroup unmount -s "naa ID of the SSD cache disk"
4. Remount each vSAN disk group on the ESXi host:
   # esxcli vsan storage diskgroup mount -s "naa ID of the SSD cache disk"
5. Exit Maintenance Mode.
6. Repeat steps 1-5 for each remaining host in the cluster.
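The set, unmount, and remount steps can be scripted per host. The sketch below is illustrative only and is not a VMware-supplied tool. It assumes the host is already in Maintenance Mode, and that `esxcli vsan storage list` prints a `VSAN Disk Group Name:` field holding the cache disk's naa ID for every claimed disk. The function names and the `DRY_RUN` variable are hypothetical, introduced here for illustration.

```shell
#!/bin/sh
# Sketch only: apply the workaround on one host that is already in
# Maintenance Mode. DRY_RUN=1 (the default) prints commands instead of
# running them; DRY_RUN is a hypothetical switch, not an ESXi setting.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Read `esxcli vsan storage list` output on stdin and print each disk
# group's cache-disk naa ID once. Assumes a "VSAN Disk Group Name:" field.
list_diskgroup_ssds() {
    sed -n 's/^[[:space:]]*VSAN Disk Group Name:[[:space:]]*//p' | sort -u
}

# Read cache-disk naa IDs on stdin (one per line), set the option, then
# unmount every disk group and remount every disk group.
apply_workaround() {
    ssds="$(cat)"
    run esxcfg-advcfg --set 100000 /LSOM/maxQueudIos
    for ssd in $ssds; do
        run esxcli vsan storage diskgroup unmount -s "$ssd"
    done
    for ssd in $ssds; do
        run esxcli vsan storage diskgroup mount -s "$ssd"
    done
}

# Example usage on the host (preview first, then rerun with DRY_RUN=0):
# esxcli vsan storage list | list_diskgroup_ssds | apply_workaround
```

Previewing with the default `DRY_RUN=1` makes it possible to confirm the detected naa IDs against the host's actual cache disks before any disk group is unmounted.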