For some I/O workloads in a hybrid vSAN environment, an out-of-memory condition can occur in the vSAN layers, resulting in failed I/O and vSAN marking the targeted disk group with a permanent error state

Article ID: 318117

Products

VMware vSAN

Issue/Introduction

Impact/Risks:
Disks are marked as permanently failed by vSAN, which can result in the affected disk group going offline.
 
Symptoms:
The following vmkernel log entries may indicate the issue:

2019-03-07T03:19:32.170Z cpu22:2105320)WARNING: LSOMCommon: IORETRYQIoInt:1698: Throttled: Cannot create qEntry. Disk ########-####-####-####-########caff ; numQ'd: 29, numOutOrdr: 0, numOut: 22731, numSlbAlloc: 22758, maxCnt: 23276, failCnt: 1.

2019-03-07T03:19:32.170Z cpu22:2105320)WARNING: LSOMCommon: IORETRYSplitAndQIOs:2059: Throttled: Failed enqueuing an IO with status Out of memory
2019-03-07T03:19:32.170Z cpu22:2105320)WARNING: LSOMCommon: IORETRYParentIODoneCB:1782: Throttled: split status Out of memory
2019-03-07T03:19:32.170Z cpu22:2105320)WARNING: PLOG: PLOGPropagateErrorInt:2978: Permanent error event on ########-####-####-####-########caff
2019-03-07T03:19:32.170Z cpu33:2105491)LSOM: LSOMLogDiskEvent:7472: Disk Event permanent error for SSD ########-####-####-####-########caff (naa.55cd2e404b4d035f:2)
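
To check whether a host is hitting this condition, you can search the vmkernel log for these signatures from the ESXi shell. This is a minimal sketch, assuming the default log location /var/log/vmkernel.log; the search patterns are taken from the sample entries above:

# grep -E "IORETRY.*Out of memory|PLOGPropagateErrorInt|Disk Event permanent error" /var/log/vmkernel.log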


Environment

VMware vSAN 6.7.x
VMware vSAN 6.5.x

Resolution

The workaround below has been incorporated into the vSAN 6.7 U3 code.
For a permanent fix, upgrade to vSAN 7.0 or later.

Workaround:
Applying this change on its own may help mitigate the problem. On a production build, this is done by setting the maxQueudIos config option to 100000. After this value is set, all disk groups on the ESXi host must be unmounted and remounted for the change to take effect. A consolidated command sketch follows the steps below.
Steps to implement the workaround:
 
  1. Put one ESXi host into Maintenance Mode, using the Ensure Accessibility option.
  2. Set the new value for maxQueudIos:
    1. # esxcfg-advcfg --set 100000 /LSOM/maxQueudIos
  3. Unmount each vSAN disk group on the ESXi host:
    1. # esxcli vsan storage diskgroup unmount -s "naa ID of the SSD cache disk"
  4. Remount each vSAN disk group on the ESXi host:
    1. # esxcli vsan storage diskgroup mount -s "naa ID of the SSD cache disk"
  5. Take the host out of Maintenance Mode.
  6. Repeat steps 1-5 for all remaining hosts in the cluster.
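
Taken together, steps 2-4 map onto an ESXi shell sequence like the sketch below. This is a minimal sketch, not an official script: the device name naa.55cd2e404b4d035f is only an example taken from the log excerpt above, and identifying the cache disk via "Is SSD: true" assumes a hybrid disk group, where the cache tier is the only SSD.

# Confirm the current value of the advanced option
# esxcfg-advcfg --get /LSOM/maxQueudIos

# Raise the queue limit
# esxcfg-advcfg --set 100000 /LSOM/maxQueudIos

# Find the cache-tier SSD of each disk group ("Is SSD: true" on a hybrid host)
# esxcli vsan storage list

# Unmount and remount the disk group through its cache SSD (example device name)
# esxcli vsan storage diskgroup unmount -s naa.55cd2e404b4d035f
# esxcli vsan storage diskgroup mount -s naa.55cd2e404b4d035f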

