NO_CONNECT error leading to premature VM IO failures or PSOD

Article ID: 313400

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

Virtual machine (VM) IOs can fail prematurely, or the ESXi host can fail with a purple diagnostic screen (PSOD), when local SCSI devices encounter IO failures due to a NO_CONNECT error.

Environment

VMware vSphere ESXi 8.0.2
VMware vSphere ESXi 8.0.1

Cause

For multipathed local SCSI devices (active-active presentation) claimed by HPP, VM IOs can fail prematurely on a connection loss (NO_CONNECT) instead of being retried on the other path. This is caused by a bug in the failover handling of the ESXi HPP module.
For local SCSI devices claimed by HPP that are in an APD (All Paths Down) state, a complete storage rescan triggered from vCenter Server or from the host can cause ESXi to PSOD. This is caused by a bug in HPP when it tries to proactively fail the rescan IOs because of the APD condition.

Resolution

Both issues can occur only when devices are claimed by HPP and encounter IO failures due to NO_CONNECT or APD. There is currently no resolution. The issues will be fixed in a future release.
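
Because only HPP-claimed devices are affected, it can help to check which devices HPP currently claims. The following is a minimal sketch, not an official procedure. It assumes that `esxcli storage core device list` prints each device identifier on an unindented line, followed by indented fields that include `Multipath Plugin`. The helper name `list_hpp_devices` is hypothetical, and the availability of `awk` in the ESXi shell is also an assumption.

```shell
# Print the identifier of every device whose "Multipath Plugin" field is HPP.
# Reads `esxcli storage core device list` style output on standard input.
list_hpp_devices() {
    awk '
        /^[^ ]/ { dev = $1 }                          # unindented line: device identifier
        /^ +Multipath Plugin: *HPP *$/ { print dev }  # indented field naming the plugin
    '
}

# Assumed usage on an ESXi host:
#   esxcli storage core device list | list_hpp_devices
```

If the function prints nothing, no device in the supplied listing is reported as claimed by HPP.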

Workaround:

To work around the issue, add claim rules so that the devices are claimed by NMP instead of HPP. Add a driver-based, vendor/model-based, or transport-based claim rule, as shown below:

esxcli storage core claimrule add -r 102 -t driver -D <DRIVER NAME> -P NMP

OR
esxcli storage core claimrule add -r 102 -t vendor -V <VENDOR NAME> -M <MODEL NAME> -P NMP

OR
esxcli storage core claimrule add -r 102 -t transport -R fc -P NMP

After adding the claim rule, either unclaim and reclaim the devices if they are not in use (steps below), or reboot the host for the claim rule to take effect:

esxcli storage core claimrule load
esxcli storage core claiming unclaim -t driver -D <DRIVER NAME>

OR
esxcli storage core claiming unclaim -t vendor -V <VENDOR NAME> -M <MODEL NAME>

OR
esxcli storage core claiming unclaim -t device -d <device name>

esxcli storage core claimrule run
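
The vendor/model variant of the steps above can be sketched as one small script. This is an illustration under stated assumptions, not a supported tool: the function names `run` and `apply_nmp_workaround` are hypothetical, the script uses only the commands and options already shown in this article, and it defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# DRY_RUN=1 (default) prints each command; set DRY_RUN=0 on the host to execute.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
# Note: the dry-run output does not preserve quoting of values with spaces.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: apply_nmp_workaround RULE_ID VENDOR MODEL   (hypothetical helper)
apply_nmp_workaround() {
    rule_id="$1"; vendor="$2"; model="$3"
    # Add the rule so matching devices are claimed by NMP instead of HPP.
    run esxcli storage core claimrule add -r "$rule_id" -t vendor -V "$vendor" -M "$model" -P NMP
    # Load the rule, release the devices (only if not in use), then reclaim.
    run esxcli storage core claimrule load
    run esxcli storage core claiming unclaim -t vendor -V "$vendor" -M "$model"
    run esxcli storage core claimrule run
}
```

Reviewing the dry-run output before setting `DRY_RUN=0` gives a chance to confirm the rule number and the vendor and model strings. The unclaim step still applies only to devices that are not in use; otherwise a reboot is needed, as noted above.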
