ESXi host becomes unresponsive following repeated PDL errors on paths

Products

VMware vSphere ESXi

Issue/Introduction

An ESXi host becomes unresponsive in vCenter, or appears to have lost network connectivity to vCenter.

/var/log/vmkernel.log reports SCSI warnings against the paths for one or more devices, and these warnings indicate permanent device loss. For example, the logs may report the SCSI code H:0x1 ("no connection") or "logical unit not supported" (H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0).
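
To make the status format above concrete, here is a minimal Python sketch that extracts the H/D/P values and any sense data from a log fragment, then checks them against the two examples named in this article. The names `ScsiStatus`, `parse_status` and `matches_article_examples` are hypothetical helpers, not part of any VMware tool. The check covers only those two examples and is not a complete list of PDL conditions.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Matches the "H:0x.. D:0x.. P:0x.." triple and, if present, the
# "Valid sense data: key asc ascq" suffix, as seen in the excerpts here.
_STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+)\s+D:(0x[0-9a-fA-F]+)\s+P:(0x[0-9a-fA-F]+)"
    r"(?:\s*\.?\s*Valid sense data:\s*(0x[0-9a-fA-F]+)\s+"
    r"(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+))?"
)


class ScsiStatus(NamedTuple):
    host: int                                # H: host status
    device: int                              # D: device status
    plugin: int                              # P: plugin status
    sense: Optional[Tuple[int, int, int]]    # (sense key, ASC, ASCQ) if logged


def parse_status(text: str) -> Optional[ScsiStatus]:
    """Return the first status triple found in text, or None."""
    m = _STATUS_RE.search(text)
    if m is None:
        return None
    host, device, plugin = (int(m.group(i), 16) for i in (1, 2, 3))
    sense = None
    if m.group(4) is not None:
        sense = tuple(int(m.group(i), 16) for i in (4, 5, 6))
    return ScsiStatus(host, device, plugin, sense)


def matches_article_examples(status: ScsiStatus) -> bool:
    """True only for the two example codes quoted in this article."""
    no_connection = status.host == 0x1
    lun_not_supported = (
        status.device == 0x2 and status.sense == (0x5, 0x25, 0x0)
    )
    return no_connection or lun_not_supported
```

For instance, `parse_status("H:0x1 D:0x0 P:0x0 . Act:FAILOVER")` yields a status with `host == 1` and no sense data, which `matches_article_examples` flags as the "no connection" case.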

The device does not enter a Permanent Device Loss (PDL) state. Instead, failover repeats from one path to the next, and each path reports a SCSI code that translates to Permanent Device Loss.

Environment

VMware vSphere ESXi 6.5
VMware vSphere ESXi 6.7
VMware vSphere ESXi 7.0.x
VMware vSphere ESXi 8.0.x

Cause

When failover is triggered from one path to another, ESXi tests the next path before sending I/O down the path.

With this issue, PDL is reported on the current path, e.g.:

vmkernel: cpu48:2100082 opID=4baa590c)NMP: nmp_ThrottleLogForDevice:3845: Cmd 0x5f/0x6 (0x45da91466f00, 0) to dev "naa.################################" on path "vmhba#:C#:T#:L##" Failed:
vmkernel: cpu48:2100082 opID=4baa590c)NMP: nmp_ThrottleLogForDevice:3852: H:0x1 D:0x0 P:0x0 . Act:FAILOVER. cmdId.initiator=0x430b02633dd0 CmdSN 0x186c3cd

ESXi tests the next available path and the test is successful:

vmkwarning: cpu48:2100082 opID=4baa590c)WARNING: NMP: nmp_DeviceRetryCommand:130: Device "naa.################################": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device

However, when ESXi then retries I/O to the LUN on that path, the I/O fails again with H:0x1 (no connection), and failover is triggered to the next path:

2024-08-30T12:30:04.127Z Wa(180) vmkwarning: cpu17:2097913)WARNING: NMP: nmpDeviceAttemptFailover:644: Retry world failover device "naa.################################" - issuing command 0x45da91466f00
vmkwarning: cpu17:2097913)WARNING: NMP: nmpCompleteRetryForPath:356: Retry cmd 0x5f (0x45da91466f00) to dev "################################" failed on path "vmhba#:C#:T#:L##" H:0x1 D:0x0 P:0x0 .
vmkernel: cpu48:2100082 opID=4baa590c)NMP: nmp_ThrottleLogForDevice:3845: Cmd 0x5f/0x6 (0x45da91466f00, 0) to dev "naa.################################" on path "vmhba#:C#:T#:L##" Failed:
vmkernel: cpu48:2100082 opID=4baa590c)NMP: nmp_ThrottleLogForDevice:3852: H:0x1 D:0x0 P:0x0 . Act:FAILOVER. cmdId.initiator=0x430b02633dd0 CmdSN 0x186c3cd

Again, the new path is tested and the test succeeds, but the subsequent I/O fails, and the cycle repeats:

vmkwarning: cpu48:2100082 opID=4baa590c)WARNING: NMP: nmp_DeviceRetryCommand:130: Device "naa.################################": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device
vmkwarning: cpu48:2100082 opID=4baa590c)WARNING: NMP: nmp_DeviceStartLoop:790: NMP Device "naa.################################" is blocked. Not starting I/O from device.
vmkwarning: cpu17:2097913)WARNING: NMP: nmpDeviceAttemptFailover:644: Retry world failover device "naa.################################" - issuing command 0x45da91466f00
vmkwarning: cpu17:2097913)WARNING: NMP: nmpCompleteRetryForPath:356: Retry cmd 0x5f (0x45da91466f00) to dev "naa.################################" failed on path "vmhba##:C#:T#:L##" H:0x1 D:0x0 P:0x0.

As a result, the LUN never enters a PDL state, but repeated failovers are triggered between LUN paths.
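
As a rough way to spot this pattern in a saved copy of vmkernel.log, the following Python sketch collects, per device, the set of paths on which a retry failed, and lists the devices whose retries failed on more than one path. It assumes the nmpCompleteRetryForPath message layout shown in the excerpts above. The function names and the `min_paths` threshold are hypothetical. Treat it as a heuristic aid for triage, not a definitive diagnosis.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Assumes the message layout seen in this article's excerpts:
#   nmpCompleteRetryForPath:<n>: Retry cmd <op> (<addr>) to dev "<device>"
#   failed on path "<path>" H:... D:... P:...
_RETRY_FAIL_RE = re.compile(
    r'nmpCompleteRetryForPath:\d+: Retry cmd \S+ \(\S+\) to dev "([^"]+)" '
    r'failed on path "([^"]+)"'
)


def failed_paths_by_device(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each device to the set of paths on which a retry failed."""
    failed: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        m = _RETRY_FAIL_RE.search(line)
        if m:
            failed[m.group(1)].add(m.group(2))
    return dict(failed)


def devices_cycling(lines: Iterable[str], min_paths: int = 2) -> List[str]:
    """Devices whose retries failed on at least min_paths distinct paths."""
    by_device = failed_paths_by_device(lines)
    return sorted(
        dev for dev, paths in by_device.items() if len(paths) >= min_paths
    )

# Example use against a log copy (not run here):
#   with open("vmkernel.log") as f:
#       print(devices_cycling(f))
```

A device that appears in the result has failed retries on several paths, which is consistent with the failover cycle described here. Confirming the cause still requires reviewing the full log sequence and the array state.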

Over time, such repeated failed access to one or more LUNs may degrade ESXi performance and lead to a host becoming unresponsive in vCenter.

Resolution

This issue is caused by an unstable LUN state on the array: initial path tests succeed, which prevents the LUN from entering a Permanent Device Loss state, but all subsequent I/O fails with a Permanent Device Loss error.

Please engage your storage vendor support to resolve this issue.

Additional Information

On Permanent Device Loss, see Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere ESXi.

On the ESXi path failover sequence, see Understanding the storage path failover sequence in VMware ESXi native multipathing.