ESXi 8.X storage paths fail to recover after maintenance or outage

Article ID: 388687


Products

VMware vSphere ESXi 8.x

Issue/Introduction

  • During FC maintenance (FC switch upgrades or failovers) or after an unexpected storage path outage, paths remain dead on the host even after storage connectivity is expected to be back up.

  • Newly presented FC LUNs are not visible.

  • Unable to unmount or delete VMFS datastores that are backed by FC LUNs.

  • A storage rescan of the host hangs, or it fails with the error: "An error occurred while communicating with the remote host."
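
Dead paths can also be confirmed from the ESXi shell, and a rescan can be triggered manually. A minimal example using standard esxcli commands (not specific to this article; output formats vary by build):

# List each path's runtime name and state; affected paths show "State: dead"
esxcli storage core path list | grep -iE "Runtime Name|State:"

# Trigger a rescan of all adapters (this may hang while the issue is present)
esxcli storage core adapter rescan --all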


/var/log/vmkernel.log reports entries similar to:

####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu50:22155124) WARNING: VMW_SATP_ALUA: satp_alua_getTargetPortInfo:190: Could not get page 83 INQUIRY data for path "vmhba0:C0:T##:L##" - Transient storage condition, suggest retry (195887294)
####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu7:22155124) WARNING: VMW_SATP_ALUA: satp_alua_getTargetPortInfo:190: Could not get page 83 INQUIRY data for path "vmhba0:C0:T##:L##" - Transient storage condition, suggest retry (195887294)

  • After the FC switch is brought back up, the corresponding vmhba paths should report coming online, but instead the paths stay down:

    /var/log/vmkernel.log reports entries similar to:

####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu59:2098620)WARNING: lpfc: lpfc_do_scr_ns_plogi:10301: vmhba# 3334 Delay fc port discovery for 10 seconds
####-##-##T##:##:##.###Z In(182) vmkernel: cpu59:2098620)lpfc: lpfc_cmpl_ct_cmd_fdmi:1830: vmhba# 0229 FDMI cmd 0211 failed, latt = 0 ulpStatus: x3, rid x20000004
####-##-##T##:##:##.###Z In(182) vmkernel: cpu44:2098620)lpfc: lpfc_issue_gidft:4096: vmhba# fc4 type 3
####-##-##T##:##:##.###Z In(182) vmkernel: cpu44:2098620)lpfc: lpfc_cmpl_els_prli:2425: vmhba# 0103 PRLI completes to NPort x81efc0 Data: x0 x81c800 x14 x0
####-##-##T##:##:##.###Z In(182) vmkernel: cpu44:2098620)lpfc: lpfc_cmpl_prli_prli_issue:1965: vmhba# 6082 FCP DID 81efc0 initiator 0 target 1
####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu44:2098620)WARNING: lpfc: lpfc_notify_paths_available:4531: vmhba# 3274 ScsiNotifyPathStateChangeAsyncSAdapter Num x0 TID x12, DID x81efc0.
####-##-##T##:##:##.###Z In(182) vmkernel: cpu44:2098620)lpfc: lpfc_cmpl_els_prli:2425: vmhba# 0103 PRLI completes to NPort x81ef80 Data: x0 x81c800 x14 x0
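
The host-side FC link state can be cross-checked from the ESXi shell; a quick check using a standard command (not specific to this article):

# Show FC adapters with WWNN/WWPN, speed, and port state
esxcli storage san fc list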


  • FPIN (Fabric Performance Impact Notifications) memory allocation errors are also observed in /var/log/vmkernel.log:

####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu43:2097930)WARNING: StorageFPIN: 521: Failed to allocate memory.
####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu51:2097930)WARNING: StorageFPIN: 521: Failed to allocate memory.
####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu51:2097930)WARNING: StorageFPIN: 521: Failed to allocate memory.
####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu111:234559778)WARNING: StorageFPIN: 521: Failed to allocate memory.
####-##-##T##:##:##.##Z Wa(180) vmkwarning: cpu111:234559778)WARNING: StorageFPIN: 286: Failed to allocate memory.
####-##-##T##:##:##.###Z Wa(180) vmkwarning: cpu83:234559864)WARNING: StorageFPIN: 521: Failed to allocate memory.
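
To quickly check whether a host is hitting these allocation failures, the current vmkernel log can be searched from the ESXi shell (a simple sketch; rotated logs such as vmkernel.0.gz are not covered):

# Count StorageFPIN allocation failures in the live vmkernel log
grep -c "StorageFPIN.*Failed to allocate memory" /var/log/vmkernel.log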

 

Environment

VMware vSphere ESXi 8.0 U2
VMware vSphere ESXi 8.0 U3

Cause

The FPIN (Fabric Performance Impact Notifications) capability was added in ESXi 8.0 U2 to provide better visibility into fabric-related issues. Due to a bug in the StorageFPIN code, when FPIN tries to allocate memory and fails, it can hold a reference count on the affected paths, which prevents the FC HBA driver from allocating new paths or re-establishing existing ones.

  • The available FPIN heap on the host can be checked with the following command:
    esxcfg-info -a | grep -A3 storageFPINHeap | grep -i Max
                   |----Max Size........................................5247400 bytes
                   |----Max Available...................................1120 bytes


For example, host 1 below has hit the issue and run out of FPIN heap:
|----Max Available...................................1120 bytes

Host 2 has not yet run out of heap:
|----Max Available...................................3219872 bytes
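
Both values can be pulled in one step; a convenience variant of the same command (the extra grep filter is an illustration, not from this article):

    # Print Max Size and Max Available for the storageFPINHeap side by side
    esxcfg-info -a | grep -A3 storageFPINHeap | grep -iE "Max (Size|Available)"

A host whose Max Available has collapsed to a few kilobytes or less, as on host 1 above, has likely hit the exhaustion state.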

Resolution

This is a known issue involving both FPIN and how the FC HBA driver behaves when paths are lost.
The FC HBA driver does not save storage port bindings, so when a storage path is re-established after an outage or path loss, the driver creates brand-new paths and increments the target numbers.
Because FPIN keeps a reference count on the old paths, the FC HBA driver is unable to establish the new paths.


A code fix that alters the FPIN reference count behavior will be available in the upcoming ESXi 8.0 P05 patch release.

Workaround:

To work around this issue, it is recommended to disable FPIN on ESXi 8.0 hosts where FC storage is connected:

  • Process to disable FPIN:

    On ESXi 8.0 U2:

    Reboot the ESXi host to clear the out-of-memory condition.
    Then disable FPIN by running the following command in an ESXi SSH session: vsish -e set /storage/fpin/info 0
    To confirm the setting, run: vsish -e get /storage/fpin/info
  • Note: This command is not persistent across reboots.
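
    Because the vsish setting does not survive a reboot, one common way to reapply it automatically (a sketch using the standard ESXi local.sh boot hook; not part of this article) is to append the command to /etc/rc.local.d/local.sh:

    # In /etc/rc.local.d/local.sh, before the final 'exit 0':
    # Disable FPIN at boot on ESXi 8.0 U2 (workaround for this issue)
    vsish -e set /storage/fpin/info 0

    Remove this entry once the host runs a build that contains the fix.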



    On ESXi 8.0 U3:

    To disable FPIN, run the following command in an ESXi SSH session: esxcli storage fpin info set -e false
    To confirm the setting, run: esxcli storage fpin info get

    Output:

    "FPIN Feature: false" means FPIN is disabled.
    [root@Pxxx:~] esxcli storage fpin info get
       FPIN Feature: false
       Total HBAs: xx
       FPIN Supported HBAs: xx


  • Note: This setting change does not require a reboot; however, if an ESXi host is already in a memory heap exhaustion state for storageFPINHeap, the host must be rebooted after making this change.
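
    After such a reboot, the heap state from the Cause section can be re-checked to confirm the host is no longer exhausted:

    esxcfg-info -a | grep -A3 storageFPINHeap | grep -i Max

    Max Available should again report a healthy value, as on host 2 in the example above.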

Additional Information