Temporary/transient storage path loss on ESXi 8.0 could result in paths not coming back when using Cisco UCS and NFNIC

Article ID: 380321

Updated On: 03-15-2025

Products

VMware vSphere ESXi 8.0

Issue/Introduction

During SAN maintenance, or after an unexpected storage path outage, the NFNIC driver may be unable to add the paths back once the storage path is restored. The following sequence repeats continuously in /var/log/vmkernel.log:

WARNING: nfnic: <2>: fnic_handle_report_lun: 1467: lun add failure! in_remove: 0 ioAllowed: 1
WARNING: nfnic: <2>: fnic_tport_event_handler: 2130: lunmap update failed,retry ..
nfnic: <2>: INFO: fnic_handle_report_lun: 1380: Report luns response for target_fcid : 0xaf01e0 target_id:283 num_luns 10
WARNING: nfnic: <2>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY

You will also observe the following events related to StorageFPIN:

WARNING: StorageFPIN: 521: Failed to allocate memory.
WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.

Example entries from /var/log/vmkernel.log:

2025-03-15T17:32:48.078Z Wa(180) vmkwarning: cpu20:2097755)WARNING: nfnic: <1>: fnic_handle_report_lun: 1465: lun add failure! in_remove: 0 ioAllowed: 1
2025-03-15T17:32:48.078Z Wa(180) vmkwarning: cpu20:2097755)WARNING: nfnic: <1>: fnic_tport_event_handler: 2129: lunmap update failed,retry ..
2025-03-15T17:32:48.078Z In(182) vmkernel: cpu20:2097755)nfnic: <1>: INFO: fnic_handle_report_lun: 1379: Report luns response for target_fcid : 0xa02a1 target_id:72 num_luns 8
2025-03-15T17:32:50.023Z Wa(180) vmkwarning: cpu8:2097762)WARNING: nfnic: <2>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
2025-03-15T17:32:50.078Z Wa(180) vmkwarning: cpu20:2097755)WARNING: nfnic: <1>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
2025-03-15T17:32:50.835Z In(182) vmkernel: cpu8:89469270)zdriver: _zmod_periodic:348:  #0- logs are not pulled
2025-03-15T17:32:51.023Z Wa(180) vmkwarning: cpu8:2097762)WARNING: nfnic: <2>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
2025-03-15T17:32:51.078Z Wa(180) vmkwarning: cpu20:2097755)WARNING: nfnic: <1>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
2025-03-15T17:32:51.835Z In(182) vmkernel: cpu8:89469270)zdriver: _zmod_periodic:348:  #0- logs are not pulled

2025-03-15T12:18:17.256Z Wa(180) vmkwarning: cpu27:2097450)WARNING: StorageFPIN: 521: Failed to allocate memory.
2025-03-15T13:18:17.587Z Wa(180) vmkwarning: cpu27:2097450)WARNING: StorageFPIN: 521: Failed to allocate memory.
2025-03-15T14:18:17.918Z Wa(180) vmkwarning: cpu21:2097450)WARNING: StorageFPIN: 521: Failed to allocate memory.
2025-03-15T15:18:18.249Z Wa(180) vmkwarning: cpu23:2097450)WARNING: StorageFPIN: 521: Failed to allocate memory.
 
2025-03-15T12:18:17.256Z Wa(180) vmkwarning: cpu27:2097450)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
2025-03-15T13:18:17.587Z Wa(180) vmkwarning: cpu27:2097450)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
2025-03-15T14:18:17.918Z Wa(180) vmkwarning: cpu21:2097450)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
2025-03-15T15:18:18.249Z Wa(180) vmkwarning: cpu23:2097450)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
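To check whether a host shows both halves of this signature (the NFNIC retry loop and the StorageFPIN heap warnings), a minimal sketch in POSIX shell follows. It assumes a busybox-style `sh`/`grep` and the default log path /var/log/vmkernel.log. The function name `check_fpin_nfnic_signature` is hypothetical and is not a VMware tool. It only matches the message text quoted in this article.

```shell
# Hypothetical helper (not a VMware tool): counts the NFNIC and StorageFPIN
# signature messages described in this article in a vmkernel log.
# Usage: check_fpin_nfnic_signature [logfile]  (default: /var/log/vmkernel.log)
# Returns 0 only if both NFNIC and FPIN messages are present.
check_fpin_nfnic_signature() {
    log="${1:-/var/log/vmkernel.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # NFNIC side: LUN add failures and BUSY returns from the scan/claim call.
    nfnic=$(grep -c -e 'lun add failure' \
        -e 'vmk_ScsiScanAndClaimPaths returned BUSY' "$log" || true)
    # FPIN side: allocation failures and the heap-at-maximum warning.
    fpin=$(grep -c -e 'StorageFPIN: .*Failed to allocate memory' \
        -e 'Heap storageFPINHeap already at its maximum size' "$log" || true)
    echo "nfnic=$nfnic fpin=$fpin"
    [ "$nfnic" -gt 0 ] && [ "$fpin" -gt 0 ]
}
```

The helper only reads the current log file. Rotated, compressed logs would need to be decompressed first.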
 


You can check the available storageFPINHeap memory with the following command. A healthy host shows around 5246448 bytes available, while an impacted host shows significantly less free space, sometimes 16 KB or less.

esxcfg-info -a |grep -A3 storageFPINHeap|grep "Max Available"


Example:


Host-1 has run out of storageFPINHeap:

|----Max Available...................................416 bytes

Host-2 has not run out of storageFPINHeap:

|----Max Available...................................3219872 bytes
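The output of the command above can be classified automatically. The sketch below reads the "Max Available" line on stdin, strips the leader dots, and compares the byte count to a threshold. The function name `classify_fpin_heap` is hypothetical. The 1 MiB default threshold is an assumption: it sits between the impacted (16 KB or less) and healthy (around 5 MB) values given above, and it is not a VMware-published limit.

```shell
# Hypothetical helper: reads "Max Available" lines (as printed by the
# esxcfg-info | grep pipeline in this article) on stdin and classifies
# storageFPINHeap. Only the first matching line is evaluated.
# The 1 MiB default threshold is an assumption, not a VMware value.
# Exit status: 0 = OK, 1 = EXHAUSTED, 2 = no "Max Available" line found.
classify_fpin_heap() {
    threshold="${1:-1048576}"
    awk -v t="$threshold" '
        /Max Available/ {
            # The byte count is the next-to-last field, e.g. "...416 bytes".
            n = $(NF - 1)
            # Strip the leader dots that are glued to the number.
            gsub(/[^0-9]/, "", n)
            found = 1
            if (n + 0 < t + 0) { print "EXHAUSTED " n; exit 1 }
            print "OK " n
            exit 0
        }
        END { if (!found) { print "UNKNOWN"; exit 2 } }
    '
}
```

On a host, it would be used as `esxcfg-info -a | grep -A3 storageFPINHeap | grep "Max Available" | classify_fpin_heap`. With the sample values above, Host-1 is reported as `EXHAUSTED 416` and Host-2 as `OK 3219872`.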

Environment

vSphere 8.X, ESXi 8.X

Cause

FPIN (Fabric Performance Impact Notification) support was added in ESXi 8.0 U2 to help diagnose fabric-related issues. Due to a bug in the StorageFPIN code, when FPIN fails to allocate memory it can keep holding a reference count on the paths. This prevents the Cisco NFNIC driver from allocating new paths or re-establishing existing ones.

Resolution

This is a known issue involving both FPIN and the way the Cisco NFNIC driver is coded to behave after path loss. The NFNIC driver does not save storage port bindings. When a storage path is re-established after an outage or path loss, the driver creates brand-new paths with incremented target numbers. Because the FPIN bug holds a reference count on those paths, the Cisco NFNIC driver is unable to establish the new paths.

There are two permanent fixes for this issue:

  • A code fix that changes the FPIN open reference count behavior will be available in an upcoming ESXi 8.x release (ESXi 8.0 P05).
  • Cisco will release NFNIC driver 5.0.0.46, which changes the driver behavior to use fixed Target IDs: https://bst.cisco.com/quickview/bug/CSCwn00553

Either fix resolves the storage path recovery issue.
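To check whether a host already runs the fixed Cisco driver, you can compare the installed NFNIC version against 5.0.0.46. The sketch below covers only the comparison. The function name `version_at_least` is hypothetical. It assumes VIB versions of the form `X.Y.Z.N-<suffix>` and compares only the dotted numeric part before the first `-`.

```shell
# Hypothetical helper: succeeds (exit 0) when dotted version $1 is >= $2.
# Only the numeric part before the first "-" is compared, so a VIB version
# such as "5.0.0.46-<suffix>" is treated as 5.0.0.46.
# Missing components count as 0 (so 5.0 equals 5.0.0.0).
version_at_least() {
    awk -v a="${1%%-*}" -v b="${2%%-*}" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}
```

On a host, the installed version could be passed in with something like `version_at_least "$(esxcli software vib list | awk '$1 == "nfnic" {print $2}')" 5.0.0.46`. This assumes that the VIB is named `nfnic` and that its version appears in the second column of that listing. Confirm both on your own host first.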


To work around this issue, we recommend disabling FPIN on ESXi 8.0 hosts, especially hosts that use Cisco UCS and NFNIC:

 

For ESXi 8.0 U3 and newer, please use the following command:

esxcli storage fpin info set -e false

To confirm the setting:

esxcli storage fpin info get

NOTE: This setting change does not require a reboot on its own. However, if an ESXi host has already exhausted its storageFPINHeap memory, you must reboot the host after making this change.


VMware ESXi 8.0 Update 3 Release Notes

Fabric Notification support for SAN clusters:

ESXi 8.0 Update 3 introduces support for Fabric Performance Impact Notifications Link Integrity (FPIN-LI). With FPIN-LI, the vSphere infrastructure layer can manage notifications from SAN switches or targets, identifying degraded SAN links and ensuring only healthy paths are used for storage devices. FPIN can also notify ESXi hosts for storage link congestion and errors.

Support for Fibre Channel Extended Link Services (FC-ELS):

With vSphere 8.0 Update 3, you can use the command esxcli storage fpin info set -e <true/false> to activate or deactivate Fabric Performance Impact Notification (FPIN). The command saves the FPIN setting to both ConfigStore and the VMkernel System Interface Shell, and the setting persists across ESXi reboots. FPIN is supported by both Broadcom’s lpfc and Marvell’s qlnativefc drivers.

For ESXi 8.0 U2 and earlier, use the following command:

vsish -e set /storage/fpin/info 0

NOTE: This vsish command is NOT persistent across reboots. For this reason, we recommend upgrading to ESXi 8.0 U3 and then disabling FPIN with the esxcli command.
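Which workaround command applies depends on the ESXi 8.0 update level. The sketch below maps a version string to the command this article gives for that release. It prints the command and does not run it. The function name `fpin_disable_cmd` is hypothetical. It assumes the version string looks like the output of `vmware -v` (for example `VMware ESXi 8.0.3 build-…`), where the third version component is the update number.

```shell
# Hypothetical helper: prints the FPIN-disable command given in this article
# for an ESXi 8.0 version string. Assumed input format (as from `vmware -v`):
# "VMware ESXi 8.0.<update> build-<number>". Nothing is executed.
fpin_disable_cmd() {
    # Pick the first X.Y.Z token out of the string.
    ver=$(printf '%s\n' "$1" | awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+$/) { print $i; exit }
    }')
    major=${ver%%.*}
    rest=${ver#*.}
    minor=${rest%%.*}
    update=${rest#*.}
    if [ -z "$ver" ] || [ "$major" != 8 ] || [ "$minor" != 0 ]; then
        echo "unsupported or unrecognized version: $1" >&2
        return 2
    fi
    if [ "$update" -ge 3 ]; then
        # 8.0 U3 and newer: persistent setting via esxcli.
        echo "esxcli storage fpin info set -e false"
    else
        # 8.0 U2 and earlier: vsish, NOT persistent across reboots.
        echo "vsish -e set /storage/fpin/info 0"
    fi
}
```

For example, `fpin_disable_cmd "$(vmware -v)"` prints the matching command, which you can review and then run by hand. On 8.0 U3 and newer, confirm the result with `esxcli storage fpin info get`. The helper intentionally refuses non-8.0 versions, because this article only covers ESXi 8.0.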