During SAN maintenance or after an unexpected storage path outage, the NFNIC driver is unable to add the affected paths back once the storage path is restored. You will observe the following sequence repeating continuously in /var/log/vmkernel.log:
WARNING: nfnic: <2>: fnic_handle_report_lun: 1467: lun add failure! in_remove: 0 ioAllowed: 1
WARNING: nfnic: <2>: fnic_tport_event_handler: 2130: lunmap update failed,retry ..
nfnic: <2>: INFO: fnic_handle_report_lun: 1380: Report luns response for target_fcid : 0xaf01e0 target_id:283 num_luns 10
WARNING: nfnic: <2>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
You will also observe the following events related to StorageFPIN:
WARNING: StorageFPIN: 521: Failed to allocate memory.
WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
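To check quickly whether a host is showing these symptoms, the messages quoted above can be counted in the log. The following is a minimal sketch, not an official tool: the function names `count_matches` and `report_symptoms` are hypothetical, and the patterns leave out the source line numbers (such as 1467 or 521) on the assumption that they may differ between driver and ESXi builds.

```shell
#!/bin/sh
# Sketch: count the symptom messages from this article in a vmkernel log.
# Function names are illustrative only; they are not part of ESXi.

count_matches() {
    # $1 = basic regular expression, $2 = log file
    # grep -c prints the number of matching lines (0 if none).
    grep -c -e "$1" -- "$2" 2>/dev/null || true
}

report_symptoms() {
    # Default to the log path named in this article; pass another file as $1.
    log="${1:-/var/log/vmkernel.log}"
    echo "lun_add_failure=$(count_matches 'lun add failure!' "$log")"
    echo "scan_busy=$(count_matches 'vmk_ScsiScanAndClaimPaths returned BUSY' "$log")"
    echo "fpin_alloc_failed=$(count_matches 'StorageFPIN: [0-9]*: Failed to allocate memory' "$log")"
    echo "fpin_heap_full=$(count_matches 'Heap storageFPINHeap already at its maximum size' "$log")"
}
```

Calling `report_symptoms` with no argument reads /var/log/vmkernel.log. Non-zero counts for both the nfnic and the StorageFPIN messages are consistent with the issue described here, but they only reflect what is still present in the current log file.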
FPIN (Fabric Performance Impact Notifications) capability was added in ESXi 8.0 U2 to provide better insight into fabric-related issues. Due to a bug in the StorageFPIN code, when FPIN fails to allocate memory, it can hold a reference count on the paths, which prevents the Cisco NFNIC driver from allocating new paths or re-establishing existing ones.
This is a known issue involving both FPIN and the way the Cisco NFNIC driver is coded to behave when paths are lost. The NFNIC driver does not save storage port bindings, so when a storage path is re-established after an outage or path loss, the driver simply creates brand new paths and increments the target numbers. Because the FPIN bug keeps a reference count on those paths, the Cisco NFNIC driver is unable to establish new paths.
A code fix to alter the FPIN open reference count behavior will be available in an upcoming ESXi 8.x release.
Cisco will be releasing NFNIC driver 5.0.0.46, which changes the driver behavior to use fixed Target IDs: https://bst.cisco.com/quickview/bug/CSCwn00553
To work around this issue, it is recommended to disable FPIN on ESXi 8.0 hosts, especially those using Cisco UCS and NFNIC:
esxcli storage fpin info set -e false
To confirm the setting:
esxcli storage fpin info get
Note: This setting change does not require a reboot on its own. However, if an ESXi host is already in a memory heap exhaustion state for storageFPINHeap, the host must be rebooted after making this setting change.
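The workaround and the reboot decision can be combined into one small helper. This is a sketch under stated assumptions rather than a supported procedure: the function names `fpin_heap_exhausted` and `apply_fpin_workaround` are hypothetical, the only esxcli invocations used are the two given in this article, and the presence of the storageFPINHeap message in the log is used as a rough indicator that the heap is exhausted. A log that still holds entries from before an earlier reboot could produce a false positive.

```shell
#!/bin/sh
# Sketch: disable FPIN, show the resulting setting, and flag a likely reboot.
# Function names are illustrative only; they are not part of ESXi.

fpin_heap_exhausted() {
    # Succeeds (exit 0) if the log shows storageFPINHeap at its maximum size.
    grep -q 'Heap storageFPINHeap already at its maximum size' "$1" 2>/dev/null
}

apply_fpin_workaround() {
    # Default to the log path named in this article; pass another file as $1.
    log="${1:-/var/log/vmkernel.log}"

    # Disable FPIN, then print the setting to confirm it (commands from this article).
    esxcli storage fpin info set -e false || return 1
    esxcli storage fpin info get

    # Assumption: a logged heap-exhaustion message means a reboot is needed.
    if fpin_heap_exhausted "$log"; then
        echo "REBOOT_RECOMMENDED: storageFPINHeap exhaustion found in $log"
    else
        echo "NO_REBOOT_INDICATED: no storageFPINHeap exhaustion found in $log"
    fi
}
```

Running `apply_fpin_workaround` on the host applies the setting and prints one advisory line. The helper never reboots the host itself, so that decision stays with the administrator.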