Temporary/transient storage path loss on Host could result in paths not coming back when using Cisco UCS and NFNIC

Products

VMware vSphere ESXi 8.0

Issue/Introduction

After performing an unmount and detach of a datastore, the HBA rescan task fails with the error 'an error occurred while communicating with the remote host'.

During SAN maintenance or an unexpected storage path outage, the NFNIC driver is unable to add paths back. The following sequence is logged continuously in /var/log/vmkernel.log:

WARNING: nfnic: <2>: fnic_handle_report_lun: 1467: lun add failure! in_remove: 0 ioAllowed: 1
WARNING: nfnic: <2>: fnic_tport_event_handler: 2130: lunmap update failed,retry ..
nfnic: <2>: INFO: fnic_handle_report_lun: 1380: Report luns response for target_fcid : 0xaf01e0 target_id:283 num_luns 10
WARNING: nfnic: <2>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY

Note the following events related to StorageFPIN:

WARNING: StorageFPIN: 521: Failed to allocate memory.
WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.

The following additional messages may also appear in /var/log/vmkernel.log:

YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu20:2097755)WARNING: nfnic: <1>: fnic_handle_report_lun: 1465: lun add failure! in_remove: 0 ioAllowed: 1
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu20:2097755)WARNING: nfnic: <1>: fnic_tport_event_handler: 2129: lunmap update failed,retry ..
YYYY-MM-DDTHH:MM:SSZ In(182) vmkernel: cpu20:2097755)nfnic: <1>: INFO: fnic_handle_report_lun: 1379: Report luns response for target_fcid : 0xa02a1 target_id:72 num_luns 8
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu8:2097762)WARNING: nfnic: <2>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu20:2097755)WARNING: nfnic: <1>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
YYYY-MM-DDTHH:MM:SSZ In(182) vmkernel: cpu8:89469270)zdriver: _zmod_periodic:348:  #0- logs are not pulled
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu8:2097762)WARNING: nfnic: <2>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu20:2097755)WARNING: nfnic: <1>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
YYYY-MM-DDTHH:MM:SSZ In(182) vmkernel: cpu8:89469270)zdriver: _zmod_periodic:348:  #0- logs are not pulled

YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu27:2097450)WARNING: StorageFPIN: 521: Failed to allocate memory.
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu27:2097450)WARNING: StorageFPIN: 521: Failed to allocate memory.
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu21:2097450)WARNING: StorageFPIN: 521: Failed to allocate memory.
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu23:2097450)WARNING: StorageFPIN: 521: Failed to allocate memory.
 
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu27:2097450)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu27:2097450)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu21:2097450)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu23:2097450)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
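
As a rough aid for spotting the messages above, the following sketch counts how often each signature appears in a log file. The function names count_lines and count_fpin_signatures and the output labels are hypothetical and not part of ESXi. The patterns deliberately omit the source line numbers (for example 1442 or 521) because those differ between the samples shown above.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with ESXi): count the log signatures
# described in this article in a given log file.

# Print the number of lines in file $1 that match pattern $2.
# grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
count_lines() {
    grep -c -- "$2" "$1" || true
}

# Print one "label=count" line per signature.
count_fpin_signatures() {
    log_file="$1"
    printf 'scan_busy=%s\n' \
        "$(count_lines "$log_file" 'vmk_ScsiScanAndClaimPaths returned BUSY')"
    printf 'lun_add_failure=%s\n' \
        "$(count_lines "$log_file" 'lun add failure')"
    printf 'fpin_alloc_failed=%s\n' \
        "$(count_lines "$log_file" 'StorageFPIN: .*Failed to allocate memory')"
    printf 'fpin_heap_maxed=%s\n' \
        "$(count_lines "$log_file" 'Heap storageFPINHeap already at its maximum size')"
}
```

On a host, this could be run as count_fpin_signatures /var/log/vmkernel.log. Non-zero counts for the StorageFPIN and Heap lines together with the nfnic BUSY lines match the pattern described in this article, but the counts alone are not a diagnosis.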

Check the available FPIN heap with the following command. A healthy host has around 5246448 bytes (about 5 MB) available, while an impacted host shows significantly less free space, sometimes 16 KB or less. Less than 1 MB of available FPIN heap memory indicates that the FPIN heap is exhausted on the host.

esxcfg-info -a |grep -A3 storageFPINHeap|grep "Max Available"

Example:

Host-1 shows that it has run out of FPIN heap.

|----Max Available...................................416 bytes

Host-2 shows that it has not run out of FPIN heap.

|----Max Available...................................3219872 bytes
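
The 1 MB rule of thumb stated in this article can be applied to the "Max Available" line with a small filter. This is a minimal sketch: the function name fpin_heap_status, the variable FPIN_HEAP_MIN_BYTES, and the output wording are hypothetical. It also assumes the line has the dotted layout shown in the Host-1 and Host-2 examples.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with ESXi).
# Threshold taken from this article: less than 1 MB available
# is treated as an exhausted FPIN heap.
FPIN_HEAP_MIN_BYTES=1048576

# Read a "Max Available" line on stdin and classify it.
fpin_heap_status() {
    # Extract the byte count that follows the dot leader.
    avail=$(sed -n 's/.*Max Available[.]*\([0-9][0-9]*\) bytes.*/\1/p' | head -n 1)
    if [ -z "$avail" ]; then
        echo "unknown"
        return 2
    fi
    if [ "$avail" -lt "$FPIN_HEAP_MIN_BYTES" ]; then
        echo "exhausted ($avail bytes)"
    else
        echo "ok ($avail bytes)"
    fi
}
```

On a host, the output of the command in this section can be piped into it: esxcfg-info -a |grep -A3 storageFPINHeap|grep "Max Available" | fpin_heap_status. With the two example lines above, Host-1 is reported as exhausted and Host-2 as ok.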

Environment

8.x

Cause

FPIN (Fabric Performance Impact Notifications) capability was added to ESXi 8.0 U2 to be able to better understand fabric related issues. Due to a bug in the StorageFPIN code, when FPIN tries to allocate memory and is unable to, it can hold onto a reference count on the paths which prevents the Cisco NFNIC driver from being able to allocate new paths or re-establish existing ones.

Resolution

This is a known issue with both FPIN as well as how the Cisco NFNIC driver is coded to behave when there are paths lost. The NFNIC driver does not save storage port bindings so when a storage path reestablishes after an outage or path loss, it will simply create brand new paths and increment target numbers. Because of the bug with FPIN keeping a reference count on those paths, the Cisco NFNIC driver is unable to establish new paths.

Fix:

There is currently only one fix for this issue:

A code fix that alters the FPIN open reference count behavior is now available in ESXi 8.0 U3e (build 24674464).

Cisco will release a future NFNIC driver that changes the driver behavior to use fixed target IDs, which will also be considered a fix when released. This KB will be updated to reflect that release when available.

Workaround:

To work around this issue, it is recommended to disable FPIN on ESXi 8.0 hosts, especially when using Cisco UCS and NFNIC:

ESXi 8.0 U3 versions below ESXi 8.0 U3e (build 24674464)
- Use the following command:
  esxcli storage fpin info set -e false

- To confirm the setting:
  esxcli storage fpin info get

NOTE: This setting change does not require a reboot on its own. However, if an ESXi host is already in a memory heap exhaustion state for storageFPINHeap, the host must be rebooted after making this setting change.

- Fabric Notification support for SAN clusters:
  ESXi 8.0 Update 3 introduces support for Fabric Performance Impact Notifications Link Integrity (FPIN-LI). With FPIN-LI, the vSphere infrastructure layer can manage notifications from SAN switches or targets, identifying degraded SAN links and ensuring only healthy paths are used for storage devices. FPIN can also notify ESXi hosts for storage link congestion and errors.
- Support for Fibre Channel Extended Link Services (FC-ELS):
  With vSphere 8.0 Update 3, use the command esxcli storage fpin info set -e=<true/false> to activate or deactivate the Fabric Performance Impact Notification (FPIN). The command saves the FPIN activation to both ConfigStore and the VMkernel System Interface Shell and persists across ESXi reboots. This is enabled by both Broadcom’s lpfc and Marvell’s qlnativefc drivers.

ESXi 8.0 U2 and prior

- Use the following command:
  vsish -e set /storage/fpin/info 0

NOTE: This vsish command is NOT persistent across reboots. We therefore recommend either upgrading to ESXi 8.0 U3 and then disabling FPIN, or rebooting the host first and then running the command to disable FPIN.