Temporary/transient storage path loss on ESXi 8.0 could result in paths not coming back when using Cisco UCS and NFNIC

Article ID: 380321

Updated On:

Products

VMware vSphere ESXi 8.0

Issue/Introduction

During SAN maintenance or after an unexpected storage path outage, the NFNIC driver is unable to add the affected paths back once the storage path is up again. You will observe the following sequence repeating continuously in /var/log/vmkernel.log:

WARNING: nfnic: <2>: fnic_handle_report_lun: 1467: lun add failure! in_remove: 0 ioAllowed: 1
WARNING: nfnic: <2>: fnic_tport_event_handler: 2130: lunmap update failed,retry ..
nfnic: <2>: INFO: fnic_handle_report_lun: 1380: Report luns response for target_fcid : 0xaf01e0 target_id:283 num_luns 10
WARNING: nfnic: <2>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY

You will also observe the following events related to StorageFPIN:

WARNING: StorageFPIN: 521: Failed to allocate memory.
WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
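
To check whether a host is showing these symptoms, the log can be searched for the messages listed above. The following is a minimal sketch rather than a supported tool: the function name count_fpin_signatures is hypothetical, and the match strings deliberately omit the source line numbers (such as 1467 or 521) on the assumption that they may differ between builds.

```shell
#!/bin/sh
# Count occurrences of the log signatures from this article in a vmkernel log.
# count_fpin_signatures is a hypothetical helper name, not a VMware tool.
count_fpin_signatures() {
    logfile="$1"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 2
    fi

    # Fixed-string matches (-F) taken from the messages shown above.
    for sig in \
        "lun add failure!" \
        "lunmap update failed" \
        "vmk_ScsiScanAndClaimPaths returned BUSY" \
        "Heap storageFPINHeap already at its maximum size"
    do
        # grep -c prints 0 and exits non-zero when nothing matches.
        count=$(grep -c -F -- "$sig" "$logfile" || true)
        echo "$count  $sig"
    done

    # The allocation failure text is generic, so require the StorageFPIN tag too.
    count=$(grep -F -- "StorageFPIN:" "$logfile" \
        | grep -c -F -- "Failed to allocate memory" || true)
    echo "$count  StorageFPIN: Failed to allocate memory"
}
```

On a host, this could be run as count_fpin_signatures /var/log/vmkernel.log. Non-zero counts for both the nfnic messages and the StorageFPIN messages are consistent with the issue described here. Older occurrences may have rotated out of the current log file.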

Cause

The FPIN (Fabric Performance Impact Notifications) capability was added in ESXi 8.0 U2 to help diagnose fabric-related issues. Due to a bug in the StorageFPIN code, when FPIN tries to allocate memory and fails, it can hold a reference count on the paths, which prevents the Cisco NFNIC driver from allocating new paths or re-establishing existing ones.

Resolution

This is a known issue involving both FPIN and the way the Cisco NFNIC driver is coded to behave when paths are lost. The NFNIC driver does not save storage port bindings, so when a storage path re-establishes after an outage or path loss, the driver simply creates brand new paths and increments the target numbers. Because the FPIN bug keeps a reference count on those paths, the Cisco NFNIC driver is unable to establish the new paths.

A code fix to alter the FPIN open reference count behavior will be available in an upcoming ESXi 8.x release.

Cisco will be releasing NFNIC driver 5.0.0.46, which changes the driver behavior to use fixed Target IDs: https://bst.cisco.com/quickview/bug/CSCwn00553


To work around this issue, it is recommended to disable FPIN on ESXi 8.0 hosts, especially those using Cisco UCS and NFNIC:

esxcli storage fpin info set -e false

To confirm the setting:

esxcli storage fpin info get

Note: This setting change does not require a reboot on its own. However, if an ESXi host is already in a memory heap exhaustion state for storageFPINHeap, the host must be rebooted after the setting is changed.
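
The workaround and the reboot decision can be combined into one step, sketched below under stated assumptions. The function name apply_fpin_workaround and the ESXCLI_BIN override are hypothetical and exist only so the logic can be dry-run with a stub. Treating the storageFPINHeap message in the current log as evidence of heap exhaustion is a heuristic, not an official check.

```shell
#!/bin/sh
# Disable FPIN, show the resulting setting, and flag whether a reboot looks
# necessary. apply_fpin_workaround and ESXCLI_BIN are hypothetical names.
apply_fpin_workaround() {
    logfile="${1:-/var/log/vmkernel.log}"
    # Defaults to the real esxcli; can be overridden for a dry run.
    esxcli_bin="${ESXCLI_BIN:-esxcli}"

    # The two commands given in this article.
    "$esxcli_bin" storage fpin info set -e false || return 1
    "$esxcli_bin" storage fpin info get || return 1

    # Heuristic (assumption): the heap message in the log indicates the host
    # already reached the storageFPINHeap exhaustion state.
    if [ -r "$logfile" ] && \
       grep -q -F -- "Heap storageFPINHeap already at its maximum size" "$logfile"
    then
        echo "REBOOT REQUIRED: storageFPINHeap exhaustion found in $logfile"
    else
        echo "No storageFPINHeap exhaustion found in $logfile; reboot not indicated"
    fi
}
```

Setting ESXCLI_BIN=echo before calling the function prints the esxcli commands instead of running them, which is a simple way to review the steps before applying them. Because older log entries may have rotated out, a "reboot not indicated" result does not rule out earlier heap exhaustion.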