Temporary/transient storage path loss on Host could result in paths not coming back when using Cisco UCS and NFNIC

Article ID: 380321

Products

VMware vSphere ESXi 8.0

Issue/Introduction

  • After unmounting and detaching a datastore, the HBA rescan task fails with the error 'an error occurred while communicating with the remote host'.

  • During SAN maintenance or an unexpected storage path outage, the NFNIC driver is unable to add paths back. The following sequence is logged repeatedly in /var/log/vmkernel.log:

    WARNING: nfnic: <2>: fnic_handle_report_lun: 1467: lun add failure! in_remove: 0 ioAllowed: 1
    WARNING: nfnic: <2>: fnic_tport_event_handler: 2130: lunmap update failed,retry ..
    nfnic: <2>: INFO: fnic_handle_report_lun: 1380: Report luns response for target_fcid : 0xaf01e0 target_id:283 num_luns 10
    WARNING: nfnic: <2>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
  • Note the following events related to StorageFPIN:
    WARNING: StorageFPIN: 521: Failed to allocate memory.
    WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
  • The following additional messages may also appear in /var/log/vmkernel.log:
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu20:2097755)WARNING: nfnic: <1>: fnic_handle_report_lun: 1465: lun add failure! in_remove: 0 ioAllowed: 1
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu20:2097755)WARNING: nfnic: <1>: fnic_tport_event_handler: 2129: lunmap update failed,retry ..
    YYYY-MM-DDTHH:MM:SSZ In(182) vmkernel: cpu20:2097755)nfnic: <1>: INFO: fnic_handle_report_lun: 1379: Report luns response for target_fcid : 0xa02a1 target_id:72 num_luns 8
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu8:2097762)WARNING: nfnic: <2>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu20:2097755)WARNING: nfnic: <1>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
    YYYY-MM-DDTHH:MM:SSZ In(182) vmkernel: cpu8:89469270)zdriver: _zmod_periodic:348:  #0- logs are not pulled
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu8:2097762)WARNING: nfnic: <2>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu20:2097755)WARNING: nfnic: <1>: fnic_handle_report_lun: 1442: vmk_ScsiScanAndClaimPaths returned BUSY
    YYYY-MM-DDTHH:MM:SSZ In(182) vmkernel: cpu8:89469270)zdriver: _zmod_periodic:348:  #0- logs are not pulled
    
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu27:2097450)WARNING: StorageFPIN: 521: Failed to allocate memory.
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu27:2097450)WARNING: StorageFPIN: 521: Failed to allocate memory.
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu21:2097450)WARNING: StorageFPIN: 521: Failed to allocate memory.
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu23:2097450)WARNING: StorageFPIN: 521: Failed to allocate memory.
     
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu27:2097450)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu27:2097450)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu21:2097450)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
    YYYY-MM-DDTHH:MM:SSZ Wa(180) vmkwarning: cpu23:2097450)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
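
To check a copy of vmkernel.log for the messages listed above, a small script can count each signature. The following is a minimal sketch, not a supported diagnostic tool: the function names and the rule "at least one NFNIC symptom plus at least one StorageFPIN symptom" are assumptions made for illustration, and the patterns are taken only from the excerpts in this article.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts in this article.
# Line numbers inside the messages (e.g. 1442, 1467) vary, so they are matched loosely.
SIGNATURES = {
    "lun_add_failure": re.compile(
        r"nfnic: <\d+>: fnic_handle_report_lun: \d+: lun add failure!"),
    "lunmap_retry": re.compile(
        r"nfnic: <\d+>: fnic_tport_event_handler: \d+: lunmap update failed"),
    "scan_busy": re.compile(r"vmk_ScsiScanAndClaimPaths returned BUSY"),
    "fpin_alloc_failed": re.compile(r"StorageFPIN: \d+: Failed to allocate memory"),
    "fpin_heap_full": re.compile(r"Heap storageFPINHeap already at its maximum size"),
}


def count_signatures(lines):
    """Count how many lines match each known signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def looks_like_this_issue(counts):
    """Hypothetical heuristic: NFNIC path-add failures together with StorageFPIN heap errors."""
    nfnic_symptom = counts["lun_add_failure"] or counts["scan_busy"]
    fpin_symptom = counts["fpin_alloc_failed"] or counts["fpin_heap_full"]
    return bool(nfnic_symptom and fpin_symptom)
```

For example, passing an open file handle for a copied log (`count_signatures(open("vmkernel.log", errors="replace"))`) returns the per-signature counts, which can then be given to `looks_like_this_issue`.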
     

 

Environment

ESXi 8.x

Cisco UCS Servers

Cause

  • FPIN (Fabric Performance Impact Notification) support was added in ESXi 8.0 U2 to give better visibility into fabric-related issues. Due to a bug in the StorageFPIN code, when FPIN fails to allocate memory it can keep holding a reference count on the affected paths. That held reference prevents the Cisco NFNIC driver from allocating new paths or re-establishing existing ones.
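
The general bug pattern described above, a reference taken before an allocation and never released when the allocation fails, can be illustrated with a toy model. This is not VMkernel or NFNIC code: every class, function, and parameter name below is hypothetical, and the real StorageFPIN implementation is not shown in this article.

```python
class ToyPath:
    """Stand-in for a storage path with a simple reference count (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.refcount = 0


class ToyFpinHandler:
    """Hypothetical event handler backed by a fixed-size heap."""

    def __init__(self, free_heap_slots, release_on_failure):
        self.free_heap_slots = free_heap_slots
        self.release_on_failure = release_on_failure

    def handle_event(self, path):
        path.refcount += 1              # take a reference while handling the event
        if self.free_heap_slots == 0:   # allocation fails: heap is already at its maximum
            if self.release_on_failure:
                path.refcount -= 1      # correct error path: drop the reference
            return False                # buggy error path: reference is leaked
        self.free_heap_slots -= 1       # "allocate"
        # ... the event would be processed here ...
        self.free_heap_slots += 1       # "free"
        path.refcount -= 1
        return True


def can_tear_down(path):
    """A path can only be removed or replaced once nothing references it."""
    return path.refcount == 0


buggy = ToyFpinHandler(free_heap_slots=0, release_on_failure=False)
fixed = ToyFpinHandler(free_heap_slots=0, release_on_failure=True)
leaked, clean = ToyPath("path-a"), ToyPath("path-b")
buggy.handle_event(leaked)
fixed.handle_event(clean)
print(can_tear_down(leaked), can_tear_down(clean))  # False True
```

In the toy model, the leaked reference on `path-a` is what keeps it from being torn down, which loosely mirrors how a held reference can block the driver from re-adding paths.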

Resolution

This is a known issue caused by the combination of the FPIN bug and the way the Cisco NFNIC driver handles lost paths. The NFNIC driver does not save storage port bindings, so when a storage path is re-established after an outage or path loss, the driver creates brand-new paths and increments the target numbers. Because FPIN keeps a reference count on those paths, the driver is unable to establish the new ones. Cisco is set to release NFNIC driver version 5.0.0.48, which introduces fixed target IDs that work around this issue.

There is currently only one fix for this issue:

  • A code fix that alters the FPIN open reference count behavior is available in ESXi 8.0 U3e (build 24674464).
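
To confirm whether a host is already at the fixed build, the version string reported by the host can be compared against the build number above. The sketch below assumes the string has the form printed by `vmware -v` (for example "VMware ESXi 8.0.3 build-24674464") and that 8.0 U3 releases report version 8.0.3. The function names are hypothetical, and the check deliberately covers only the 8.0 U3 line because this article names no other fixed release.

```python
import re

FIXED_VERSION = (8, 0, 3)   # assumption: 8.0 U3 releases report version 8.0.3
FIXED_BUILD = 24674464      # ESXi 8.0 U3e, per this article


def parse_version_string(text):
    """Extract ((major, minor, update), build) from a 'vmware -v' style string."""
    match = re.search(r"ESXi (\d+)\.(\d+)\.(\d+) build-(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, update, build = (int(part) for part in match.groups())
    return (major, minor, update), build


def has_fpin_fix(text):
    """True only for 8.0 U3 builds at or above the fixed build number."""
    version, build = parse_version_string(text)
    return version == FIXED_VERSION and build >= FIXED_BUILD
```

A host reporting "VMware ESXi 8.0.3 build-24674464" passes the check, while an earlier 8.0 U3 build or any 8.0 U2 build does not.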