ESXi host fails with purple diagnostic screen (PSOD),"PF Exception 14 in world 2098213:tq:tq-iport IP 0x4200033dfa22 addr 0x45d9a6800f1c"

Article ID: 425684

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • An ESXi host fails with a Purple Diagnostic Screen (PSOD). The crash screen displays a Page Fault exception similar to the following:
    PF Exception 14 in world <World_ID>:tq:tq-iport IP 0x4200033dfa22 addr 0x45d9a6800f1c

  • The following snippet is observed on the ESXi console: [console screenshot not reproduced here]

  • Analysis of the /var/run/log/vmkernel.log file on the ESXi host prior to the crash reveals a sequence of events in which the nfnic driver repeatedly fails to update the LUN map. This is followed by memory allocation failures, indicating heap exhaustion:
    YYYY-MM-ddTHH:MM:SS.377Z In(182) vmkernel: cpu94:2098204)nfnic: <1>: INFO: fnic_free_lun_list_by_no_active_lun: 658: FNIC free lun list:fcid:<ID>, lun:5
    YYYY-MM-ddTHH:MM:SS.377Z Wa(180) vmkwarning: cpu94:2098204)WARNING: nfnic: <1>: fnic_handle_report_lun: 1533: lun add failure! in_remove: 0 ioAllowed: 1
    YYYY-MM-ddTHH:MM:SS.377Z Wa(180) vmkwarning: cpu94:2098204)WARNING: nfnic: <1>: fnic_tport_event_handler: 2136: lunmap update failed,retry ..

  • After an interval of continuous retries, the same logs show the system failing to allocate memory, confirming the exhaustion of the VMkernel heap:
    YYYY-MM-ddTHH:MM:SS.377Z Wa(180) vmkwarning: cpu62:2097957)WARNING: StorageFPIN: 521: Failed to allocate memory
    YYYY-MM-ddTHH:MM:SS.505Z Wa(180) vmkwarning: cpu74:2097957)WARNING: Heap: 3645: Heap storageFPINHeap already at its maximum size. Cannot expand.
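
The log signatures above can be searched for mechanically. The following is a minimal sketch, not a supported tool: the function names (scan_vmkernel_log, scan_file, matches_known_issue) and the signature labels are hypothetical, and the patterns are derived only from the excerpts shown in this article. The source line numbers in the messages (for example 1533 or 3645) are assumed to vary between builds, so the patterns match any number in that position.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts in this article.
# The numeric source-line field is matched loosely (assumption: it may
# differ between driver and ESXi builds).
SIGNATURES = {
    "lun_add_failure": re.compile(
        r"nfnic: .*fnic_handle_report_lun: \d+: lun add failure"),
    "lunmap_retry": re.compile(
        r"nfnic: .*fnic_tport_event_handler: \d+: lunmap update failed"),
    "fpin_alloc_failure": re.compile(
        r"StorageFPIN: \d+: Failed to allocate memory"),
    "fpin_heap_full": re.compile(
        r"Heap storageFPINHeap already at its maximum size"),
}


def scan_vmkernel_log(lines):
    """Count how many lines match each signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def matches_known_issue(counts):
    """True when both halves of the pattern appear: nfnic LUN map
    failures and StorageFPIN heap allocation failures."""
    nfnic_seen = counts["lun_add_failure"] + counts["lunmap_retry"] > 0
    fpin_seen = counts["fpin_alloc_failure"] + counts["fpin_heap_full"] > 0
    return nfnic_seen and fpin_seen


def scan_file(path="/var/run/log/vmkernel.log"):
    """Convenience wrapper for a log file on disk."""
    with open(path, errors="replace") as fh:
        return scan_vmkernel_log(fh)
```

Calling matches_known_issue(scan_file()) against a copy of the log taken before the crash gives a quick first indication only. A match does not replace analysis of the core dump, and rotated logs may need to be scanned as well.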

Environment

VMware ESXi Version: 8.0 U3

Cisco UCS Servers

Cause

The FPIN (Fabric Performance Impact Notifications) capability was added in ESXi 8.0 U2 to give better visibility into fabric-related issues. Due to a bug in the StorageFPIN code, when an FPIN memory allocation fails, the code can hold on to a reference count on the affected paths. The held reference prevents the Cisco nfnic driver from allocating new paths or re-establishing existing ones.

Refer to: Temporary/transient storage path loss on Host could result in paths not coming back when using Cisco UCS and NFNIC

When the driver retries continuously, the memory allocated during those retries is not immediately released back to the system. Over time, this exhausts the available heap, and the host eventually becomes unresponsive and fails with a purple diagnostic screen.

Resolution

This is a known issue involving nfnic driver path handling and ESXi FPIN reference counting. To resolve it, apply one of the following:

1. Upgrade to ESXi 8.0 U3e (Build 24674464) or later.

2. Update the Cisco nfnic driver to version 5.0.0.48 or later.
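
When checking many hosts, the two thresholds above can be compared programmatically. The sketch below is illustrative only: the function names are hypothetical, and it assumes the ESXi build number and the nfnic driver version string have already been collected (for example from the output of vmware -vl and esxcli software vib list). It also assumes that an installed driver version may carry a vendor suffix after the dotted numeric part, which is ignored for the comparison.

```python
import re

# Thresholds taken from the Resolution section of this article.
FIXED_ESXI_BUILD = 24674464          # ESXi 8.0 U3e
FIXED_NFNIC_VERSION = (5, 0, 0, 48)  # nfnic 5.0.0.48


def parse_driver_version(version):
    """Return the leading dotted numeric part of a version string as a
    tuple of ints, e.g. "5.0.0.48-1OEM" -> (5, 0, 0, 48).
    Anything after the numeric prefix is ignored (assumption)."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_remediated(esxi_build, nfnic_version):
    """True when either fix listed in this article is in place."""
    build_ok = int(esxi_build) >= FIXED_ESXI_BUILD
    driver_ok = parse_driver_version(nfnic_version) >= FIXED_NFNIC_VERSION
    return build_ok or driver_ok
```

Either condition is treated as sufficient because the article lists the two options as alternatives. The build comparison is meaningful only within the ESXi 8.0 U3 line covered by this article.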