PSOD on an ESX host with PF Exception 14 in "lpfc_path_claim_handler"

Article ID: 426284

Products

VMware vSphere ESXi

Issue/Introduction

  • An ESX host may experience a Purple Screen of Death (PSOD) during storage array maintenance or upgrades.

  • The following backtrace is logged in /var/run/log/logEFI.log on the ESX host:

    YYYY-MM-DDTHH:MM:SS In(14) LogEFI[####]: #PF Exception 14 in world #####:lpfc_path_cl IP 0#### addr 0####
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI[####]: PTEs:0####;0####;0x0;
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI[####]: Module(s) involved in panic: [lpfc 900.14.4.390.20-36vmw.901.0.24957456 (External)]
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)cr0=0#### cr2=0#### cr3=0#### cr4=0####
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)FMS=#### uCode=0####
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)frame=0#### ip=0#### err=0x0 rflags=0####
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)rax=0x0 rbx=0#### rcx=0x0
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)rdx=0#### rbp=0#### rsi=0####
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)rdi=0xffffffffffffffff r8=0x0 r9=0x0
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)r10=0x0 r11=0x0 r12=0####
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)r13=0x1 r14=0#### r15=0####
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI[#####]: *PCPU#:####/lpfc_path_claim-#-#
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI[#####]: PCPU  #: SVVSUVVUVVVUVVSVVVUVVVVVSVVVVSVV
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)Code start: 0#### VMK uptime: ##:##:##:##
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)####:[0x####]lpfc_path_claim_handler@(lpfc)#<None>+0#### stack: 0####
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)####:[0x####]lpfc_pathclaim_event@(lpfc)#<None>+0#### stack: 0x####
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)####:[0x####]vmkWorldFunc@vmkernel#nover+0#### stack: 0####
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)####:[0x####]CpuSched_StartWorld@vmkernel#nover+0#### stack: 0x0
    YYYY-MM-DDTHH:MM:SS In(14) LogEFI: cpu#:####)####:[0x####]Debug_IsInitialized@vmkernel#nover+0#### stack: 0x0
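
The signature above can be checked against a copy of logEFI.log with a short script. The following is a minimal sketch, not a supported tool: the function name match_psod_signature is hypothetical, and the regular expressions are derived only from the sample backtrace in this article, so the exact wording on other builds is an assumption.

```python
import re

# Patterns derived from the sample backtrace in this article (assumption:
# other builds log the same wording).
PF_RE = re.compile(r"#PF Exception 14 in world \S*lpfc_path_cl")
FRAME_RE = re.compile(r"lpfc_path_claim_handler@\(lpfc\)")
MODULE_RE = re.compile(r"Module\(s\) involved in panic: \[(lpfc [^\]]+)\]")


def match_psod_signature(lines):
    """Report whether the given log lines match the PSOD signature."""
    result = {"page_fault": False, "handler_frame": False, "module": None}
    for line in lines:
        if PF_RE.search(line):
            result["page_fault"] = True
        if FRAME_RE.search(line):
            result["handler_frame"] = True
        m = MODULE_RE.search(line)
        if m:
            # Keep the driver name and version string for later comparison.
            result["module"] = m.group(1)
    # Both the page fault and the handler frame must be present.
    result["match"] = result["page_fault"] and result["handler_frame"]
    return result


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], errors="replace") as fh:
        print(match_psod_signature(fh))
```

Run it with the path to a copied logEFI.log as the only argument. A result with "match" set to True only indicates that the backtrace resembles the one in this article; the DID change sequence described under Cause should still be confirmed in vmkernel.log.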

Environment

VMware vSphere ESX 9.0.x

Cause

The crash is caused by a race condition in the Emulex lpfc driver that is triggered by rapid changes in a storage target's Destination ID (DID). When a target WWPN changes its DID and then reverts to the original DID (common during some array upgrades), the driver hits a use-after-free memory error while processing the overlapping path-claim events.

To determine whether an environment is affected by this specific race condition, look for the "DID flip-flop" sequence in the logs.

Target DID Change Sequence
Using WWPN ##:##:##:##:##:##:##:## as an example, look for the following log sequence in vmkernel.log:

/var/run/log/vmkernel.log

    1. Initial DID (0x100) goes offline. The fabric sends an RSCN notifying the host that the path is gone.
      YYYY-MM-DDTHH:MM:SS cpu##:#### lpfc: lpfc_els_rcv_rscn:###: vmhba# RSCN received event x0 : Address format x00 : DID 0x100
      YYYY-MM-DDTHH:MM:SS cpu##:#### WARNING: lpfc : vmhba# lpfc_start_devloss:####: Start 10 sec devloss tmo WWPN ##:##:##:## NPort 0x100 

    2. Target reappears with a new DID (0x200). The host discovers the same WWPN at a different fabric address.
      YYYY-MM-DDTHH:MM:SS cpu##:#### lpfc: lpfc_els_rcv_rscn:####: vmhba# RSCN received event x0 : Address format x00 : DID 0x200
      YYYY-MM-DDTHH:MM:SS cpu##:#### lpfc : vmhba# lpfc_cmpl_prli_prli_issue:####: FCP NPR PRLI Cmpl DID 0x200 Init 0 Tgt 1 EIP 1 AccCode 0x200

    3. New DID (0x200) goes offline. The temporary path is dropped, often during a storage controller failback or upgrade.
      YYYY-MM-DDTHH:MM:SS cpu##:#### lpfc: lpfc_els_rcv_rscn:####: vmhba# #### RSCN received event x0 : Address format x00 : DID 0x200
      YYYY-MM-DDTHH:MM:SS cpu##:#### WARNING: lpfc : vmhba# lpfc_start_devloss:####: Start 10 sec devloss tmo WWPN ##:##:##:## NPort 0x200

    4. Original DID (0x100) comes back online. The target returns to its original address, completing the cycle that triggers the driver race condition.
      YYYY-MM-DDTHH:MM:SS cpu##:#### lpfc: lpfc_els_rcv_rscn:7484: vmhba# ### RSCN received event x0 : Address format x00 : DID 0x100
      YYYY-MM-DDTHH:MM:SS cpu##:#### lpfc : vmhba# lpfc_cmpl_prli_prli_issue:####: FCP NPR PRLI Cmpl DID 0x100 Init 0 Tgt 1 EIP 1 AccCode 0x100
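
The four steps above can be searched for in a copy of vmkernel.log with a script. The sketch below is an illustration under stated assumptions, not an official diagnostic: the function name find_did_flip_flops is hypothetical, and the regular expressions assume that real log lines follow the sample format in this article with numeric adapter names and source line numbers. It treats a lpfc_start_devloss line as "DID offline" and a lpfc_cmpl_prli_prli_issue line as "DID online", and ignores the RSCN lines. Because the PRLI line carries no WWPN, the WWPN is compared only between the two devloss lines.

```python
import re
from collections import defaultdict

# Assumption: real lines match the sample format shown in this article.
DEVLOSS_RE = re.compile(
    r"(vmhba\d+)\s+lpfc_start_devloss:\d+:.*?"
    r"devloss tmo WWPN (\S+) NPort (?:0x)?([0-9a-fA-F]+)"
)
PRLI_RE = re.compile(
    r"(vmhba\d+)\s+lpfc_cmpl_prli_prli_issue:\d+:.*?"
    r"PRLI Cmpl DID (?:0x)?([0-9a-fA-F]+)"
)


def find_did_flip_flops(lines):
    """Return (hba, wwpn, original_did, temporary_did) per flip-flop found."""
    events = defaultdict(list)  # hba -> [(kind, did, wwpn or None)] in log order
    for line in lines:
        m = DEVLOSS_RE.search(line)
        if m:
            did = int(m.group(3), 16)
            events[m.group(1)].append(("offline", did, m.group(2).lower()))
            continue
        m = PRLI_RE.search(line)
        if m:
            events[m.group(1)].append(("online", int(m.group(2), 16), None))

    hits = []
    for hba, evs in events.items():
        offline = {}     # step 1: DID -> WWPN for DIDs that went offline
        moved = set()    # step 2: (a, b, wwpn), b logged in after a went offline
        dropped = set()  # step 3: (a, b, wwpn), b went offline with the same WWPN
        for kind, did, wwpn in evs:
            if kind == "offline":
                dropped |= {p for p in moved if p[1] == did and p[2] == wwpn}
                offline[did] = wwpn
            else:
                # Step 4: the original DID logs in again and completes the cycle.
                hits += [(hba, w, a, b) for (a, b, w) in dropped if a == did]
                moved = {p for p in moved if p[0] != did}
                dropped = {p for p in dropped if p[0] != did}
                moved |= {(a, did, w) for a, w in offline.items() if a != did}
                offline.pop(did, None)
    return hits
```

Calling find_did_flip_flops(open("vmkernel.log", errors="replace")) returns one tuple per detected cycle, with the DIDs as integers. A hit is only a hint, since unrelated targets logging in between the two devloss events can produce the same pattern; confirm it by reading the surrounding RSCN and devloss lines for the WWPN in question.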

Resolution

There is currently no workaround.

Broadcom Engineering is aware of the issue, and a fix is being developed by Emulex.
The fixed driver version will be included in a future release.