PSOD on ESXi 8.0 U3 for world running 3rd Party Multi Path Plugin

Article ID: 387956

Updated On:

Products

VMware vSphere ESXi 8.0

Issue/Introduction

Symptoms:

  • ESXi 8.0 U2 added support for FPIN (Fabric Performance Impact Notifications).
  • FPIN messages proactively alert devices in a fabric about specific conditions that may impact performance.
  • ESXi 8.0 U3 extended this capability to the Multi Path Plugins (MPPs): received notifications can now be forwarded to the MPPs.
  • Partner plugins developed using an ESXi SDK version earlier than 8.0 U3 do not support FPIN handling.
  • In a SAN configuration where FPIN notifications are supported, an ESXi host may crash with a PSOD upon receiving an FPIN notification if an MPP developed using an SDK version earlier than 8.0 U3 is installed on the host.

Validation:

  • In /var/log/vmkernel.log, events similar to the following are logged when an FPIN is received:
[YYYY-MM-DDTHH:MM:SS] cpu70:2098697)WARNING: lpfc: lpfc_els_rcv_fpin_cgn:7657: vmhba4 4657 FPIN CONGESTION WARNING Notification type Credit Stall (x2) Event Duration 10000 mSecs
[YYYY-MM-DDTHH:MM:SS] cpu70:2098697)StorageFPIN: 1279: Report FC FPIN Congestion Credit Stall event (hostWWPN 100000109bf4bbbf tgtWWPN 50000975b019bc0a) to vobd. 0 events have occurred since last report.
[YYYY-MM-DDTHH:MM:SS] cpu70:2098697)StoragePath: 5394: Calling MPP PowerPath for link event 2 on adapter vmhba4 (hostWWPN=0x100000109bf4bbbf targetWWPN=0xffffffffffffffff targetNum = 4294967295)
[YYYY-MM-DDTHH:MM:SS] cpu70:2098697)World: 3355: PRDA 0x420051800000 ss 0x0 ds 0x750 es 0x750 fs 0x0 gs 0x0
[YYYY-MM-DDTHH:MM:SS] cpu70:2098697)World: 3357: TR 0x768 GDT 0xfffffffffca02888 (0xffff) IDT 0xfffffffffc408000 (0xffff)
[YYYY-MM-DDTHH:MM:SS] cpu70:2098697)World: 3359: CR0 0x80050033 CR3 0x1824007e000 CR4 0x152668
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)WARNING: lpfc: lpfc_els_rcv_fpin_cgn:7657: vmhba6 4657 FPIN CONGESTION WARNING Notification type Credit Stall (x2) Event Duration 10000 mSecs
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)StorageFPIN: 1279: Report FC FPIN Congestion Credit Stall event (hostWWPN 100070b7e40523e2 tgtWWPN 50000975b019bc0a) to vobd. 0 events have occurred since last report.
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)StoragePath: 5394: Calling MPP PowerPath for link event 2 on adapter vmhba6 (hostWWPN=0x100070b7e40523e2 targetWWPN=0xffffffffffffffff targetNum = 4294967295)
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)World: 3355: PRDA 0x42005b400000 ss 0x0 ds 0x750 es 0x750 fs 0x750 gs 0x750
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)World: 3357: TR 0x758 GDT 0x453b4073d888 (0xffff) IDT 0x42001049d000 (0xffff)
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)World: 3359: CR0 0x8001003d CR3 0x5eee4000 CR4 0x14216c
2024-08-05T18:35:47.560Z cpu70:2098697)WARNING: SelfDiagnostics: 550: Failed to do successful stack walk. Stack is corrupt
  • Upon receiving the FPIN, the ESXi host may crash in the world running the MPP, with a backtrace similar to the following:
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)Backtrace for current CPU #109, worldID=2098704, fp=0x47
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)0x453bdc89b620:[0x42000ff7bb40]PanicvPanicInt@vmkernel#nover+0x20c stack: 0x100, 0x42000ff7bb40, 0x47, 0x42000ff7c3bf, 0x42000ff7bb40
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)0x453bdc89b6d0:[0x42000ff7c3be]Panic_ExceptionMsg@vmkernel#nover+0x57 stack: 0x453bdc89b740, 0x453bdc89b6f0, 0x6e696c6c6143203a, 0xffffc1e0f0783000, 0x420010503465
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)0x453bdc89b740:[0x4200104a6ddf]Panic_Exception@vmkernel#nover+0x144 stack: 0x453bdc89b78e, 0x42001053414d, 0x74206666666666, 0x3fffffffff, 0x363934393234203d
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)0x453bdc89b870:[0x4200104a0285]IDTReturnPrepare@vmkernel#nover+0x14a stack: 0x0, 0x0, 0x0, 0x42001049b0c7, 0x750
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)0x453bdc89b8a0:[0x42001049b0c6]gate_entry@vmkernel#nover+0xa7 stack: 0x0, 0x0, 0x4e20, 0x2, 0x431ce1e012c0
[YYYY-MM-DDTHH:MM:SS] cpu109:2098704)Panic: 783: Halting PCPU 109.
[YYYY-MM-DDTHH:MM:SS] cpu70:2098697)Backtrace for current CPU #70, worldID=2098697, fp=0x453be0e1ba88
[YYYY-MM-DDTHH:MM:SS] cpu70:2098697)VMware ESXi 8.0.3 [Releasebuild-24022510 x86_64]#PF Exception 14 in world 2098697:lpfc_do_work IP 0x0 addr 0x0
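
The validation above can be partly automated. The following is a minimal Python sketch, not a supported tool, that scans vmkernel log lines for the three signatures shown in the excerpts: the StorageFPIN report, the StoragePath callout into an MPP, and the panic markers. The function name scan_vmkernel_log and the regular expressions are assumptions derived only from the sample lines in this article; message formats may differ on other builds or with other drivers.

```python
import re
from collections import Counter

# Patterns modelled on the vmkernel.log excerpts in this article (assumed formats).
FPIN_RE = re.compile(
    r"StorageFPIN: \d+: Report FC FPIN (?P<event>.+?) event "
    r"\(hostWWPN (?P<host>[0-9a-fA-F]+) tgtWWPN (?P<tgt>[0-9a-fA-F]+)\)"
)
MPP_RE = re.compile(r"StoragePath: \d+: Calling MPP (?P<mpp>\S+) for link event")
PANIC_RE = re.compile(r"#PF Exception 14|PanicvPanicInt|Panic_Exception")


def scan_vmkernel_log(lines):
    """Summarise FPIN reports, MPP link-event callouts and panic markers."""
    summary = {"fpin_events": [], "mpp_callouts": Counter(), "panic_seen": False}
    for line in lines:
        fpin = FPIN_RE.search(line)
        if fpin:
            # Record (event type, host WWPN, target WWPN) for each FPIN report.
            summary["fpin_events"].append(
                (fpin.group("event"), fpin.group("host"), fpin.group("tgt"))
            )
        mpp = MPP_RE.search(line)
        if mpp:
            # Count how often each MPP was called for a link event.
            summary["mpp_callouts"][mpp.group("mpp")] += 1
        if PANIC_RE.search(line):
            summary["panic_seen"] = True
    return summary
```

To use it, pass any iterable of log lines, for example an open file handle on a copy of /var/log/vmkernel.log. An FPIN report followed by a callout to a third-party MPP and a panic marker matches the pattern described in this article, but it is only an indication, not proof of this specific issue.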

Environment

  • VMware vSphere ESXi 8.0 U3. 
  • SAN Fabric (FC switch) with FPIN notifications enabled.
  • Partner MPP compiled against a VMKAPI SDK version earlier than 8.0 U3, or a partner MPP that does not implement the FPIN event handler.

Cause

  • MPPs developed using a VMKAPI SDK version earlier than 8.0 U3 do not have access to the new FPIN VMKAPI definitions. MPPs compiled against the latest SDK might also not implement the FPIN event handler.
  • ESXi 8.0 U3 and later assume that the MPP always implements the event handler interface, so the host crashes with a PSOD when an FPIN is forwarded to an MPP that does not implement it.

Resolution

This issue is addressed in vSphere ESXi 8.0 U3e Build 24674464.

Workaround:

  • Disable FPIN notification processing in the VMkernel storage layer by using the esxcli command line.
  • The command saves the FPIN setting to both ConfigStore and the VMkernel System Interface Shell, and the setting persists across ESXi reboots.
  • Use the following command to activate or deactivate Fabric Performance Impact Notification (FPIN) processing:
    # esxcli storage fpin info set -e=<true/false>
  • Verify the configuration change with:
    # esxcli storage fpin info get
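
Where the workaround has to be applied on several hosts, the two commands above can be wrapped in a small script. The following is a minimal Python sketch under stated assumptions: it is run in the ESXi shell (or anywhere esxcli is on the PATH), it uses only the set and get commands quoted in this article, and the helper names fpin_set_command, fpin_get_command and disable_fpin are hypothetical. It does not parse the output of the get command, because that format is not described here.

```python
import subprocess

# Base command taken from this article; nothing beyond 'set -e=' and 'get' is assumed.
ESXCLI_FPIN = ["esxcli", "storage", "fpin", "info"]


def fpin_set_command(enable):
    """Build the argv for 'esxcli storage fpin info set -e=<true/false>'."""
    return ESXCLI_FPIN + ["set", "-e=" + ("true" if enable else "false")]


def fpin_get_command():
    """Build the argv for 'esxcli storage fpin info get'."""
    return ESXCLI_FPIN + ["get"]


def disable_fpin(run=subprocess.run):
    """Disable FPIN processing, then return the raw output of the 'get' command.

    'run' is injectable so the sequence can be dry-run without touching a host.
    """
    run(fpin_set_command(False), check=True)
    result = run(fpin_get_command(), check=True, capture_output=True, text=True)
    return result.stdout
```

Calling disable_fpin() executes the set command first and raises an exception if esxcli returns a non-zero exit code, so the verification step only runs after a successful change. Review the returned text manually to confirm that FPIN is deactivated.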

Additional Information