Remediating the qedf driver PSOD on ESXi 7.0 by updating the driver
Symptoms:
The ESXi host may fail with a purple diagnostic screen (PSOD) and a backtrace similar to the following:
[YYYY-MM-DDTHH:MM:SS] cpu25:2098138)@BlueScreen: #PF Exception 14 in world 2098138:ql_fcoe_dela IP 0x42002b40159c addr 0x128
PTEs:0x14f2fa023;0x14f2fb023;0x14f2fc023;0x0;
[YYYY-MM-DDTHH:MM:SS] cpu25:2098138)Code start: 0x42002a400000 VMK uptime: 82:21:43:01.789
[YYYY-MM-DDTHH:MM:SS] cpu25:2098138)0x4538d989bf28:[0x42002b40159c]CommandPumpOnPassiveLevel@(qedf)#<None>+0x0 stack: 0x43127a373000
[YYYY-MM-DDTHH:MM:SS] cpu25:2098138)0x4538d989bf30:[0x42002b3e684a]SendFCoEVlanSolicitation@(qedf)#<None>+0x353 stack: 0x43127a373018
[YYYY-MM-DDTHH:MM:SS] cpu25:2098138)0x4538d989bf50:[0x42002b3e7013]FipVlanTimeoutWork@(qedf)#<None>+0x15c stack: 0x43127a373018
[YYYY-MM-DDTHH:MM:SS] cpu25:2098138)0x4538d989bf70:[0x42002b3ff711]ql_fcoe_do_singlethread_work@(qedf)#<None>+0x76 stack: 0x43127a373000
[YYYY-MM-DDTHH:MM:SS] cpu25:2098138)0x4538d989bf90:[0x42002a51e224]vmkWorldFunc@vmkernel#nover+0x49 stack: 0x42002a51e220
[YYYY-MM-DDTHH:MM:SS] cpu25:2098138)0x4538d989bfe0:[0x42002a7b3b09]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
[YYYY-MM-DDTHH:MM:SS] cpu25:2098138)0x4538d989c000:[0x42002a4c4d7f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
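To check whether a captured backtrace matches this signature, the frames can be scanned for the qedf functions named above. The following is a minimal Python sketch, not a VMware tool. The function name matches_qedf_psod and the choice of which two frames to require are assumptions made for illustration.

```python
import re

# Frame format taken from the backtrace above: "]Function@(module)#" for
# driver frames and "]Function@vmkernel#" for kernel frames.
FRAME_RE = re.compile(r"\](\w+)@\(?(\w+)\)?#")


def matches_qedf_psod(backtrace: str) -> bool:
    """Return True if the backtrace contains the qedf frames seen in this issue."""
    frames = FRAME_RE.findall(backtrace)
    qedf_funcs = {func for func, module in frames if module == "qedf"}
    # Assumption: these two frames are treated as the identifying signature.
    return {"SendFCoEVlanSolicitation", "FipVlanTimeoutWork"} <= qedf_funcs
```

A match only indicates that the backtrace resembles the one in this article; it does not by itself confirm the root cause.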
VMware vSphere ESXi 7.0
Running the command "localcli storage core adapter list" may show storage adapters that use the qedf driver.
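The adapter listing can be filtered for qedf with a few lines of Python. This is a rough sketch that assumes the output is a whitespace-separated table whose first column is the HBA name and whose second column is the driver; verify that layout against the actual output on the host. The function name adapters_using_driver is hypothetical.

```python
def adapters_using_driver(adapter_list_output: str, driver: str = "qedf") -> list:
    """Return the HBA names whose driver column equals `driver`."""
    names = []
    for line in adapter_list_output.splitlines():
        fields = line.split()
        # Assumption: column 1 is the HBA name, column 2 is the driver.
        if len(fields) >= 2 and fields[0].startswith("vmhba") and fields[1] == driver:
            names.append(fields[0])
    return names
```

The text passed in would be the saved output of "localcli storage core adapter list"; header and separator lines are skipped because they do not start with "vmhba".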
The vmkernel.log file may show repeated link state changes similar to the following:
[YYYY-MM-DDTHH:MM:SS] cpu26:2097871)qedf:vmhba0:qedfc_link_update_handler:1926:Info: ST(LINK): LINK_DOWN->LINK_UP
[YYYY-MM-DDTHH:MM:SS] cpu26:2097871)qedf:vmhba0:qedfc_link_update_handler:1897:Info: ST(LINK): LINK_UP->LINK_DOWN
[YYYY-MM-DDTHH:MM:SS] cpu26:2097871)qedf:vmhba0:qedfc_link_update_handler:1926:Info: ST(LINK): LINK_DOWN->LINK_UP
[YYYY-MM-DDTHH:MM:SS] cpu26:2097871)qedf:vmhba0:qedfc_link_update_handler:1897:Info: ST(LINK): LINK_UP->LINK_DOWN
[YYYY-MM-DDTHH:MM:SS] cpu26:2097871)qedf:vmhba0:qedfc_link_update_handler:1926:Info: ST(LINK): LINK_DOWN->LINK_UP
[YYYY-MM-DDTHH:MM:SS] cpu26:2097871)qedf:vmhba0:qedfc_link_update_handler:1897:Info: ST(LINK): LINK_UP->LINK_DOWN
[YYYY-MM-DDTHH:MM:SS] cpu26:2097871)qedf:vmhba0:qedfc_link_update_handler:1926:Info: ST(LINK): LINK_DOWN->LINK_UP
[YYYY-MM-DDTHH:MM:SS] cpu26:2097871)qedf:vmhba0:qedfc_link_update_handler:1897:Info: ST(LINK): LINK_UP->LINK_DOWN
[YYYY-MM-DDTHH:MM:SS] cpu26:2097871)qedf:vmhba0:qedfc_link_update_handler:1926:Info: ST(LINK): LINK_DOWN->LINK_UP
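Repeated transitions like these can be counted per adapter to see how often the link flaps. The sketch below is illustrative only: the regular expression is derived from the log lines shown above, and the function name count_link_transitions is hypothetical.

```python
import re

# Matches the adapter name and the state transition in a qedf link update
# message, e.g. "qedf:vmhba0:...: ST(LINK): LINK_DOWN->LINK_UP".
LINK_RE = re.compile(
    r"qedf:(vmhba\d+):qedfc_link_update_handler:\d+:Info: ST\(LINK\): (\w+)->(\w+)"
)


def count_link_transitions(log_text: str) -> dict:
    """Count qedf link state transitions per adapter in vmkernel.log text."""
    counts = {}
    for hba, old, new in LINK_RE.findall(log_text):
        key = (hba, f"{old}->{new}")
        counts[key] = counts.get(key, 0) + 1
    return counts
```

The function only counts transitions. The source does not state a threshold at which the flapping becomes significant.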
Update the qedf driver to version 2.74.1.0-1OEM.
Driver for 45000/41000 Series Adapters:
https://customerconnect.vmware.com/downloads/details?downloadGroup=DT-ESXI70-MARVELL-E4-CNA-DRIVER-BUNDLE-503820&productId=974
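Once the installed qedf version is known (for example from "esxcli software vib list"), it can be compared with the fixed version. The sketch below is a minimal illustration: it assumes the version string starts with dotted numeric components followed by a "-" and a build suffix, and it assumes that versions later than 2.74.1.0 also contain the fix, which the source does not state. The function names are hypothetical.

```python
def parse_version(version: str) -> tuple:
    """Turn '2.74.1.0-1OEM' into (2, 74, 1, 0), ignoring the build suffix."""
    numeric = version.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))


# Fixed driver version named in this article.
FIXED_VERSION = parse_version("2.74.1.0-1OEM")


def has_fix(installed: str) -> bool:
    """Return True if the installed version is at or above the fixed version."""
    # Assumption: later versions also include the fix.
    return parse_version(installed) >= FIXED_VERSION
```

Tuples compare element by element, so 2.74.1.0 is correctly treated as newer than 2.9.0.0, which a plain string comparison would get wrong.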
The release notes for this driver confirm that the issue is fixed:
QLogic qedf VMware ESX Native Driver for ESXi 7.0/8.0
Copyright (c) 2015-2019 Cavium Inc.
Copyright (c) 2019-2020 Marvell Semiconductor, Inc.
All rights reserved

Version: 2.74.1.0
===========================
Enhancements:
-------------
- Update to qed-8.74.0.0 with storm fw 8.72.1.0

Fixes:
------
* [FJT-9121] : PSOD due to race condition between SendFCoEVlanSolicitation and LogoutAllFabrics.
  Resolution : Add mechanism of sync between SendFCoEVlanSolicitation and LogoutAllFabrics.
  Scope : 45000/41000 Series Adapters
The PSOD is caused by a race condition between SendFCoEVlanSolicitation and LogoutAllFabrics in the qedf driver.
When the race occurs, the host goes into a PSOD state.