ESXi 7.0 host experiences a PSOD when SRIOV and RoCE functions are both enabled in the inbox qedentv driver

Article ID: 324280

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
On ESXi 7.0, you experience these symptoms:
  • The ESXi host fails with a purple diagnostic screen (PSOD).
  • The backtrace contains entries similar to:

    [1]
    2020-02-20T06:14:03.542Z cpu39:1000368420)@BlueScreen: Failed at vmkdrivers/native/Proprietary/Network/qedentv/ecore/ecore_dev.c:4308 -- VMK_ASSERT(!(1))
    2020-02-20T06:14:03.558Z cpu39:1000368420)Code start: 0x420025000000 VMK uptime: 0:03:23:14.344
    2020-02-20T06:14:03.575Z cpu39:1000368420)0x451a59c9ad60:[0x420025162cab]PanicvPanicInt@vmkernel#nover+0x2b3 stack: 0x420025162cab
    2020-02-20T06:14:03.598Z cpu39:1000368420)0x451a59c9ae10:[0x4200251635a2]Panic_vPanic@vmkernel#nover+0x23 stack: 0x4309d78020ed
    2020-02-20T06:14:03.621Z cpu39:1000368420)0x451a59c9ae30:[0x4200251862c0]vmk_PanicWithModuleID@vmkernel#nover+0x41 stack: 0x451a59c9ae90
    2020-02-20T06:14:03.646Z cpu39:1000368420)0x451a59c9ae90:[0x42002648fda3]ecore_pglueb_set_pfid_enable@(qedentv)#<None>+0xd8 stack: 0x4309d7686348
    2020-02-20T06:14:03.673Z cpu39:1000368420)0x451a59c9aec0:[0x4200264a69b6]ecore_recovery_prolog@(qedentv)#<None>+0x2b stack: 0x4309d77f39a0
    2020-02-20T06:14:03.698Z cpu39:1000368420)0x451a59c9aee0:[0x420026475f45]qedentv_recovery_handler@(qedentv)#<None>+0x11e stack: 0x4309d77de5c0
    2020-02-20T06:14:03.723Z cpu39:1000368420)0x451a59c9af00:[0x42002512fa1c]HelperQueueFunc@vmkernel#nover+0x7d9 stack: 0x0
    2020-02-20T06:14:03.744Z cpu39:1000368420)0x451a59c9afd0:[0x420025500110]CpuSched_StartWorld@vmkernel#nover+0xf9 stack: 0x0
    2020-02-20T06:14:03.765Z cpu39:1000368420)0x451a59c9b000:[0x420025115007]Debug_IsInitialized@vmkernel#nover+0x18 stack: 0x0

    [2]
    2020-02-18T00:51:34.588Z cpu38:1001648108)@BlueScreen: Failed at vmkdrivers/native/Proprietary/Network/qedentv/ecore/ecore_spq.c:206 -- VMK_ASSERT(!(1))
    2020-02-18T00:51:34.617Z cpu38:1001648108)Code start: 0x420019000000 VMK uptime: 0:22:58:07.790
    2020-02-18T00:51:34.654Z cpu38:1001648108)0x451a6851aab0:[0x420019162cab]PanicvPanicInt@vmkernel#nover+0x2b3 stack: 0x420019162cab
    2020-02-18T00:51:34.697Z cpu38:1001648108)0x451a6851ab60:[0x4200191635a2]Panic_vPanic@vmkernel#nover+0x23 stack: 0x4501ac15ec48
    2020-02-18T00:51:34.742Z cpu38:1001648108)0x451a6851ab80:[0x4200191862c0]vmk_PanicWithModuleID@vmkernel#nover+0x41 stack: 0x451a6851abe0
    2020-02-18T00:51:34.788Z cpu38:1001648108)0x451a6851abe0:[0x42001a2a5022]ecore_spq_post@(qedentv)#<None>+0x627 stack: 0x6920646f726d6152
    2020-02-18T00:51:34.834Z cpu38:1001648108)0x451a6851ad30:[0x42001a2cdae8]ecore_eth_txq_start_ramrod@(qedentv)#<None>+0xbd stack: 0x2
    2020-02-18T00:51:34.882Z cpu38:1001648108)0x451a6851ad80:[0x42001a2e4319]ecore_iov_process_mbx_req@(qedentv)#<None>+0x1d7e stack: 0x4311201d3202
    2020-02-18T00:51:34.931Z cpu38:1001648108)0x451a6851aeb0:[0x42001a2884c3]qedentv_handle_vf_msg@(qedentv)#<None>+0x1ec stack: 0x42001a2887c4
    2020-02-18T00:51:34.978Z cpu38:1001648108)0x451a6851af20:[0x42001a2887de]qedentv_sriov_task@(qedentv)#<None>+0x1ff stack: 0x4200191c32c8
    2020-02-18T00:51:35.023Z cpu38:1001648108)0x451a6851af70:[0x420019191048]vmkWorldFunc@vmkernel#nover+0x6d stack: 0x420019191044
    2020-02-18T00:51:35.064Z cpu38:1001648108)0x451a6851afd0:[0x420019500110]CpuSched_StartWorld@vmkernel#nover+0xf9 stack: 0x0
    2020-02-18T00:51:35.105Z cpu38:1001648108)0x451a6851b000:[0x420019115007]Debug_IsInitialized@vmkernel#nover+0x18 stack: 0x0

    [3]
    2020-02-19T06:38:04.710Z cpu28:1001439035)@BlueScreen: Failed at vmkdrivers/native/Proprietary/Network/qedentv/ecore/ecore_int.c:439 -- VMK_ASSERT(!(1))
    2020-02-19T06:38:04.738Z cpu28:1001439035)Code start: 0x420036400000 VMK uptime: 0:02:03:23.802
    2020-02-19T06:38:04.776Z cpu28:1001439035)0x451a77a99e50:[0x420036562cab]PanicvPanicInt@vmkernel#nover+0x2b3 stack: 0x420036562cab
    2020-02-19T06:38:04.819Z cpu28:1001439035)0x451a77a99f00:[0x4200365635a2]Panic_vPanic@vmkernel#nover+0x23 stack: 0x43120f6d4c6d
    2020-02-19T06:38:04.863Z cpu28:1001439035)0x451a77a99f20:[0x4200365862c0]vmk_PanicWithModuleID@vmkernel#nover+0x41 stack: 0x451a77a99f80
    2020-02-19T06:38:04.910Z cpu28:1001439035)0x451a77a99f80:[0x4200376b9539]ecore_fw_assertion@(qedentv)#<None>+0xbe stack: 0x451a77a9a090
    2020-02-19T06:38:04.958Z cpu28:1001439035)0x451a77a9a090:[0x4200376ba73b]ecore_int_deassertion@(qedentv)#<None>+0x454 stack: 0x5a30333600000001
    2020-02-19T06:38:05.003Z cpu28:1001439035)0x451a77a9a280:[0x4200376bb1fb]ecore_int_sp_dpc@(qedentv)#<None>+0x4e0 stack: 0x1
    2020-02-19T06:38:05.043Z cpu28:1001439035)0x451a77a9a2f0:[0x420036534e21]IntrCookieBH@vmkernel#nover+0x336 stack: 0x3e8
    2020-02-19T06:38:05.080Z cpu28:1001439035)0x451a77a9a3a0:[0x420036507a98]BH_Check@vmkernel#nover+0x349 stack: 0x7
    2020-02-19T06:38:05.123Z cpu28:1001439035)0x451a77a9a440:[0x420036a4055a]UserMem_HandleMapFault@vmkernel#nover+0x1d03 stack: 0x431baea02010
    2020-02-19T06:38:05.170Z cpu28:1001439035)0x451a77a9ae60:[0x420036b055f7]User_ArchExceptionHandleFault@vmkernel#nover+0x1b0 stack: 0x0
    2020-02-19T06:38:05.213Z cpu28:1001439035)0x451a77a9aec0:[0x420036a190e6]User_Exception@vmkernel#nover+0x183 stack: 0x3c0000000
    2020-02-19T06:38:05.252Z cpu28:1001439035)0x451a77a9af20:[0x4200365d99e5]Int14_PF@vmkernel#nover+0x406 stack: 0x0
    2020-02-19T06:38:05.290Z cpu28:1001439035)0x451a77a9af40:[0x4200365d1076]gate_entry@vmkernel#nover+0x77 stack: 0x80010031


    Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
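    To check a collected log for this signature, the panic header can be matched with a regular expression. The following is a minimal sketch only: the function name find_qedentv_asserts and the pattern are illustrative assumptions based on the excerpts above, not a VMware-provided tool, and the source file names and line numbers may differ between driver builds.

```python
import re

# Matches the panic header format shown in the excerpts above, for example:
# "@BlueScreen: Failed at vmkdrivers/.../qedentv/ecore/ecore_dev.c:4308 -- VMK_ASSERT(!(1))"
# The pattern is an assumption derived from those excerpts only.
PANIC_RE = re.compile(r"@BlueScreen: Failed at (\S*/qedentv/\S+):(\d+)")


def find_qedentv_asserts(log_text):
    """Return (source_file, line_number) for each qedentv panic header in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = PANIC_RE.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits


if __name__ == "__main__":
    sample = (
        "2020-02-20T06:14:03.542Z cpu39:1000368420)@BlueScreen: Failed at "
        "vmkdrivers/native/Proprietary/Network/qedentv/ecore/ecore_dev.c:4308 "
        "-- VMK_ASSERT(!(1))"
    )
    for source_file, line_number in find_qedentv_asserts(sample):
        print(source_file, line_number)
```

    A match only indicates that the panic originated in the qedentv driver. Compare the full backtrace with the examples above before concluding that this article applies.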


Environment

VMware vSphere ESXi 7.0.0

Cause

This issue occurs because, with SRIOV and RoCE both enabled, the adapter occasionally hits hardware parity attentions (i.e., TCM, XCM), which trigger engine reset recovery.

In this driver version, when engine reset recovery runs while virtual functions (VFs) are in use, the driver can enter a bad state and cause a PSOD because the VF driver is not aware of the engine reset.

Resolution

To resolve this issue, ensure that the RoCE function is disabled when the SRIOV function is in use.

Note: The RoCE function is disabled in qedentv by default. VMware recommends not setting "enable_roce=1" in the qedentv module parameters when the SRIOV function is in use.
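
The check described above can be sketched as a small helper that inspects a module parameter string. This is an illustration under stated assumptions: the function name roce_conflicts_with_sriov is hypothetical, the parameter string is assumed to be space-separated name=value pairs (the form typically passed when setting module parameters), and whether SRIOV is in use is supplied by the caller rather than detected. On a host, the current qedentv values can usually be viewed with "esxcli system module parameters list -m qedentv".

```python
def roce_conflicts_with_sriov(param_string, sriov_in_use):
    """Return True if enable_roce=1 is set while SRIOV is in use.

    param_string: space-separated "name=value" pairs (assumed format).
    sriov_in_use: supplied by the caller; this sketch does not detect it.
    """
    params = {}
    for token in param_string.split():
        name, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not name=value pairs
            params[name] = value
    # RoCE is disabled by default, so a missing parameter counts as "0".
    roce_enabled = params.get("enable_roce", "0") == "1"
    return sriov_in_use and roce_enabled


if __name__ == "__main__":
    # Flags the configuration this article warns against.
    print(roce_conflicts_with_sriov("enable_roce=1", sriov_in_use=True))
    # Default configuration: enable_roce is not set.
    print(roce_conflicts_with_sriov("", sriov_in_use=True))
```

If the helper reports a conflict, remove "enable_roce=1" from the qedentv module parameters as described in the note above.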