PSOD Can Occur When Using QFLE3 Driver

Article ID: 318010


Products

VMware vSphere ESXi

Issue/Introduction

A host crashes with a purple diagnostic screen (PSOD).

  • The PSOD will contain a backtrace similar to the following:
2020-05-07T08:30:26.104Z cpu59:5261210)WARNING: Heartbeat: 760: PCPU 44 didn't have a heartbeat for 21 seconds; *may* be locked up.
2020-05-07T08:30:26.104Z cpu44:2097436)ALERT: NMI: 696: NMI IPI: RIPOFF(base):RBP:CS [0xc7490(0x418001000000):0x4302ee371a80:0xfc8] (Src 0x1, CPU44)
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1b788:[0x4180010c748f]SafeMemAccess_CmpXchg4ExceptionPossible@vmkernel#nover+0xe stack: 0x4302ee371d40
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1b790:[0x418001157a50]FastSlabCreateObj@vmkernel#nover+0x88 stack: 0x100000001
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1b810:[0x418001158013]FastSlabReplenishCPU@vmkernel#nover+0x6e stack: 0x41804b005fd0
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1b850:[0x418001156425]FastSlabAllocSlow@vmkernel#nover+0x7e stack: 0x0
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1b870:[0x4180011564de]FastSlab_AllocWithTimeout@vmkernel#nover+0x83 stack: 0x451b88e1b9b8
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1b8c0:[0x41800103c959]vmk_PageSlabAlloc@vmkernel#nover+0x22 stack: 0x451b00000800
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1b8d0:[0x4180011d30aa]PktPageAlloc_AllocPages@vmkernel#nover+0x37 stack: 0x451b88e1b950
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1b950:[0x41800125f56b]vmk_PktAllocPage@vmkernel#nover+0x10 stack: 0x4310177ed010
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1b960:[0x418001b3b295]qfle3_page_alloc_and_map@(qfle3)#<None>+0x22 stack: 0xeb4bbd3
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1b9b0:[0x418001b51235]qfle3_alloc_rx_sge_mbuf@(qfle3)#<None>+0x2e stack: 0x3f
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1b9f0:[0x418001b517d4]qfle3_alloc_fp_buffers@(qfle3)#<None>+0x2f5 stack: 0x2d35302d30323032
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1ba60:[0x418001b3d000]qfle3_rq_create@(qfle3)#<None>+0x3a9 stack: 0x0
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1bae0:[0x418001af4a77]qfle3_cmd_create_q@(qfle3)#<None>+0x15c stack: 0x0
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1bb30:[0x418001b2b65e]qfle3_sm_q_cmd@(qfle3)#<None>+0x147 stack: 0x10
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1bbb0:[0x418001b3c9c2]qfle3_rq_alloc@(qfle3)#<None>+0x2d7 stack: 0x4307036b2780
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1bc40:[0x4180012dd8bd]UplinkNetq_AllocHwQueueWithAttr@vmkernel#nover+0x92 stack: 0x17
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1bc90:[0x418001217435]NetqueueBalActivatePendingRxQueues@vmkernel#nover+0x156 stack: 0x79e28088
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1bd50:[0x418001218075]NetqueueBalRxQueueCommitChanges@vmkernel#nover+0x36 stack: 0x0
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1bd90:[0x41800121b677]UplinkNetqueueBal_BalanceCB@vmkernel#nover+0x19fc stack: 0x430779e7f1d0
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1bf00:[0x4180012d8309]UplinkAsyncProcessCallsHelperCB@vmkernel#nover+0x116 stack: 0x43090803f7b0
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1bf30:[0x4180010eaf7a]HelperQueueFunc@vmkernel#nover+0x157 stack: 0x43090803f0b8
2020-05-07T08:30:26.104Z cpu44:2097436)0x451b88e1bfe0:[0x41800130f9f2]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0
2020-05-07T08:30:30.214Z cpu11:2626161)VMotion: 5367: 4078885979155334064 S: Another pre-copy iteration needed with 377085 pages left to send (prev2 8388608, prev 8388608, pages dirtied by pass through device 0, network bandwidth ~1028.528 MB/s, 5663% t$
  • In addition to the PSOD backtrace above, the following messages from the qfle3 driver may be present in /var/run/log/vmkernel.log (a sample command to search for them follows the excerpt):
2020-05-07T05:19:56.921Z cpu58:2097436)WARNING: qfle3: ecore_state_wait:315: timeout waiting for state 10
2020-05-07T05:19:56.921Z cpu58:2097436)WARNING: qfle3: qfle3_remove_queue_filter:2370: [vmnic5] RX 3 queue state not changed for fid: 0
2020-05-07T05:19:56.922Z cpu58:2097436)WARNING: qfle3: ecore_queue_chk_transition:5969: Blocking transition since pending was 400
2020-05-07T05:19:56.922Z cpu58:2097436)WARNING: qfle3: ecore_queue_state_change:4855: check transition returned an error. rc -2
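
To check whether a host is already logging these warnings, the vmkernel log can be searched directly from the ESXi shell. This is a generic search using the message strings shown above; adjust the pattern if your log lines differ.

    # Search the live vmkernel log for the qfle3/ecore warnings shown above
    grep -E "ecore_state_wait|qfle3_remove_queue_filter|ecore_queue_chk_transition" /var/run/log/vmkernel.log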



Environment

 ESXi 6.7
 ESXi 7.0

Resolution

QLogic has released updated qfle3 drivers for ESXi 6.7 and 7.0 to address this issue:

  • ESXi 6.7: Version 1.1.9.0
  • ESXi 7.0: Version 1.4.8.0

To acquire the correct driver version listed above, contact the hardware OEM or QLogic.
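
Before engaging the OEM or QLogic, it can help to confirm which qfle3 driver version is currently installed. The commands below are standard esxcli queries; vmnic5 is only an example NIC name taken from the log excerpt above, so substitute a qfle3-backed vmnic from your host.

    # List the installed qfle3 VIB and its version
    esxcli software vib list | grep qfle3

    # Show driver and firmware details for a specific NIC (vmnic5 is an example)
    esxcli network nic get -n vmnic5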

Workaround:

  1. SSH to the ESXi host as root
  2. Set the qfle3 module parameters to relieve fastslab pressure by reducing the driver's queue and ring buffer requests:
    esxcli system module parameters set -p "txqueue_nr=4 rxqueue_nr=4 rss_engine_nr=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 txring_bd_nr=1024 rxring_bd_nr=1024 enable_lro=0" -m qfle3

  3. Reboot the ESXi host for the change to take effect (a verification command is shown below)
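
After the reboot, the module options applied in step 2 can be confirmed with a standard esxcli query; the values reported should match the parameter string set above.

    # Display the configured qfle3 module parameters
    esxcli system module parameters list -m qfle3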