ESXi Host reports PSOD due to async qedi driver in use


Article ID: 318052


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
A host configured with software iSCSI hits a PSOD (purple screen of death) at random.
On a host equipped with a QLogic QL41xxx or QL45xxx CNA, ESXi might encounter a PSOD at random while software iSCSI is running. The backtrace looks similar to the following:

YYYY-MM-DD HH:MM:SS cpu10:13955491)World: 3072: PRDA 0x420042800000 ss 0x0 ds 0x10b es 0x10b fs 0x10b gs 0x0
YYYY-MM-DD HH:MM:SS cpu10:13955491)World: 3074: TR 0xf58 GDT 0x453880014000 (0xf77) IDT 0x420010b50000 (0xfff)
YYYY-MM-DD HH:MM:SS cpu10:13955491)World: 3075: CR0 0x80010031 CR3 0x8048138000 CR4 0x142768
YYYY-MM-DD HH:MM:SS cpu10:13955491)Backtrace for current CPU #10, worldID=13955491, fp=0x453913c1bd60
YYYY-MM-DD HH:MM:SS cpu10:13955491)0x453913c1bcb8:[0x420010ca533b]vmk_PktListPopFirstPkt@vmkernel#nover+0xb stack: 0x0, 0x45d95ca81480, 0x420011a9b949, 0x1, 0x420010c6932b
YYYY-MM-DD HH:MM:SS cpu10:13955491)0x453913c1bcc0:[0x420011a9b9e5]qedi_UplinkTx@(qedi)#<None>+0xf6 stack: 0x45d95ca81480, 0x420011a9b949, 0x1, 0x420010c6932b, 0x453913c1bd40
YYYY-MM-DD HH:MM:SS cpu10:13955491)0x453913c1bd70:[0x420010c8f495]UplinkDevDoTransmit@vmkernel#nover+0x372 stack: 0x4305b510bd80, 0xa00004410b0885b, 0x453913c1bef0, 0xc5, 0x0
YYYY-MM-DD HH:MM:SS cpu10:13955491)0x453913c1be60:[0x420010c8fb9f]UplinkDevTransmit@vmkernel#nover+0x178 stack: 0x4301914579c0, 0x420042800210, 0x420042800000, 0x420042800218, 0x0
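
To check whether a saved backtrace matches this signature, it can be searched for the two leading frames. The following is a minimal sketch, assuming the backtrace is available as plain text; the function name matches_qedi_psod and the file name in the usage comment are illustrative and not part of any VMware tooling.

```shell
# Hypothetical helper: succeeds (exit status 0) only if the text on stdin
# contains both leading frames of the backtrace shown above.
matches_qedi_psod() {
    awk '
        /qedi_UplinkTx@\(qedi\)/          { tx = 1 }
        /vmk_PktListPopFirstPkt@vmkernel/ { pop = 1 }
        END { exit !(tx && pop) }
    '
}

# Usage (the file name is an assumption):
#   matches_qedi_psod < backtrace.txt && echo "matches the qedi PSOD signature"
```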

Environment

VMware vSphere ESXi 7.0.3

Cause

An incorrect software iSCSI configuration causes the PSOD to occur at random. For example, the command 'esxcfg-nics -l' lists the NICs claimed by the qedi driver as follows:

   vmnic2 0000:2b:00.4 qedi Up Up 50000 Full a2:0d:c7:70:00:04 9000 QLogic Corp QLogic FastLinQ QL45xxx Series 10/25/40/50/100 GbE Controller (iSCSI)
   vmnic3 0000:2b:00.5 qedi Up Up 50000 Full a2:0d:c7:70:00:05 9000 QLogic Corp QLogic FastLinQ QL45xxx Series 10/25/40/50/100 GbE Controller (iSCSI)

   These are iSCSI NICs, which are not meant to be used as regular NICs. If the software iSCSI network ports are nevertheless configured on these NICs, the uplink device becomes unstable and the host crashes at random.
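
On a host with many adapters, the qedi-claimed NICs can be picked out of the listing by filtering on the driver column. This is a minimal sketch, assuming the driver name is the third whitespace-separated field, as in the two lines above; the function name nics_with_driver is hypothetical.

```shell
# Hypothetical helper: read `esxcfg-nics -l` output on stdin and print the
# names of the NICs whose Driver column (third field) equals $1.
nics_with_driver() {
    awk -v drv="$1" '$1 ~ /^vmnic[0-9]+$/ && $3 == drv { print $1 }'
}

# Usage on a host:
#   esxcfg-nics -l | nics_with_driver qedi
```

For the two lines listed above, this prints vmnic2 and vmnic3.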

 

Resolution

 

  •  Validate the NIC configuration using the steps below:

esxcfg-scsidevs -a

HBA Name  Driver     Link State  UID                                   Capabilities         Description
--------  ---------  ----------  ------------------------------------  -------------------  -----------
vmhba0    nvme_pcie  link-n/a    pcie.200                                                   (0000:02:00.0) Marvell Technology Group Ltd Marvell NR2241 NVMe Controller
vmhba64   qedi       unbound     iscsi.vmhba64                         Second Level Lun ID  QLogic FastLinQ QL45xxx Series 10/25/40/50/100 GbE Controller (iSCSI)
vmhba65   qedi       unbound     iscsi.vmhba65                         Second Level Lun ID  QLogic FastLinQ QL45xxx Series 10/25/40/50/100 GbE Controller (iSCSI)
vmhba66   iscsi_vmk  online      iqn.1998-01.com.vmware:am4-esx-vh001  Second Level Lun ID  iSCSI Software Adapter

  • Make sure the vmnics claimed by the qedi driver are not used for software iSCSI traffic.
  • These are iSCSI NICs, which are not meant to be used as regular NICs.

Use NICs claimed by the qedentv driver as the network portal for software iSCSI, or switch to hardware iSCSI. For example:

vmnic4 0000:86:00.0 qedentv Up 10000Mbps Full f4:e9:d4:ed:a0:7a 1500 QLogic Corp. QLogic FastLinQ QL41xxx 1/10/25 GbE Ethernet Adapter
vmnic5 0000:86:00.1 qedentv Up 10000Mbps Full f4:e9:d4:ed:a0:7b 1500 QLogic Corp. QLogic FastLinQ QL41xxx 1/10/25 GbE Ethernet Adapter
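
To confirm which VMkernel ports bound to the software iSCSI adapter still sit on a qedi uplink, the network portal listing can be filtered by driver. This is a sketch under stated assumptions: it assumes `esxcli iscsi networkportal list` prints a "Vmknic:" and a "NIC Driver:" field for each portal, which should be verified against the output on the host. The function name portals_on_driver is hypothetical.

```shell
# Hypothetical helper: read `esxcli iscsi networkportal list` output on stdin
# and print each bound VMkernel port whose uplink driver equals $1.
# Assumption: every portal entry carries "Vmknic:" and "NIC Driver:" fields.
portals_on_driver() {
    awk -v drv="$1" '
        $1 == "Vmknic:"                             { vmk = $2 }
        $1 == "NIC" && $2 == "Driver:" && $3 == drv { print vmk }
    '
}

# Usage on a host:
#   esxcli iscsi networkportal list | portals_on_driver qedi
```

Any VMkernel port printed for qedi is a candidate to move to an uplink claimed by qedentv.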

Additional Information

Impact/Risks:
  • During this PSOD, the host crashes and all services and VMs running on it are terminated. The VMs do not get a chance to shut down gracefully and are powered off abruptly.
  • These PSODs are caused by packet (list) corruption, which occurs because the iSCSI connection receives or transmits its packets through NICs managed by the qedi driver.

