Multiple DB crashes during maintenance activity involving a vmhba.

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

DB VMs show as inaccessible or hung in vCenter.

Validation Step:

The /vmfs/volumes/<datastoreUUID>/<vmname>/vmware.log file reports I/O aborts:
YYYY-MM-DDTHH:MM.SSSZ In(05) vcpu-2 - PVSCSI: scsi#:01: aborting cmd 0x35e
YYYY-MM-DDTHH:MM.SSSZ In(05) vcpu-7 - PVSCSI: scsi#:02: aborting cmd 0x3c2
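
To gauge how widespread the aborts are, the vmware.log lines can be tallied per virtual SCSI device. The following is a minimal sketch, not an official tool: the function name count_pvscsi_aborts is hypothetical, and the pattern assumes the abort lines have the same shape as the two samples above.

```python
import re
from collections import Counter

# Matches lines shaped like the samples above, for example:
#   ... In(05) vcpu-2 - PVSCSI: scsi#:01: aborting cmd 0x35e
# Group 1 is the virtual SCSI device label, group 2 the aborted command id.
ABORT_RE = re.compile(r"PVSCSI: (\S+): aborting cmd (0x[0-9a-fA-F]+)")


def count_pvscsi_aborts(lines):
    """Count PVSCSI abort messages per virtual SCSI device label."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

The function accepts any iterable of lines, so an open vmware.log file handle can be passed directly. A high count spread across several devices points to a path-level problem rather than a single faulty virtual disk.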

Environment

VMware vSphere ESXi 6.x
VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Cause

The vmhba was undergoing maintenance but was not properly shut down. As a result, the paths configured using the vmhba remained active, and the ESXi host continued sending I/Os through them due to the round-robin policy.

Cause validation:

Verify the paths configured for the affected datastore using the command "esxcli storage nmp device list":
esxcli storage nmp device list
naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:
Device Display Name: IBM Fibre Channel Disk (naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
Storage Array Type: VMW_SATP_ALUA
Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; action_OnRetryErrors=off; {TPG_id=17,TPG_state=AO}{TPG_id=16,TPG_state=ANO}}
Path Selection Policy: VMW_PSP_RR
Path Selection Policy Device Config: {policy=iops,iops=1,bytes=10485760,useANO=0; lastPathIndex=1: NumIOsPending=0,numBytesPending=0}
Path Selection Policy Device Custom Config:
Working Paths: vmhba2:C0:T1:L19, vmhba1:C0:T1:L19
Is USB: false

The output above confirms that the storage device has two working paths (one through vmhba1 and one through vmhba2) and uses the round-robin path selection policy (VMW_PSP_RR) with iops=1, so the host switches to the next path after every I/O.
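
When many devices are listed, the same check can be scripted. The sketch below parses saved "esxcli storage nmp device list" output and reports which adapters carry working paths for each device. The function names are hypothetical, and the parser assumes each device header is a single identifier without spaces followed by "Key: Value" lines, as in the output above.

```python
def parse_nmp_device_list(text):
    """Parse `esxcli storage nmp device list` output into {device: {field: value}}."""
    devices = {}
    fields = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " " not in line:
            # Device header: an identifier with no spaces (a trailing ":" is dropped).
            fields = devices.setdefault(line.rstrip(":"), {})
        elif fields is not None and ":" in line:
            # Field line: split on the first colon only, because values
            # such as path names contain colons themselves.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return devices


def working_adapters(fields):
    """Return the sorted adapter names (vmhbaN) found in 'Working Paths'."""
    paths = [p.strip() for p in fields.get("Working Paths", "").split(",") if p.strip()]
    return sorted({p.split(":")[0] for p in paths})
```

A device whose policy is VMW_PSP_RR and whose working adapters include the vmhba scheduled for maintenance will keep receiving I/O through that vmhba until its paths are marked down.
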
The /var/run/log/vmkernel.log entries confirm I/O aborts on the paths configured through vmhba2:

YYYY-MM-DDTHH:MM.SSSZ cpu6:5972189)qlnativefc: vmhba2(8:0.1): qlnativefcEhAbort:2746:qlnativefcEhAbort: abortCommand mbx success.
YYYY-MM-DDTHH:MM.SSSZ cpu12:5189234)qlnativefc: vmhba2(8:0.1): qlnativefcStatusEntry:2077:C0:T11:L45 - FCP command status: 0x5-0x0 (0x8) portid=01bd81 oxid=0x3e8 cdb=880000 len=16384 rspInfo=0x0 resid=0x0 fwResid=0x0 host status = 0x8 device status =$
YYYY-MM-DDTHH:MM.SSSZ cpu0:2098223)NMP: nmp_ThrottleLogForDevice:3867: Cmd 0x88 (0x45b9a60b6348, 5972180) to dev "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" on path "vmhba2:C0:T11:L45" Failed:
YYYY-MM-DDTHH:MM.SSSZ cpu0:2098223)NMP: nmp_ThrottleLogForDevice:3875: H:0x8 D:0x0 P:0x0 . Act:EVAL. cmdId.initiator=0x430d54680ec0 CmdSN 0x3dc
YYYY-MM-DDTHH:MM.SSSZ cpu12:5189234)PVSCSI: 2698: scsi1#:01: SCSI ABORT ctx=0x76
YYYY-MM-DDTHH:MM.SSSZ cpu12:5189234)PVSCSI: 2698: scsi#:02: SCSI ABORT ctx=0x3fb
YYYY-MM-DDTHH:MM.SSSZ cpu6:5972189)PVSCSI: 2698: scsi#:03: SCSI ABORT ctx=0x3dc
YYYY-MM-DDTHH:MM.SSSZ cpu3:5189230)PVSCSI: 2698: scsi#:04: SCSI ABORT ctx=0x1d9
YYYY-MM-DDTHH:MM.SSSZ cpu3:5189230)PVSCSI: 2698: scsi#:05: SCSI ABORT ctx=0x59

These logs confirm that vmhba2 was not fully brought down during the maintenance activity, so the ESXi host kept issuing I/Os over both active paths, which led to continuous I/O aborts.
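
To confirm that the failures are concentrated on one adapter, the NMP failure lines can be counted per vmhba. This is a minimal sketch with a hypothetical function name. It assumes the failure lines match the nmp_ThrottleLogForDevice sample above, where the path name is quoted and followed by "Failed".

```python
import re
from collections import Counter

# Matches the NMP failure lines shown above, for example:
#   ... to dev "naa.xxxx" on path "vmhba2:C0:T11:L45" Failed:
# Group 1 is the device, group 2 the full path, group 3 the adapter.
NMP_FAIL_RE = re.compile(r'to dev "([^"]+)" on path "((vmhba\d+):[^"]+)" Failed')


def failed_commands_by_adapter(lines):
    """Count NMP command failures per adapter (vmhbaN)."""
    counts = Counter()
    for line in lines:
        match = NMP_FAIL_RE.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts
```

If nearly all failures map to the adapter under maintenance while the other adapter shows none, the result is consistent with the cause described in this article.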

Resolution

Recommendations for maintenance activities involving vmhba:

For any activity involving a specific vmhba, bring down that vmhba first. This also takes down all paths configured through it.
Expect some initial I/O aborts while the paths go down.
Once a path is marked as down, the ESXi host stops attempting I/Os through it, which prevents further I/O aborts.
Ensure that the alternate paths through the other vmhba(s) remain active and fully operational throughout the process.
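
This article does not prescribe a specific method for bringing the vmhba down. One possible approach, shown here as an assumption and not as the only supported procedure, is to set each path on that adapter to the off state with "esxcli storage core path set". The sketch below only generates the command strings for review and refuses to do so when no working path would remain on another adapter. The function name is hypothetical.

```python
def path_off_commands(working_paths, adapter):
    """Build esxcli commands that set every path on `adapter` to the off state.

    Raises ValueError if disabling the adapter would leave no working path,
    which mirrors the recommendation to keep the alternate paths operational.
    """
    on_adapter = [p for p in working_paths if p.split(":")[0] == adapter]
    remaining = [p for p in working_paths if p.split(":")[0] != adapter]
    if not remaining:
        raise ValueError(f"no working paths would remain after disabling {adapter}")
    # Assumption: paths are addressed by runtime name (vmhbaN:Cx:Ty:Lz).
    return [
        f"esxcli storage core path set --state=off --path={path}"
        for path in on_adapter
    ]
```

For the device shown earlier in this article, calling path_off_commands(["vmhba2:C0:T1:L19", "vmhba1:C0:T1:L19"], "vmhba2") yields a single command for vmhba2:C0:T1:L19. After the maintenance, the same command with --state=active re-enables the path.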

Additional Information

For more information, refer to the KB article: VMware Multipathing policies in ESXi/ESX