Multiple DB crashes during maintenance activity involving a vmhba

Article ID: 394832

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • DB VMs show as inaccessible or hung in vCenter.

Validation Step:

  • The /vmfs/volumes/<datastoreUUID>/<vmname>/vmware.log file reports I/O aborts:
    YYYY-MM-DDTHH:MM.SSSZ In(05) vcpu-2 - PVSCSI: scsi#:01: aborting cmd 0x35e
    YYYY-MM-DDTHH:MM.SSSZ In(05) vcpu-7 - PVSCSI: scsi#:02: aborting cmd 0x3c2
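
When vmware.log is large, it can help to count the abort messages per virtual SCSI target instead of reading them one by one. The following is a minimal sketch, assuming the abort lines have the same shape as the two samples above. The function name and the regular expression are illustrative and are not part of any VMware tool.

```python
import re
from collections import Counter

# Matches PVSCSI abort lines of the form shown in the sample above, e.g.
# "... vcpu-2 - PVSCSI: scsi#:01: aborting cmd 0x35e"
# (assumption: real logs follow the same layout as the sample lines).
ABORT_RE = re.compile(r"PVSCSI: (scsi\S*?): aborting cmd (0x[0-9a-fA-F]+)")


def count_pvscsi_aborts(lines):
    """Return a Counter mapping each virtual SCSI target to its abort count."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the log excerpt above.
sample = [
    "YYYY-MM-DDTHH:MM.SSSZ In(05) vcpu-2 - PVSCSI: scsi#:01: aborting cmd 0x35e",
    "YYYY-MM-DDTHH:MM.SSSZ In(05) vcpu-7 - PVSCSI: scsi#:02: aborting cmd 0x3c2",
]
abort_counts = count_pvscsi_aborts(sample)
for target, count in sorted(abort_counts.items()):
    print(f"{target}: {count} abort(s)")
```

To run it against a real file, pass an open file object for the VM's vmware.log to count_pvscsi_aborts() in place of the sample list.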

Environment

VMware vSphere ESXi 6.x
VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Cause

The vmhba was under maintenance but had not been properly shut down. Its paths therefore remained active, and the ESXi host, following the round-robin path selection policy, continued to send I/O through them.
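
The effect can be illustrated with a toy model of round-robin path selection. This is a simplified sketch and not the VMW_PSP_RR implementation: the function name, the iops parameter (the number of I/Os sent before switching paths), and the I/O count are assumptions made for illustration. The point is only that a policy which rotates across every path still marked active keeps sending a fixed share of the I/O down a path whose adapter is under maintenance.

```python
from collections import Counter
from itertools import cycle


def simulate_round_robin(paths, total_ios, iops=1):
    """Count how many I/Os a simplified round-robin policy sends per path.

    The model switches to the next path after `iops` I/Os and, like any
    policy that trusts the path state, never checks whether the path is
    actually able to complete the I/O.
    """
    sent = Counter()
    path_cycle = cycle(paths)
    current = next(path_cycle)
    issued_on_current = 0
    for _ in range(total_ios):
        if issued_on_current == iops:
            current = next(path_cycle)
            issued_on_current = 0
        sent[current] += 1
        issued_on_current += 1
    return sent


# Two paths that are both still marked active (hypothetical I/O count).
sent = simulate_round_robin(
    ["vmhba1:C0:T1:L19", "vmhba2:C0:T1:L19"], total_ios=1000, iops=1
)
for path, count in sent.items():
    print(f"{path}: {count} I/Os")
```

With two active paths and a switch after every I/O, half of the I/Os in this model land on the vmhba2 path even though that adapter cannot service them.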

Cause validation:

  • Verify the paths configured for the affected datastore with the command "esxcli storage nmp device list":
    esxcli storage nmp device list
    naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:
       Device Display Name: IBM Fibre Channel Disk (naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx)
       Storage Array Type: VMW_SATP_ALUA
       Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; action_OnRetryErrors=off; {TPG_id=17,TPG_state=AO}{TPG_id=16,TPG_state=ANO}}
       Path Selection Policy: VMW_PSP_RR
       Path Selection Policy Device Config: {policy=iops,iops=1,bytes=10485760,useANO=0; lastPathIndex=1: NumIOsPending=0,numBytesPending=0}
       Path Selection Policy Device Custom Config:
       Working Paths: vmhba2:C0:T1:L19, vmhba1:C0:T1:L19
       Is USB: false

    The output above confirms that the storage device has two working paths, uses the round-robin path selection policy (VMW_PSP_RR), and switches paths after every I/O (iops=1).

  • /var/run/log/vmkernel.log confirms I/O aborts on the paths configured through vmhba2:
    YYYY-MM-DDTHH:MM.SSSZ cpu6:5972189)qlnativefc: vmhba2(8:0.1): qlnativefcEhAbort:2746:qlnativefcEhAbort: abortCommand mbx success.
    YYYY-MM-DDTHH:MM.SSSZ cpu12:5189234)qlnativefc: vmhba2(8:0.1): qlnativefcStatusEntry:2077:C0:T11:L45 - FCP command status: 0x5-0x0 (0x8) portid=01bd81 oxid=0x3e8 cdb=880000 len=16384 rspInfo=0x0 resid=0x0 fwResid=0x0 host status = 0x8 device status =$
    YYYY-MM-DDTHH:MM.SSSZ cpu0:2098223)NMP: nmp_ThrottleLogForDevice:3867: Cmd 0x88 (0x45b9a60b6348, 5972180) to dev "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" on path "vmhba2:C0:T11:L45" Failed:
    YYYY-MM-DDTHH:MM.SSSZ cpu0:2098223)NMP: nmp_ThrottleLogForDevice:3875: H:0x8 D:0x0 P:0x0 . Act:EVAL. cmdId.initiator=0x430d54680ec0 CmdSN 0x3dc
    YYYY-MM-DDTHH:MM.SSSZ cpu12:5189234)PVSCSI: 2698: scsi1#:01: SCSI ABORT ctx=0x76
    YYYY-MM-DDTHH:MM.SSSZ cpu12:5189234)PVSCSI: 2698: scsi#:02: SCSI ABORT ctx=0x3fb
    YYYY-MM-DDTHH:MM.SSSZ cpu6:5972189)PVSCSI: 2698: scsi#:03: SCSI ABORT ctx=0x3dc
    YYYY-MM-DDTHH:MM.SSSZ cpu3:5189230)PVSCSI: 2698: scsi#:04: SCSI ABORT ctx=0x1d9
    YYYY-MM-DDTHH:MM.SSSZ cpu3:5189230)PVSCSI: 2698: scsi#:05: SCSI ABORT ctx=0x59

    These logs confirm that vmhba2 was not fully brought down during the maintenance activity, so the ESXi host kept issuing I/O across both active paths, which led to continuous I/O aborts.
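
To see quickly which adapter the failures cluster on, the NMP failure lines can be tallied per vmhba. The following is a minimal sketch, assuming the vmkernel log lines match the layout of the excerpt above. The function name and regular expressions are illustrative only.

```python
import re
from collections import Counter

# 'on path "vmhba2:C0:T11:L45" Failed' -> full path and adapter name.
PATH_RE = re.compile(r'on path "((vmhba\d+):C\d+:T\d+:L\d+)" Failed')
# 'H:0x8 D:0x0 P:0x0' -> host, device and plugin status codes.
STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
)


def summarize_nmp_failures(lines):
    """Return (failures per adapter, occurrences per host status code)."""
    failures_by_adapter = Counter()
    host_statuses = Counter()
    for line in lines:
        if "nmp_ThrottleLogForDevice" not in line:
            continue  # only NMP failure reports are of interest here
        path_match = PATH_RE.search(line)
        if path_match:
            failures_by_adapter[path_match.group(2)] += 1
        status_match = STATUS_RE.search(line)
        if status_match:
            host_statuses[status_match.group(1)] += 1
    return failures_by_adapter, host_statuses


# Sample lines copied from the vmkernel excerpt above.
sample = [
    'YYYY-MM-DDTHH:MM.SSSZ cpu0:2098223)NMP: nmp_ThrottleLogForDevice:3867: '
    'Cmd 0x88 (0x45b9a60b6348, 5972180) to dev '
    '"naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" on path "vmhba2:C0:T11:L45" Failed:',
    "YYYY-MM-DDTHH:MM.SSSZ cpu0:2098223)NMP: nmp_ThrottleLogForDevice:3875: "
    "H:0x8 D:0x0 P:0x0 . Act:EVAL. cmdId.initiator=0x430d54680ec0 CmdSN 0x3dc",
]
failures_by_adapter, host_statuses = summarize_nmp_failures(sample)
print(dict(failures_by_adapter), dict(host_statuses))
```

If all failures are reported against a single adapter, as with vmhba2 here, that points to the adapter under maintenance rather than to the storage array or the datastore.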

Resolution

Recommendations for maintenance activities involving vmhba:

  • Before any activity involving a specific vmhba, bring that vmhba down. This also takes down all paths configured through it.
  • Expect some I/O aborts initially.
  • Once the paths are marked as down, the ESXi host stops attempting I/O through them, which prevents further I/O aborts.
  • Ensure that the alternate paths through the other vmhbas remain active and fully operational during this process.
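
This article does not prescribe a specific method for bringing the vmhba down. One possible approach, shown here only as a sketch, is to administratively disable each path that runs through the adapter with "esxcli storage core path set --state off" before the maintenance starts. The helper below is hypothetical: it only builds the command strings from a list of working paths and refuses to do so if no path on another adapter would remain. Whether disabling paths this way is appropriate depends on the environment and the type of maintenance.

```python
def build_path_off_commands(working_paths, adapter):
    """Return esxcli commands that set every path on `adapter` to off.

    Raises ValueError if disabling those paths would leave the device
    without a path on any other adapter.
    """
    on_adapter = [p for p in working_paths if p.split(":", 1)[0] == adapter]
    remaining = [p for p in working_paths if p.split(":", 1)[0] != adapter]
    if not remaining:
        raise ValueError(f"no alternate path would remain if {adapter} goes down")
    return [
        f"esxcli storage core path set --state off --path {path}"
        for path in on_adapter
    ]


# Working paths taken from the "esxcli storage nmp device list" output above.
working_paths = ["vmhba2:C0:T1:L19", "vmhba1:C0:T1:L19"]
commands = build_path_off_commands(working_paths, "vmhba2")
for command in commands:
    print(command)
```

After the maintenance, the same command with "--state active" re-enables a path. Review the generated commands before running them on a host.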

Additional Information

For more information, refer to the KB article: VMware Multipathing policies in ESXi/ESX