Lost access to volumes correlated with FC driver aborts

Article ID: 396366

Updated On:

Products

VMware vSphere ESXi
VMware vSphere ESX 7.x
VMware vSphere ESX 8.x

Issue/Introduction

ESXi hosts intermittently report "Lost access to volume" for volumes backed by FC storage:

"Lost access to volume <Datastore> due to connectivity issues. Recovery attempt is in progress"

The FC driver (lpfc in this example) intermittently logs I/O aborts over sustained periods, and these coincide with the lost-access events:

cpu##:2117121)lpfc: lpfc_handle_status:5637: 0:(0):3271: FCP cmd x89 failed <2/354> sid x521d03, did x520304, oxid x363 iotag x689 Abort Requested Host Abort Req
cpu##:18338096)NMP: nmp_ThrottleLogForDevice:3867: Cmd 0x89 (0x45da66bdf208, 2097225) to dev "naa.################################" on path "vmhba#:C#:T#:L##" Failed:


At these times, /var/log/vmkernel.log reports I/O aborts (host status H:0x5) and resets (host status H:0x8).

LUN busy events (device status D:0x8) are also observed:

####-##-##T##:##:##.###Z cpu##:2098412)ScsiDeviceIO: 4115: Cmd(0x45d988625dc8) 0x8a, CmdSN 0x800e000c from world 3129234 to dev "naa.#############################" failed H:0x0 D:0x8 P:0x0
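The H:/D:/P: fields in these messages carry the host, device, and plugin status of the failed command. As a rough illustration only, the Python sketch below pulls those fields out of a log line and names the codes this article mentions (H:0x5 abort, H:0x8 reset, D:0x8 busy). The function name `decode_status` and the lookup tables are hypothetical helpers, not part of any VMware tool, and codes not listed here are simply reported as unknown.

```python
import re

# Only the codes named in this article are mapped; this is not a full table.
HOST_STATUS = {0x0: "OK", 0x5: "ABORT", 0x8: "RESET"}
DEVICE_STATUS = {0x0: "GOOD", 0x8: "BUSY"}

# Matches the "H:0x.. D:0x.. P:0x.." triplet at the end of a failed-command line.
STATUS_RE = re.compile(r"H:0x([0-9a-fA-F]+) D:0x([0-9a-fA-F]+) P:0x([0-9a-fA-F]+)")


def decode_status(line):
    """Return the decoded host/device/plugin status, or None if absent."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin = (int(group, 16) for group in match.groups())
    return {
        "host": HOST_STATUS.get(host, f"unknown (0x{host:x})"),
        "device": DEVICE_STATUS.get(device, f"unknown (0x{device:x})"),
        "plugin": f"0x{plugin:x}",
    }


# Shortened form of the LUN busy line shown above.
sample = 'to dev "naa.#" failed H:0x0 D:0x8 P:0x0'
print(decode_status(sample))
```

Running this over a copy of vmkernel.log and counting the decoded values per device can help show whether aborts, resets, or busy responses dominate during an event.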

Environment

VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Cause

  • Performance degradation events are logged for the impacted devices in /var/log/vmkernel.log, showing a significant increase in I/O latency:
    performance has deteriorated. I/O latency increased from average value of 6661 microseconds to 826700 microseconds.
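To put the size of such a latency jump in perspective, the following minimal Python sketch extracts the two values from the message text quoted above and computes the increase factor. The function name `latency_increase` is hypothetical, and the pattern assumes the message wording matches the example exactly.

```python
import re

# Pattern based on the wording of the example message in this article.
LATENCY_RE = re.compile(
    r"I/O latency increased from average value of (\d+) microseconds "
    r"to (\d+) microseconds"
)


def latency_increase(line):
    """Return (baseline_us, current_us, factor), or None if no match."""
    match = LATENCY_RE.search(line)
    if match is None:
        return None
    baseline_us = int(match.group(1))
    current_us = int(match.group(2))
    # Avoid dividing by zero if a baseline of 0 is ever logged.
    factor = current_us / baseline_us if baseline_us else None
    return baseline_us, current_us, factor


message = (
    "performance has deteriorated. I/O latency increased from average "
    "value of 6661 microseconds to 826700 microseconds."
)
baseline_us, current_us, factor = latency_increase(message)
print(f"{baseline_us} us -> {current_us} us (about {factor:.0f}x)")
```

For the example message, latency rises from roughly 6.7 ms to roughly 827 ms, an increase of more than a hundredfold.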

The failed (aborted) I/O causes datastore heartbeats to fail and, after approximately 16 seconds, to time out.

Example from /var/log/vobd.log:
cpu##:2111626) HBX: 3063: '<datastore>': HB at offset 3702784 - Waiting for timed out HB:
cpu##:2111626) [HB state abcdef02 offset 3702784 gen 261 stampUS 2658722121357 uuid <datastore UUID> jrnl <FB 7> drv 24.82 lockImpl 4 ip ##.##.##.##]


When a heartbeat times out, ESXi marks the datastore offline until a heartbeat succeeds again.
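When collecting evidence for the storage and fabric vendors, it can help to show which lost-access events were preceded by I/O aborts. The Python sketch below is one possible way to do that, assuming the event timestamps have already been extracted from the logs and converted to seconds. The function name `lost_access_with_prior_aborts` is hypothetical, and the default 16-second window is taken from the approximate heartbeat timeout described above, so it may need widening in practice.

```python
from bisect import bisect_left


def lost_access_with_prior_aborts(abort_times, lost_access_times, window_s=16.0):
    """Return lost-access times that have an abort within window_s before them.

    All times are in seconds on a common clock (for example, epoch seconds).
    """
    aborts = sorted(abort_times)
    correlated = []
    for lost_at in lost_access_times:
        # Index of the first abort at or after the start of the window.
        index = bisect_left(aborts, lost_at - window_s)
        # That abort counts only if it happened no later than the lost-access event.
        if index < len(aborts) and aborts[index] <= lost_at:
            correlated.append(lost_at)
    return correlated


# Illustrative timestamps only, not taken from a real log.
abort_times = [100.0, 101.5, 400.0]
lost_access_times = [116.0, 300.0]
print(lost_access_with_prior_aborts(abort_times, lost_access_times))
```

In this made-up example only the event at 116.0 is reported, because the event at 300.0 has no abort in the preceding 16 seconds.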

Resolution

Engage storage and fabric vendor support to investigate the cause of the I/O aborts.

Additional Information