VM Unresponsive Due to Datastore Connectivity and I/O Stalls

Article ID: 410368

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • The VM became unresponsive and appeared hung.

  • vmware.log confirms a hard CPU reset due to prolonged I/O stalls:

2###-0#-0#T10:32:01.461Z In(05) vcpu-0 - Checkpoint_Unstun: vm stopped for 6948 us
2###-0#-0#T10:32:01.461Z In(05) vcpu-0 - CPU reset: hard (mode Emulation)
2###-0#-0#T10:32:01.461Z In(05) vcpu-1 - CPU reset: hard (mode Emulation)

  • The /var/run/log/hostd.log file shows entries similar to the following, reporting lost access to the datastore:

    2###-0#-0#T10:00:04.273Z info hostd[4277801] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 350617 : Lost access to volume 5f3a7daf-########-####-0090fada2b68 (Datastore1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
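To check how often and for which datastores this event occurs, the hostd.log entries can be filtered with a short script. The following is a minimal Python sketch, not a supported tool. The function name find_lost_access_events is hypothetical, and the pattern assumes the event wording matches the excerpt above, which may differ between ESXi builds.

```python
import re

# Matches the hostd event text shown in this article. Assumption: the wording
# "Lost access to volume <uuid> (<name>)" is the same on the host being checked.
LOST_ACCESS = re.compile(
    r"^(?P<ts>\S+)\s.*Lost access to volume (?P<uuid>\S+) \((?P<name>[^)]+)\)"
)


def find_lost_access_events(lines):
    """Return (timestamp, volume_uuid, datastore_name) for each matching line."""
    events = []
    for line in lines:
        match = LOST_ACCESS.search(line.strip())
        if match:
            events.append(
                (match.group("ts"), match.group("uuid"), match.group("name"))
            )
    return events
```

A typical use would be to pass an open file handle for a copy of hostd.log, for example find_lost_access_events(open("hostd.log")), and compare the returned timestamps with the time the VM became unresponsive.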

Environment

VMware ESXi 7.x

Cause

The VM hang was caused by transient storage path instability that triggered:

  • SCSI command aborts and retries
  • VMFS heartbeat timeouts indicating temporary datastore access loss
  • FCoE fabric-level anomalies (invalid RPI retries, abort failures)
  • HBA/driver command aborts at link/session level

The prolonged I/O stalls at the storage layer propagated to the VM, causing the ESXi scheduler to issue CPU hard resets.

Validation

  • VMFS heartbeat logs confirm timeout and later recovery:

2###-0#-0#T10:00:04.272Z: [vmfsCorrelator] 52944128166192us: [esx.problem.vmfs.heartbeat.timedout] 5f3a7daf-########-####-0090fada2b68 Datastore1
2###-0#-0#T10:00:04.278Z: [vmfsCorrelator] 52943286477311us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume 5f3a7daf-########-####-0090fada2b68 (Datastore1): [Timeout] [HB state abcdef02 offset 4050944 gen 46153 stampUS 52943286477004 uuid 6596cc43-########-####-0090fada2b28 jrnl <FB 6792000> drv 14.81]

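  • Non-zero SCSI host status codes in vmkernel.log (H:0x5, H:0x8, H:0xc; the excerpts below also show H:0x2) indicate that commands failed at the host adapter and transport level, not because of guest OS faults.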

2###-0#-0#T10:00:01.902Z cpu41:2098225)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.###" state in doubt; requested fast path state update...

2###-0#-0#T10:00:04.272Z cpu67:2098227)NMP: nmp_ThrottleLogForDevice:3867: Cmd 0x89 (0x45d92bc98388, 21894817) to dev "naa.###" on path "vmhba2:C0:T2:L0" Failed:
2###-0#-0#T10:00:04.272Z cpu67:2098227)NMP: nmp_ThrottleLogForDevice:3875: H:0x8 D:0x0 P:0x0 . Act:EVAL. cmdId.initiator=0x430794e64800 CmdSN 0x27d5a72

2###-0#-0#T10:15:53.511Z cpu49:2098225)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.###" state in doubt; requested fast path state update...
2###-0#-0#T10:15:53.511Z cpu49:2098225)ScsiDeviceIO: 4124: Cmd(0x45d91b3a9fc8) 0x2a, CmdSN 0x359 from world 21827278 to dev "naa.####" failed H:0x2 D:0x8 P:0x0

2###-0#-0#T10:15:53.525Z cpu49:2098225)NMP: nmp_ThrottleLogForDevice:3867: Cmd 0x2a (0x45d91b3a9fc8, 21827278) to dev "naa.####" on path "vmhba2:C0:T2:L1" Failed:
2###-0#-0#T10:15:53.525Z cpu49:2098225)NMP: nmp_ThrottleLogForDevice:3875: H:0xc D:0x0 P:0x0 . Act:NONE. cmdId.initiator=0x430bf5453780 CmdSN 0x359
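The H:/D:/P: triplets in these entries can be extracted and tallied to see which host status values dominate during the incident window. The following is a minimal Python sketch. The names HOST_STATUS, parse_status and count_host_errors are hypothetical, and the status-name table is an assumption based on the commonly documented vmkernel host status values; confirm it against the SCSI sense code documentation for the ESXi release in use.

```python
import re
from collections import Counter

# Assumption: commonly documented vmkernel SCSI host status values.
# Only a subset is listed; anything else is reported as UNKNOWN.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x8: "RESET",
    0xC: "RETRY",
}

# Matches the "H:0x8 D:0x0 P:0x0" triplet as printed by NMP and ScsiDeviceIO.
STATUS = re.compile(r"H:0x([0-9a-fA-F]+) D:0x([0-9a-fA-F]+) P:0x([0-9a-fA-F]+)")


def parse_status(line):
    """Return host/device/plugin status as integers, or None if absent."""
    match = STATUS.search(line)
    if match is None:
        return None
    host, device, plugin = (int(value, 16) for value in match.groups())
    return {
        "host": host,
        "device": device,
        "plugin": plugin,
        "host_name": HOST_STATUS.get(host, "UNKNOWN"),
    }


def count_host_errors(lines):
    """Count lines per host status name, ignoring lines with H:0x0."""
    counts = Counter()
    for line in lines:
        status = parse_status(line)
        if status is not None and status["host"] != 0:
            counts[status["host_name"]] += 1
    return counts
```

A high count of non-zero host status values with D:0x0 is consistent with the article's conclusion that the failures originate below the guest, but the tally is only an aid to triage and does not replace vendor analysis.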

  • FCoE logs confirm abort failures and invalid RPI conditions.

2###-0#-0#T10:00:04.480Z cpu29:2097576)brcmfcoe: lpfc_handle_status:5079: 3:(0):3271: FCP cmd x89 failed <2/0> sid x011852, did x012c01, oxid xa13 iotag x54c SCSI Chk Cond - 0xe: Data(x2:xe:x1d:x0)

2###-0#-0#T10:00:08.327Z cpu59:21888071)WARNING: brcmfcoe: lpfc_sli_issue_abort:10358: 1:(0):3169 Abort failed: Abort INP: Data: xb43 x67c x1004 x98

2###-0#-0#T10:00:08.327Z cpu3:2098115)brcmfcoe: lpfc_handle_status:5079: 1:(0):3271: FCP cmd x89 failed <0/0> sid x011801, did x012f01, oxid xb43 iotag x67c Abort Requested Host Abort Req

2###-0#-0#T10:15:53.511Z cpu3:2546250)brcmfcoe: lpfc_handle_status:5079: 1:(0):3271: FCP cmd x2a failed <2/1> sid x011801, did x012c01, oxid x921 iotag x45a Invalid RPI Host Retry

2###-0#-0#T10:47:59.109Z cpu35:2098129)brcmfcoe: lpfc_sli4_async_fip_evt:5721: 2:(0):2788 FCF param modified event, evt_tag:xc998d9, index:x0
2###-0#-0#T10:48:00.231Z cpu21:2098115)brcmfcoe: lpfc_els_unsol_buffer:7818: 1:(0):3717 LOGO received from NPORT x12c01 state x7 Data: x20 x800220 x11801 x11801

2###-0#-0#T10:48:00.235Z cpu3:20622937)brcmfcoe: lpfc_handle_status:5079: 1:(0):3271: FCP cmd x2a failed <2/1> sid x011801, did x012c01, oxid x8d0 iotag x409 Invalid RPI Host Retry
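When preparing data for the storage and fabric vendors, it can help to summarise the brcmfcoe failures by destination ID and failure text. The following is a minimal Python sketch. The function name summarise_fcp_failures is hypothetical, and the pattern assumes the lpfc_handle_status message layout matches the excerpts above; other driver versions may format the line differently.

```python
import re
from collections import Counter

# Assumption: layout of the brcmfcoe lpfc_handle_status message as shown in
# this article. The trailing free text is kept verbatim as the reason.
FCP_FAIL = re.compile(
    r"brcmfcoe: lpfc_handle_status:\d+: .*?"
    r"FCP cmd x(?P<cmd>[0-9a-f]+) failed <\d+/\d+> "
    r"sid x(?P<sid>[0-9a-f]+), did x(?P<did>[0-9a-f]+), "
    r"oxid x(?P<oxid>[0-9a-f]+) iotag x(?P<iotag>[0-9a-f]+) "
    r"(?P<reason>.+)$"
)


def summarise_fcp_failures(lines):
    """Count failed FCP commands per (destination ID, reason text)."""
    summary = Counter()
    for line in lines:
        match = FCP_FAIL.search(line.strip())
        if match:
            summary[(match.group("did"), match.group("reason"))] += 1
    return summary
```

Grouping by destination ID shows whether the failures cluster on one target port, which is useful context to include when opening the vendor case.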

Resolution

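Engage the storage and fabric vendors to investigate the storage path instability, including the SCSI command aborts, the FCoE abort failures and the invalid RPI conditions.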

Additional Information