Vitrual Machine freezes for few minutes when the datastore on which vmx file resides goes into APD state.

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms :

Login to SSH session for the impacted host. Navigating to affected VM confirms that the VMDKs are residing on datastore1 and datastore2
less vmfs/volumes/DATASTORE-UUID/VM_NAME/VM_NAME.vmx
scsi0:0.fileName = "/vmfs/volumes/DATASTORE1-UUID/VM_NAME/VM_NAME.vmdk"
scsi0:1.fileName = "vmfs/volumes/DATASTORE2-UUID/VM_NAME/VM_NAME.vmdk"
Running the below command confirms that the VMX file resides on datastore3
esxcli vm process list
/vmfs/volumes/DATASTORE3-UUID/VM_NAME/VM_NAME.vmx
Virtual machine freeze for few minutes.
The datastore 3 reports that it entered all paths down state during the same time.

Environment

VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Cause

The datastore hosting the VMX file dropped write commands and the entered an All Paths Down (APD) state.
At the same time vSphere HA tried to migrate the VM which resulted in VM freeze.

Cause validation:

The following sequence of events (timestamps are examples) demonstrates the behavior:

At 19:30:01 and 19:30:12 (UTC):
- The var/run/log/hostd.log file confirmed, vSphere HA task created.
  2025-12-11T19:30:31.036Z In(166) Hostd[2100543]: [Originator@6876 sub=Vimsvc.TaskManager opID=578846f6-64-4a77 sid=528d987b user=vpxuser:####\##-###-###########-#] Task Created : haTask-ha-host-vim.host.NetworkSystem.queryNetworkHint-1462936878
- The var/run/log/vmkernel.log file reported write operation failures on the affected datastore with multiple SCSI error codes, including H:0x5 and D:0x8. At this point, the datastore was not fully offline, so the ESXi host continues attempting to access and update the VMX file.
  2025-12-11T19:30:01.146Z In(182) vmkernel: cpu136:2099445)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x2a (0x45db2d3c7600, 2097352) to dev "naa.################################" on path "vmhba5:C0:T0:L3" Failed:
  2025-12-11T19:30:01.146Z In(182) vmkernel: cpu136:2099445)NMP: nmp_ThrottleLogForDevice:3898: H:0x5 D:0x0 P:0x0 . Act:EVAL. cmdId.initiator=0x430be3480240 CmdSN 0x11252d
  2025-12-11T19:30:12.105Z In(182) vmkernel: cpu64:2099444)ScsiDeviceIO: 4633: Cmd(0x45bb72b30240) 0x2a, CmdSN 0x11252e from world 2097352 to dev "naa.################################" failed H:0x2 D:0x8 P:0x0
At 19:30:34 and 19:30:35 (UTC):
- The var/run/log/vmkernel.log file reported that the affected datastore entered into APD state, But host will try to access the datastore for next 140 seconds until the timer expires. During this time host will try to fetch the VMX on the affected datastore to update the metadata changes.
  
  2025-12-11T19:30:34.319Z In(182) vmkernel: cpu54:2098414)StorageApdHandlerEv: 106: Device or filesystem with identifier [naa.################################] has entered the All Paths Down state.
- The var/run/log/hostd.log file confirmed that the "datastore change received" and the "datastore has gone into disconnected state".
  2025-12-11T19:30:35.327Z In(166) Hostd[2100543]: [Originator@6876 sub=Vslmsvc.DatastoreStateListener] Received Datastore Change
  2025-12-11T19:30:35.327Z In(166) Hostd[2100543]: [Originator@6876 sub=Vslmsvc.DatastoreStateListener] Datastore /vmfs/volumes/########-########-####-############/ has gone into disconnected state
  2025-12-11T19:30:35.328Z In(166) Hostd[2100543]: [Originator@6876 sub=Vslmsvc.DatastoreStateListener] Catalog on datastore /vmfs/volumes/########-########-####-############ has lost connectivity
At 19:32:57 (UTC):
- The var/run/log/hostd.log file confirmed the heartbeat of the vmx set to red.
  2025-12-11T19:32:57.325Z In(166) Hostd[2100543]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/########-########-####-############/VM_NAME/VM_NAME.vmx] Setting heartbeat to red; Heartbeat (in 154s): expected=154 (yellow<=80%, red<=40%), actual=10 (6.49351%)
  2025-12-11T19:32:57.325Z Db(167) Hostd[2100543]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/########-########-####-############/VM_NAME/VM_NAME.vmx] Updating current heartbeatStatus: green -> red
At 19:32:54 and 19:33:27(UTC):
- The var/run/log/vmkernel.log file reported that the datastore marked as All Paths Down (APD) timeout. At this state, when the timer expires then vmx stops blocking operations.
  2025-12-11T19:32:54.318Z In(182) vmkernel: cpu66:2098414)StorageApdHandlerEv: 120: Device or filesystem with identifier [naa.################################] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will $
- The var/run/log/hostd.log file confirmed that the vmx file heartbeat set back to green.
  2025-12-11T19:33:27.325Z Db(167) Hostd[2100541]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/########-########-####-############/VM_NAME/VM_NAME.vmx] Updating current heartbeatStatus: red -> green
If ESXi kept trying after timer expires, then
- - VM would hang for hours
  - Guest would be unusable.

Resolution

Involve storage vendor to investigate on the "APD" reported.

Vitrual Machine freezes for few minutes when the datastore on which vmx file resides goes into APD state.

Article ID: 422661

Updated On:

Products

Issue/Introduction

Symptoms :

Environment

Cause

Cause validation:

Resolution

Feedback