Vitrual Machine freezes for few minutes when the datastore on which vmx file resides goes into APD state.
search cancel

Vitrual Machine freezes for few minutes when the datastore on which vmx file resides goes into APD state.

book

Article ID: 422661

calendar_today

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms : 

  • Login to SSH session for the impacted host. Navigating to affected VM confirms that the VMDKs are residing on datastore1 and datastore2
    less vmfs/volumes/DATASTORE-UUID/VM_NAME/VM_NAME.vmx
    scsi0:0.fileName = "/vmfs/volumes/DATASTORE1-UUID/VM_NAME/VM_NAME.vmdk"   
    scsi0:1.fileName = "vmfs/volumes/DATASTORE2-UUID/VM_NAME/VM_NAME.vmdk"   

  • Running the below command confirms that the VMX file resides on datastore3
    esxcli vm process list
    /vmfs/volumes/DATASTORE3-UUID/VM_NAME/VM_NAME.vmx  
  • Virtual machine freeze for few minutes. 
  • The datastore 3 reports that it entered all paths down state during the same time. 

Environment

  • VMware vSphere ESXi 7.x
  • VMware vSphere ESXi 8.x

Cause

  • The datastore hosting the VMX file dropped write commands and the entered an All Paths Down (APD) state.
  • At the same time vSphere HA tried to migrate the VM which resulted in VM freeze. 

Cause validation: 

The following sequence of events (timestamps are examples) demonstrates the behavior:
  • At 19:30:01 and 19:30:12 (UTC): 
    • The var/run/log/hostd.log file confirmed, vSphere HA task created.
      2025-12-11T19:30:31.036Z In(166) Hostd[2100543]: [Originator@6876 sub=Vimsvc.TaskManager opID=578846f6-64-4a77 sid=528d987b user=vpxuser:####\##-###-###########-#] Task Created : haTask-ha-host-vim.host.NetworkSystem.queryNetworkHint-1462936878
    • The var/run/log/vmkernel.log file reported write operation failures on the affected datastore with multiple SCSI error codes, including H:0x5 and D:0x8. At this point, the datastore was not fully offline, so the ESXi host continues attempting to access and update the VMX file. 
      2025-12-11T19:30:01.146Z In(182) vmkernel: cpu136:2099445)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x2a (0x45db2d3c7600, 2097352) to dev "naa.################################" on path "vmhba5:C0:T0:L3" Failed:
      2025-12-11T19:30:01.146Z In(182) vmkernel: cpu136:2099445)NMP: nmp_ThrottleLogForDevice:3898: H:0x5 D:0x0 P:0x0 . Act:EVAL. cmdId.initiator=0x430be3480240 CmdSN 0x11252d
      2025-12-11T19:30:12.105Z In(182) vmkernel: cpu64:2099444)ScsiDeviceIO: 4633: Cmd(0x45bb72b30240) 0x2a, CmdSN 0x11252e from world 2097352 to dev "naa.################################" failed H:0x2 D:0x8 P:0x0

  • At 19:30:34 and 19:30:35 (UTC):
    • The var/run/log/vmkernel.log file reported that the affected datastore entered into APD state, But host will try to access the datastore for next 140 seconds until the timer expires. During this time host will try to fetch the VMX on the affected datastore to update the metadata changes.
      2025-12-11T19:30:34.319Z In(182) vmkernel: cpu54:2098414)StorageApdHandlerEv: 106: Device or filesystem with identifier [naa.################################] has entered the All Paths Down state.
    • The var/run/log/hostd.log file confirmed that the "datastore change received" and the "datastore has gone into disconnected state".
      2025-12-11T19:30:35.327Z In(166) Hostd[2100543]: [Originator@6876 sub=Vslmsvc.DatastoreStateListener] Received Datastore Change
      2025-12-11T19:30:35.327Z In(166) Hostd[2100543]: [Originator@6876 sub=Vslmsvc.DatastoreStateListener] Datastore /vmfs/volumes/########-########-####-############/ has gone into disconnected state
      2025-12-11T19:30:35.328Z In(166) Hostd[2100543]: [Originator@6876 sub=Vslmsvc.DatastoreStateListener] Catalog on datastore /vmfs/volumes/########-########-####-############ has lost connectivity

  • At 19:32:57 (UTC): 
    • The var/run/log/hostd.log file confirmed the heartbeat of the vmx set to red.
      2025-12-11T19:32:57.325Z In(166) Hostd[2100543]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/########-########-####-############/VM_NAME/VM_NAME.vmx] Setting heartbeat to red; Heartbeat (in 154s): expected=154 (yellow<=80%, red<=40%), actual=10 (6.49351%)
      2025-12-11T19:32:57.325Z Db(167) Hostd[2100543]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/########-########-####-############/VM_NAME/VM_NAME.vmx] Updating current heartbeatStatus: green -> red

  • At 19:32:54 and 19:33:27(UTC):
    • The var/run/log/vmkernel.log file reported that the datastore marked as All Paths Down (APD) timeout. At this state, when the timer expires then vmx stops blocking operations.
      2025-12-11T19:32:54.318Z In(182) vmkernel: cpu66:2098414)StorageApdHandlerEv: 120: Device or filesystem with identifier [naa.################################] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will $
    • The var/run/log/hostd.log file confirmed that the vmx file heartbeat set back to green.
      2025-12-11T19:33:27.325Z Db(167) Hostd[2100541]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/########-########-####-############/VM_NAME/VM_NAME.vmx] Updating current heartbeatStatus: red -> green

  • If ESXi kept trying after timer expires, then
      • VM would hang for hours
      • Guest would be unusable.

Resolution

Involve storage vendor to investigate on the "APD" reported.