vSphere HA did not terminate VM xxxxxxxx affected by an inaccessible datastore on host xxxxxxxx in cluster xxxxxxxx in xxxxxxxx : not enough resources to restart the VM on another host
VMware vSphere 7.x
VMware vSphere 8.x
Environments with multiple hosts and VMs
Applicable to various storage configurations (local, shared, SAN, NAS)
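The impacted datastore was shared across all ESXi hosts in the cluster, and every host was affected by a transient storage connectivity issue at the same time. Because no host could reach the datastore during the outage, vSphere HA had no other host on which to restart the affected VMs, so it did not terminate them.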
Steps to validate the issue:
[root@Example-Host1:] vmkfstools -Pv10 -h /vmfs/volumes/xxxxxxxx-xxxxxxxx-1234-xxxxxxxxxxxx
VMFS-6.82 (Raw Major Version: 24) file system spanning 1 partitions.
File system label (if any): test-datastore1
Mode: public ATS-only
Capacity 20 TB, 8.2 TB available, file block size 1 MB, max supported file size 64 TB
Volume Creation Time: Thu Jun 10 05:47:48 2021
Files (max/free): 16384/15656
Ptr Blocks (max/free): 0/0
Sub Blocks (max/free): 26624/24757
Secondary Ptr Blocks (max/free): 256/255
File Blocks (overcommit/used/overcommit %): 0/12321734/0
Ptr Blocks (overcommit/used/overcommit %): 0/0/0
Sub Blocks (overcommit/used/overcommit %): 0/1867/0
Large File Blocks (total/used/file block clusters): 40960/4892/19312
Volume Metadata size: 2530082816
Disk Block Size: 512/512/0
UUID: xxxxxxxx-xxxxxxxx-4321-xxxxxxxxxxxx
Logical device: xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx
Partitions spanned (on "lvm"):
        naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1
Unable to connect to vaai-nasd socket [No such file or directory]
Is Native Snapshot Capable: NO
OBJLIB-LIB: ObjLib cleanup done.
WORKER: asyncOps=0 maxActiveOps=0 maxPending=0 maxCompleted=0
[root@Example-Host1:]
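The device backing the datastore is the identifier listed under "Partitions spanned" in the vmkfstools output, and it is this identifier that is then searched for in vmkernel.log. The following is a minimal Python sketch of that lookup, run off-host against saved command output. The function name backing_devices and the SAMPLE text are hypothetical, and the pattern assumes an NAA-style identifier as in this example; other identifier types are not handled.

```python
import re

# Assumption: the backing device is an NAA identifier followed by ":<partition>".
DEVICE_RE = re.compile(r"\b(naa\.[0-9A-Za-z]+):(\d+)")

# Hypothetical excerpt of saved "vmkfstools -Pv10 -h <volume>" output.
SAMPLE = """\
UUID: xxxxxxxx-xxxxxxxx-4321-xxxxxxxxxxxx
Logical device: xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx
Partitions spanned (on "lvm"):
        naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1
Is Native Snapshot Capable: NO
"""


def backing_devices(vmkfstools_output):
    """Return (device, partition) pairs found after 'Partitions spanned'."""
    # Only text after the "Partitions spanned" header is searched, so
    # identifiers appearing elsewhere in the output are ignored.
    _, found, tail = vmkfstools_output.partition("Partitions spanned")
    if not found:
        return []
    return [(dev, int(part)) for dev, part in DEVICE_RE.findall(tail)]


if __name__ == "__main__":
    for device, partition in backing_devices(SAMPLE):
        print(device, partition)
```

The returned device name is the string to look for in vmkernel.log on each host.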
vmkernel.log on host Example-Host1 shows APD_START, APD_TIMEOUT, and APD_EXIT events for naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12 at the time the issue was reported, which indicates a transient storage connectivity issue for that device. The same behaviour is seen on all the ESXi hosts in the cluster.
2025-04-14T20:28:55.665Z cpu12:2097744)LVM: 6273: Received APD EventType: APD_START (3) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (gen 1)
2025-04-14T20:28:55.665Z cpu12:2097744)LVM: 5861: Handling APD EventType: APD_START (3) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (unlocked, gen 1, cur apd state UNKNOWN, cur dev state 1)
2025-04-14T20:28:55.665Z cpu12:2097744)ScsiDevice: 5566: Device state of naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12 set to APD_START; token num:1
2025-04-14T20:28:55.665Z cpu12:2097744)StorageApdHandler: 1191: APD start for 0x430a40d1ac50 [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12]
2025-04-14T20:28:55.665Z cpu0:2097742)StorageApdHandler: 408: APD start event for 0x430a40d1ac50 [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12]
2025-04-14T20:28:55.666Z cpu0:2097742)StorageApdHandlerEv: 110: Device or filesystem with identifier [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12] has entered the All Paths Down state.
2025-04-14T20:28:55.666Z cpu4:2098189)WARNING: NMP: nmpDeviceAttemptFailover:722: Retry world failover device "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12" - failed to issue command due to Not found (APD), try again...
2025-04-14T20:31:15.668Z cpu18:2097742)LVM: 6273: Received APD EventType: APD_TIMEOUT (5) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (gen 2)
2025-04-14T20:31:15.668Z cpu18:2097742)LVM: 5861: Handling APD EventType: APD_TIMEOUT (5) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (unlocked, gen 2, cur apd state APD_START, cur dev state 1)
2025-04-14T20:31:15.668Z cpu18:2097742)StorageApdHandler: 606: APD timeout event for 0x430a40d1ac50 [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12]
2025-04-14T20:31:15.668Z cpu18:2097742)StorageApdHandlerEv: 126: Device or filesystem with identifier [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will $
2025-04-14T20:34:29.726Z cpu63:169873824)LVM: 6273: Received APD EventType: APD_EXIT (4) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (gen 3)
2025-04-14T20:34:29.726Z cpu63:169873824)LVM: 5861: Handling APD EventType: APD_EXIT (4) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (unlocked, gen 3, cur apd state APD_TIMEOUT, cur dev state 1)
2025-04-14T20:34:29.726Z cpu63:169873824)ScsiDevice: 5620: Device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12 is Out of APD; token num:1
2025-04-14T20:38:25.408Z cpu37:2097471)LVM: 5861: Handling APD EventType: APD_EXIT (4) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (locked, gen 3, cur apd state APD_TIMEOUT, cur dev state 1)
2025-04-14T20:40:43.697Z cpu5:2097742)StorageApdHandler: 501: APD exit event for 0x430a40d1ac50 [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12, 0]
2025-04-14T20:40:43.697Z cpu5:2097742)StorageApdHandlerEv: 117: Device or filesystem with identifier [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12] has exited the All Paths Down state.
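To compare the APD window across hosts, the LVM "Received APD EventType" entries can be extracted from a copy of each host's vmkernel.log and paired up. The following is a minimal Python sketch, not a supported tool. It assumes one log entry per line and the entry layout shown in this article. The names apd_events and apd_windows are hypothetical, and SAMPLE is an abbreviated copy of the entries above.

```python
import re
from datetime import datetime

# Assumption: entries look like the LVM "Received APD EventType" lines in this article.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z)\s.*?"
    r"Received APD EventType: (?P<event>APD_[A-Z]+) \(\d+\) "
    r"for device <(?P<dev>[^:>]+)"
)

SAMPLE = """\
2025-04-14T20:28:55.665Z cpu12:2097744)LVM: 6273: Received APD EventType: APD_START (3) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (gen 1)
2025-04-14T20:31:15.668Z cpu18:2097742)LVM: 6273: Received APD EventType: APD_TIMEOUT (5) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (gen 2)
2025-04-14T20:34:29.726Z cpu63:169873824)LVM: 6273: Received APD EventType: APD_EXIT (4) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (gen 3)
"""


def apd_events(log_text):
    """Return (timestamp, event, device) tuples for matching log lines."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
            events.append((ts, m["event"], m["dev"]))
    return events


def apd_windows(events):
    """Pair each APD_START with the next APD_EXIT for the same device.

    Returns (device, start, exit, seconds, timed_out) tuples.
    """
    open_windows = {}  # device -> [start timestamp, timeout seen?]
    windows = []
    for ts, event, dev in events:
        if event == "APD_START":
            open_windows.setdefault(dev, [ts, False])
        elif event == "APD_TIMEOUT" and dev in open_windows:
            open_windows[dev][1] = True
        elif event == "APD_EXIT" and dev in open_windows:
            start, timed_out = open_windows.pop(dev)
            windows.append((dev, start, ts, (ts - start).total_seconds(), timed_out))
    return windows


if __name__ == "__main__":
    for dev, start, end, seconds, timed_out in apd_windows(apd_events(SAMPLE)):
        print(dev, start.time(), end.time(), seconds, "timeout" if timed_out else "no timeout")
```

For the entries above, this pairs the 20:28:55 APD_START with the 20:34:29 APD_EXIT, a window of roughly 334 seconds that includes an APD timeout. Running the same check on each host's log shows whether the windows overlap across the cluster.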
KB 371663 explains that in vSphere environments, High Availability (HA) may not restart all virtual machines (VMs) when an ESXi host experiences an outage.
Contact your storage vendor to investigate the APD issue.