vSphere HA did not terminate this VM affected by an inaccessible datastore: not enough resources to restart the VM on another host

Products

VMware vCenter Server

Issue/Introduction

Symptoms:

A VM was not restarted by vSphere HA during a storage connectivity issue, and the following error was logged:

vSphere HA did not terminate VM xxxxxxxx affected by an inaccessible datastore on host xxxxxxxx in cluster xxxxxxxx in xxxxxxxx : not enough resources to restart the VM on another host

Event Type Description: This event is logged when a VM affected by an inaccessible datastore in a vSphere HA cluster was not terminated.

Environment

VMware vSphere 7.x
VMware vSphere 8.x

Environments with multiple hosts and VMs
Applicable to various storage configurations (local, shared, SAN, NAS)

Cause

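The affected datastore was shared by all ESXi hosts in the cluster, and every host was affected by the same transient storage connectivity issue. Because no other host could access the datastore either, vSphere HA had no host on which to restart the VM and therefore did not terminate it.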

Steps to validate the issue:

Connect to the ESXi host, run df -h, and note the path of the affected datastore.
Using that path, identify the device backing the datastore. The device identifier is needed to search vmkernel.log for APD events and confirm whether that device experienced a transient storage condition.

[root@Example-Host1:] vmkfstools -Pv10 -h /vmfs/volumes/xxxxxxxx-xxxxxxxx-1234-xxxxxxxxxxxx
VMFS-6.82 (Raw Major Version: 24) file system spanning 1 partitions.
File system label (if any): test-datastore1
Mode: public ATS-only
Capacity 20 TB, 8.2 TB available, file block size 1 MB, max supported file size 64 TB
Volume Creation Time: Thu Jun 10 05:47:48 2021
Files (max/free): 16384/15656
Ptr Blocks (max/free): 0/0
Sub Blocks (max/free): 26624/24757
Secondary Ptr Blocks (max/free): 256/255
File Blocks (overcommit/used/overcommit %): 0/12321734/0
Ptr Blocks (overcommit/used/overcommit %): 0/0/0
Sub Blocks (overcommit/used/overcommit %): 0/1867/0
Large File Blocks (total/used/file block clusters): 40960/4892/19312
Volume Metadata size: 2530082816
Disk Block Size: 512/512/0
UUID: xxxxxxxx-xxxxxxxx-4321-xxxxxxxxxxxx
Logical device: xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx
Partitions spanned (on "lvm"):
naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1
Unable to connect to vaai-nasd socket [No such file or directory]
Is Native Snapshot Capable: NO
OBJLIB-LIB: ObjLib cleanup done.
WORKER: asyncOps=0 maxActiveOps=0 maxPending=0 maxCompleted=0
[root@Example-Host1:]
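
When this check has to be repeated on several hosts, the backing device can be pulled out of saved vmkfstools output with a short script. The following is a minimal Python sketch, not part of the original procedure. It assumes the output has been captured as text and that device entries follow the "Partitions spanned" line in the <device>:<partition> form shown above. The function name backing_devices is hypothetical.

```python
import re
from typing import List


def backing_devices(vmkfstools_output: str) -> List[str]:
    """Return the device identifiers listed under 'Partitions spanned'.

    The trailing ':<partition number>' is removed, so 'naa.xxx:1' becomes 'naa.xxx'.
    """
    devices = []
    in_section = False
    for line in vmkfstools_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Partitions spanned"):
            in_section = True
            continue
        if in_section:
            # Entries look like "<device>:<partition>"; the first line that
            # does not match this shape marks the end of the section.
            match = re.fullmatch(r"(\S+):(\d+)", stripped)
            if not match:
                break
            devices.append(match.group(1))
    return devices
```

The returned identifier is the string to search for in vmkernel.log.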

vmkernel.log on host Example-Host1 contains APD_START and APD_EXIT events for naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12, which indicates a transient storage connectivity issue on that device at the time of the reported issue. The same behaviour is seen on all ESXi hosts in the cluster.

2025-04-14T20:28:55.665Z cpu12:2097744)LVM: 6273: Received APD EventType: APD_START (3) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (gen 1)
2025-04-14T20:28:55.665Z cpu12:2097744)LVM: 5861: Handling APD EventType: APD_START (3) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (unlocked, gen 1, cur apd state UNKNOWN, cur dev state 1)
2025-04-14T20:28:55.665Z cpu12:2097744)ScsiDevice: 5566: Device state of naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12 set to APD_START; token num:1
2025-04-14T20:28:55.665Z cpu12:2097744)StorageApdHandler: 1191: APD start for 0x430a40d1ac50 [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12]
2025-04-14T20:28:55.665Z cpu0:2097742)StorageApdHandler: 408: APD start event for 0x430a40d1ac50 [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12]
2025-04-14T20:28:55.666Z cpu0:2097742)StorageApdHandlerEv: 110: Device or filesystem with identifier [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12] has entered the All Paths Down state.
2025-04-14T20:28:55.666Z cpu4:2098189)WARNING: NMP: nmpDeviceAttemptFailover:722: Retry world failover device "naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12" - failed to issue command due to Not found (APD), try again...
2025-04-14T20:31:15.668Z cpu18:2097742)LVM: 6273: Received APD EventType: APD_TIMEOUT (5) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (gen 2)
2025-04-14T20:31:15.668Z cpu18:2097742)LVM: 5861: Handling APD EventType: APD_TIMEOUT (5) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (unlocked, gen 2, cur apd state APD_START, cur dev state 1)
2025-04-14T20:31:15.668Z cpu18:2097742)StorageApdHandler: 606: APD timeout event for 0x430a40d1ac50 [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12]
2025-04-14T20:31:15.668Z cpu18:2097742)StorageApdHandlerEv: 126: Device or filesystem with identifier [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will $
2025-04-14T20:34:29.726Z cpu63:169873824)LVM: 6273: Received APD EventType: APD_EXIT (4) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (gen 3)
2025-04-14T20:34:29.726Z cpu63:169873824)LVM: 5861: Handling APD EventType: APD_EXIT (4) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (unlocked, gen 3, cur apd state APD_TIMEOUT, cur dev state 1)
2025-04-14T20:34:29.726Z cpu63:169873824)ScsiDevice: 5620: Device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12 is Out of APD; token num:1
2025-04-14T20:38:25.408Z cpu37:2097471)LVM: 5861: Handling APD EventType: APD_EXIT (4) for device <naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12:1> (locked, gen 3, cur apd state APD_TIMEOUT, cur dev state 1)
2025-04-14T20:40:43.697Z cpu5:2097742)StorageApdHandler: 501: APD exit event for 0x430a40d1ac50 [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12, 0]
2025-04-14T20:40:43.697Z cpu5:2097742)StorageApdHandlerEv: 117: Device or filesystem with identifier [naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx12] has exited the All Paths Down state.
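
To compare the APD window across every host in the cluster, the same events can be extracted from each host's vmkernel.log with a script. The following is a minimal Python sketch under stated assumptions: it reads only the LVM "Received APD EventType" lines in the format shown above, it ignores the StorageApdHandler lines, and it takes the log as a text string. The names apd_timeline and apd_duration_seconds are hypothetical.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2025-04-14T20:28:55.665Z cpu12:...)LVM: 6273: Received APD EventType: APD_START (3) for device <naa...:1> (gen 1)
APD_EVENT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z .*"
    r"Received APD EventType: (?P<event>APD_[A-Z]+) \(\d+\) "
    r"for device <(?P<device>[^:>]+)"
)


def apd_timeline(log_text, device):
    """Return (timestamp, event) tuples for one device, in log order."""
    events = []
    for line in log_text.splitlines():
        match = APD_EVENT.match(line)
        if match and match.group("device") == device:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
            events.append((ts, match.group("event")))
    return events


def apd_duration_seconds(events):
    """Seconds from the first APD_START to the first later APD_EXIT, or None."""
    start = next((ts for ts, ev in events if ev == "APD_START"), None)
    if start is None:
        return None
    end = next((ts for ts, ev in events if ev == "APD_EXIT" and ts >= start), None)
    if end is None:
        return None
    return (end - start).total_seconds()
```

If every host reports an overlapping window for the same device, that supports the conclusion that no host in the cluster could access the datastore during the event.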

Resolution

KB 371663 explains that, in vSphere environments, High Availability (HA) may not restart all virtual machines (VMs) when an ESXi host experiences an outage.

Contact your storage vendor to investigate the APD issue.

Additional Information

Refer to: vSphere HA Fails to Restart Some VMs During Host Outages