Error: "Some of the disks of the virtual machine failed to load" and VM registration failure after storage connectivity loss

Products

VMware vSphere ESXi
VMware vCenter Server

Issue/Introduction

Virtual machines (VMs) display the error: "Some of the disks of the virtual machine [VM_NAME] failed to load. The information present for them in the virtual machine configuration may be incomplete."
Attempts to remove and re-register the VM from the .vmx file fail with: "The name is already in use."
The issue typically follows a storage connectivity loss or All Paths Down (APD) event on the ESXi host.
The ESXi /var/run/log/vmkernel.log file may contain entries similar to:
YYYY-MM-DD HH:MM:SS cpu1:2049)WARNING: NMP: nmp_IssueCommandToDevice:2954:I/O could not be issued to device "naa.60##############3" due to Not found
YYYY-MM-DD HH:MM:SS cpu1:2049)WARNING: NMP: nmp_DeviceRetryCommand:133:Device "naa.60##############3": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
YYYY-MM-DD HH:MM:SS cpu1:2049)WARNING: NMP: nmp_DeviceStartLoop:721:NMP Device "naa.60##############3" is blocked. Not starting I/O from device.
YYYY-MM-DD HH:MM:SS cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:599:Retry world failover device "naa.60##############3" - issuing command 0x4124007ba7c0
YYYY-MM-DD HH:MM:SS cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:658:Retry world failover device "naa.60##############3" - failed to issue command due to Not found (APD), try again...
YYYY-MM-DD HH:MM:SS cpu1:2642)WARNING: NMP: nmpDeviceAttemptFailover:708:Logical device "naa.60##############3": awaiting fast path state update...
YYYY-MM-DD HH:MM:SS fdm Db(###) Fdm[########]: [Originator@#### sub=Invt opID=placementService.cpp:###-########] Host host-####### cannot access VM's home datastore: /vmfs/volumes/########-########
The ESXi /var/run/log/vobd.log file may contain an entry similar to:

YYYY-MM-DD HH:MM:SS [APDCorrelator] 2682686563317us: [esx.problem.storage.apd.timeout] Device or filesystem with identifier [########-########] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
The following event message is received when a storage device connected to the ESXi host enters the All Paths Down (APD) state:

YYYY-MM-DD HH:MM:SS cpu4:8598)StorageApdHandler: 692: APD Handle Created with lock.

Note:

The log messages above indicate that the system experienced an APD event; they do not mean that it is currently in the APD state. This message is seen at host boot time.
The messages indicate that the system has turned on a timer that allows the ESXi host to continue retrying attempts to re-establish connectivity with the device for a limited time period.
By default, the APD timeout is set to 140 seconds.
The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on the environment.
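
To check whether a host logged an APD timeout, the vobd.log entry shown above can be searched for directly. The following is a minimal sketch, not a VMware-provided tool: the function name apd_timeout_ids is hypothetical, and it assumes the log line keeps the "identifier [...]" wording shown in the excerpt.

```shell
#!/bin/sh
# Sketch: print the unique device/filesystem identifiers that reached the
# APD timeout state, based on the vobd.log message format quoted above.
# apd_timeout_ids is a hypothetical helper name (an assumption, not an ESXi command).
apd_timeout_ids() {
    # $1: path to the log file, for example /var/run/log/vobd.log
    grep -F 'esx.problem.storage.apd.timeout' "$1" |
        sed -n 's/.*identifier \[\([^]]*\)\].*/\1/p' |
        sort -u
}

# Usage on an ESXi host (shown as a comment so that nothing runs on load):
#   apd_timeout_ids /var/run/log/vobd.log
```

An empty result means only that no timeout entry is present in that file; rotated logs are not covered by this sketch.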

Environment

VMware vSphere ESXi 8.x
VMware vCenter Server 8.x

Cause

This issue occurs when a storage connectivity loss (APD condition) causes the ESXi host to retain stale storage "worlds" (processes) and file locks. Even after storage connectivity is restored, these orphaned processes prevent the host from correctly reading the VM configuration files or allowing a new registration of the same VM name.

Resolution

Follow these steps to manually clear the stale processes:

1. Run the following command on the ESXi host to find the device ID for the affected datastore:

esxcfg-scsidevs -m

2. Using the device ID (naa.xxxx) identified in step 1, list the running processes (worlds) associated with that device:

esxcli storage core device world list -d naa.##############

3. Identify the World ID for the affected VM and manually kill the process:

kill -9 [WorldID]

4. Rescan the storage adapters and refresh the VMFS volumes to clear any remaining stale locks:

esxcli storage core adapter rescan --all
vmkfstools -V

5. Remove the orphaned VM from the vCenter inventory (if still present) and re-register the VM using the .vmx file.

6. If the above steps to kill the process fail, all affected ESXi hosts may require a reboot to remove any residual references to the affected devices that are in an APD state.
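
Steps 1 through 3 involve reading two command outputs by eye. The following sketch shows one way to filter them. The helper names device_for_datastore and worlds_matching are hypothetical. The column layouts they rely on are assumptions and should be checked against the actual output on the host: device:partition in the first column and datastore name in the last column for esxcfg-scsidevs -m; Device, World ID, Open Count, World Name for the world list. The sketch only prints candidates and deliberately does not kill anything.

```shell
#!/bin/sh
# device_for_datastore: read `esxcfg-scsidevs -m` output on stdin and print the
# device ID backing the datastore named in $1.
# Assumption: column 1 is <device>:<partition>, the last column is the
# datastore name, and the name contains no spaces.
device_for_datastore() {
    awk -v ds="$1" '$NF == ds { sub(/:[0-9]+$/, "", $1); print $1 }'
}

# worlds_matching: read `esxcli storage core device world list` output on stdin
# and print the World IDs whose World Name contains the string in $1.
# Assumption: columns are Device, World ID, Open Count, World Name.
# Header and separator rows are skipped because column 2 is not numeric there.
worlds_matching() {
    awk -v pat="$1" '$2 ~ /^[0-9]+$/ {
        name = $4
        for (i = 5; i <= NF; i++) name = name " " $i   # rejoin names with spaces
        if (index(name, pat)) print $2
    }'
}

# Usage on an ESXi host (datastore1 and MyVM are placeholder names):
#   dev=$(esxcfg-scsidevs -m | device_for_datastore datastore1)
#   esxcli storage core device world list -d "$dev" | worlds_matching MyVM
```

Review the printed World IDs before running the kill command from step 3, since other worlds may also hold the device open.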

Reference: All Paths Down for a storage device