Virtual machines experience performance issues and vMotion failures.

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

Virtual machines experience performance issues.
Virtual machines may enter a hung state when vMotion is initiated, with the following errors:

In /var/run/log/vmkwarning.log

2025-12-03T23:50:33.006Z Wa(180) vmkwarning: cpu61:50092835)WARNING: Migrate: 7074:#######124 S: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.
2025-12-03T23:50:33.006Z Wa(180) vmkwarning: cpu61:50092835)WARNING: Migrate: 257: #######124 S:Failed: Migration determined a failure by the VMX #######
2025-12-03T23:50:33.568Z Wa(180) vmkwarning: cpu25:50096377)WARNING: Migrate: 8504: ######124 S: Migration failed. Updating the stat- 34 anyway.

vMotion fails with the following messages in /var/run/log/hostd.log

2025-12-03T23:50:33.570Z In(166) Hostd[2100157]: [Originator@6876 sub=Vcsvc.VMotionSrc.#######124] ResolveCb: VMX reports needsUnregister = false for migrateType MIGRATE_TYPE_VMOTION
2025-12-03T23:50:33.570Z In(166) Hostd[2100157]: [Originator@6876 sub=Vcsvc.VMotionSrc.#######124] ResolveCb: Failed with fault: (vim.fault.GenericVmConfigFault) {
2025-12-03T23:50:33.570Z In(166) Hostd[2099769]: --> faultMessage = (vmodl.LocalizableMessage) [
2025-12-03T23:50:33.570Z In(166) Hostd[2099769]: --> (vmodl.LocalizableMessage) {
2025-12-03T23:50:33.570Z In(166) Hostd[2099769]: --> key = "msg.checkpoint.migration.maxSwitchoverTimeExceeded",
2025-12-03T23:50:33.570Z In(166) Hostd[2099769]: --> arg = (vmodl.KeyAnyValue) [
2025-12-03T23:50:33.570Z In(166) Hostd[2099769]: --> (vmodl.KeyAnyValue) {
2025-12-03T23:50:33.570Z In(166) Hostd[2099769]: --> key = "1",
2025-12-03T23:50:33.570Z In(166) Hostd[2099769]: --> value = "100"
2025-12-03T23:50:33.570Z In(166) Hostd[2099769]: --> }
2025-12-03T23:50:33.570Z In(166) Hostd[2099769]: --> ],
2025-12-03T23:50:33.570Z In(166) Hostd[2099769]: --> message = "The migration has exceeded the maximum switchover time of 100 second(s). ESX has preemptively failed the migration to allow the VM to continue running on the source. To avoid this failure, either increase the maximum allowable switchover time or wait until the VM is performing a less intensive workload.

Multiple ESXi hosts may experience the vMotion issues.
The ESXi hosts log the following SCSI status codes in /var/run/log/vmkernel.log:

1. SCSI host status H:0x5 - Indicates that the driver had to abort commands in flight to the target.
2. Messages containing "state in doubt; requested fast path state update" may also be logged.

2025-12-03T23:50:28.526Z In(182) vmkernel: cpu46:8162812)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x28 (0x#####, 2099672) to dev "naa.#####################0025" on path "vmhba64:C1:T10:L25" Failed:
2025-12-03T23:50:28.526Z In(182) vmkernel: cpu46:8162812)NMP: nmp_ThrottleLogForDevice:3898: H:0x5 D:0x0 P:0x0 . Act:EVAL. cmdId.initiator=0x####### CmdSN 0x4
2025-12-03T23:50:28.526Z Wa(180) vmkwarning: cpu46:8162812)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:235: NMP device "naa.#################0025" state in doubt; requested fast path state update...
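
When triaging a host, it can help to tally these messages by host status and by device. The following is a minimal Python sketch, not a VMware tool. The function name `summarize` and the `SAMPLE` lines (abbreviated from the excerpts above, with identifiers still masked) are illustrative assumptions. It counts the H: status values and the devices reported as "state in doubt".

```python
import re
from collections import Counter

# Status triplet printed for a failed SCSI command, e.g. "H:0x5 D:0x0 P:0x0".
STATUS_RE = re.compile(r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)")
# NMP "state in doubt" warning; captures the device identifier.
DOUBT_RE = re.compile(r'NMP device "([^"]+)" state in doubt')

# Abbreviated sample lines (assumption: shortened copies of the excerpts above).
SAMPLE = [
    "vmkernel: cpu46:8162812)NMP: nmp_ThrottleLogForDevice:3898: "
    "H:0x5 D:0x0 P:0x0 . Act:EVAL.",
    'vmkwarning: cpu46:8162812)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:235: '
    'NMP device "naa.####0025" state in doubt; requested fast path state update...',
]


def summarize(lines):
    """Return (host status counts, 'state in doubt' counts per device)."""
    host_status = Counter()
    in_doubt = Counter()
    for line in lines:
        status = STATUS_RE.search(line)
        if status:
            host_status[status.group(1).lower()] += 1
        doubt = DOUBT_RE.search(line)
        if doubt:
            in_doubt[doubt.group(1)] += 1
    return host_status, in_doubt


if __name__ == "__main__":
    statuses, devices = summarize(SAMPLE)
    print("Host status counts:", dict(statuses))
    print("Devices in doubt:", dict(devices))
```

To run it against a real log, pass the open file object for /var/run/log/vmkernel.log instead of `SAMPLE`. A high count of 0x5 concentrated on one device points at that device or its paths rather than at the host.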

Environment

VMware vSphere ESXi 8.x

Cause

This issue is caused by underlying storage problems. A problematic storage device can cause vMotion failures and can consume excessive hostd (ESXi host management service) resources, leading to VM operational issues.

/var/run/log/vmkernel.log and /var/run/log/vobd.log report Power-on Resets, and the logs indicate that this may be due to a storage problem.

2025-12-03T23:50:28.526Z In(182) vmkernel: cpu57:2098108)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x28 (0x#####, #####) to dev "naa.###################0025" on path "vmhba64:C1:T10:L25" Failed:
2025-12-03T23:50:28.526Z In(182) vmkernel: cpu57:2098108)NMP: nmp_ThrottleLogForDevice:3898: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0. Act:NONE. cmdId.initiator=0x##### CmdSN 0x5

2025-11-29T09:06:07.404Z In(14) vobd[2097812]: [scsiCorrelator] 10400609837591us: [vob.scsi.scsipath.por] Power-on Reset occurred on naa.#############0025
2025-11-29T09:06:07.414Z In(14) vobd[2097812]: [scsiCorrelator] 10401078765499us: [esx.problem.storage.connectivity.devicepor] Frequent PowerOn Reset Unit Attentions are occurring on device naa.###############0025. This may indicate a storage problem. Affected datastores: "###########".

ATS failures are seen in /var/run/log/vmkernel.log. ATS I/O failures can degrade the hostd service of the ESXi host, causing operations such as vMotion and snapshots to fail.

2025-12-04T10:45:47.956Z ####### vmkernel: cpu32:2097961)ScsiDeviceIO: 4686: Cmd(0x########) 0x89, CmdSN 0xe3 from world ##### to dev "naa.###############0025" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0
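
The three values after "Valid sense data:" are the sense key, additional sense code (ASC), and additional sense code qualifier (ASCQ). The following minimal Python sketch decodes them. `decode_sense` is a hypothetical helper name, and the lookup tables are deliberately limited to the two combinations that appear in this article, using their standard SCSI meanings. Anything else is reported as unknown rather than guessed.

```python
import re

# Sense triplet as printed by ESXi, e.g. "Valid sense data: 0x6 0x29 0x0".
SENSE_RE = re.compile(
    r"Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)

# Only the sense keys seen in this article (standard SCSI meanings).
SENSE_KEYS = {
    0x6: "UNIT ATTENTION",
    0xE: "MISCOMPARE",
}

# Only the ASC/ASCQ pairs seen in this article (standard SCSI meanings).
ASC_ASCQ = {
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
    (0x1D, 0x00): "MISCOMPARE DURING VERIFY OPERATION",
}


def decode_sense(line):
    """Return (sense key text, ASC/ASCQ text), or None if no sense data is present."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(value, 16) for value in match.groups())
    key_text = SENSE_KEYS.get(key, "unknown sense key 0x%x" % key)
    asc_text = ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ 0x%x/0x%x" % (asc, ascq))
    return key_text, asc_text


if __name__ == "__main__":
    print(decode_sense("H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0. Act:NONE."))
    print(decode_sense("failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0"))
```

Decoded this way, the 0x6 0x29 0x0 entries correspond to the Power-on Reset events in vobd.log, and the 0xe 0x1d 0x0 entry corresponds to the ATS miscompare on command 0x89.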

When a storage device is experiencing issues (like latency, dropped frames, drive failures or hardware errors), the ESXi host attempts to manage the I/O requests for that device. This leads to the following chain of events:

Excessive Resource Consumption by hostd: The hostd service is responsible for managing the ESXi host, including monitoring storage paths and handling device I/O failures. When a device becomes problematic, hostd consumes more resources as it attempts to:

Re-try I/O commands to the faulty device.
Perform path probing and management operations on paths that are dead or in doubt.
Process and log continuous SCSI errors, I/O aborts, and heartbeat timeouts.

Impact on Host Performance and vMotion: Resource exhaustion can overload the hostd service and the host's CPU, impacting overall host performance. Since vMotion operations rely heavily on stable host communication and resource availability, the resulting performance degradation and command queuing can cause vMotion to fail or time out.
ATS miscompares or failures and I/O aborts: These indicate storage-level contention, which can be triggered by a problematic LUN experiencing excessive latency.

Resolution

1. As temporary relief, detach or remove the problematic storage device from each ESXi host where the LUN is connected but not consumed. Then perform a storage rescan so that the change is reflected at the cluster level.

Performing a rescan of the storage on an ESXi host

Note:

Detaching a non-consumed LUN from a host and performing a rescan on the single host can remove the device and stop the host from continuously probing the inaccessible paths. This frees up hostd resources which can resolve operational issues like vMotion failures and VM sluggishness.
A rescan alone is insufficient if there are underlying storage issues such as a drive failure, high load, or other problems on the storage array.
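
The detach and rescan in step 1 can also be done from the ESXi shell. The following is a minimal Python sketch that only assembles the commands and prints them by default. It assumes the `esxcli storage core device set --state=off -d <device>` and `esxcli storage core adapter rescan --all` forms of the commands, so verify them against the esxcli reference for the ESXi build in use before running anything. The function names and the device identifier in the usage example are hypothetical.

```python
import subprocess


def detach_and_rescan_commands(device_id):
    """Build the esxcli argument lists to detach one device and rescan all adapters."""
    if not device_id:
        raise ValueError("a device identifier such as naa.<id> is required")
    return [
        # Detach the non-consumed device (assumed esxcli syntax; verify first).
        ["esxcli", "storage", "core", "device", "set", "--state=off", "-d", device_id],
        # Rescan all storage adapters on this host.
        ["esxcli", "storage", "core", "adapter", "rescan", "--all"],
    ]


def run(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if esxcli returns non-zero.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical identifier; substitute the real device ID from the logs.
    run(detach_and_rescan_commands("naa.EXAMPLE0025"))
```

Keeping `dry_run=True` as the default is a deliberate choice: detaching the wrong device affects running workloads, so the commands should be reviewed before they are executed.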

2. As a permanent fix, engage the storage vendor to investigate the underlying storage issues further.

It is recommended to perform a cluster-level storage rescan after any change to the cluster's storage, such as adding or removing a storage device.