VMs that use NVIDIA GRID vGPU intermittently fail to migrate. When this occurs, the vMotion task stalls at 68% and eventually fails with "Timed out waiting for migration data".
In vmware.log, the pciPassthru entry for the NVIDIA device may report an error while saving the checkpoint:
YYYY-MM-DDThh:mm:ss.###Z No(00) vcpu-0 - CheckpointTiming save: pciPassthru0 took 121238511 us
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - CPT: error saving group pciPassthru0, 0
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - Progress 0% (none)
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - Progress 101% (none)
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - DUMPER: Ending save. Expected 71 groups, but got 45.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - MigrateWriteHostLog: Writing to log file took 3547 us.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - MigrateSetStateFinished: type=1 new state=MIGRATE_TO_VMX_FINISHED
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - MigrateSetState: Transitioning from state MIGRATE_TO_VMX_CHECKPT (4) to MIGRATE_TO_VMX_FINISHED (6).
YYYY-MM-DDThh:mm:ss.###Z No(00) vcpu-0 - ConfigDB: Setting config.readOnly = "FALSE"
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - Migrate_SetFailureMsgList: switching to new log file.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - Migrate_SetFailureMsgList: Now in new log file.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - Migrate: Caching migration error message list:
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - [msg.vmx.plugin.vmiop.migrate.get.checkpoint.buffer.failed] Failed to get device checkpoint buffer.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - [msg.checkpoint.migration.writefail] Failed to write checkpoint data (offset 472195, size 6375): Failed to resume virtual machine.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - Msg_Post: Error
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - [msg.checkpoint.migration.writefail] Failed to write checkpoint data (offset 472195, size 6375): Failed to resume virtual machine.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - [msg.vmx.plugin.vmiop.migrate.get.checkpoint.buffer.failed] Failed to get device checkpoint buffer.
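To check whether a given vmware.log matches the pattern above, a small script along the following lines may help. This is only a sketch: the search strings are copied from the excerpt, while the function name `scan_vmware_log` and the dictionary labels are hypothetical names chosen for illustration. It also converts the `CheckpointTiming` value from microseconds to seconds (the 121238511 us in the excerpt is roughly 121 seconds).

```python
import re

# Search strings taken from the vmware.log excerpt above.
# The dictionary keys are hypothetical labels used only in this sketch.
SIGNATURES = {
    "checkpoint_save_error": "CPT: error saving group pciPassthru",
    "vmiop_buffer_failed": "msg.vmx.plugin.vmiop.migrate.get.checkpoint.buffer.failed",
    "checkpoint_write_failed": "msg.checkpoint.migration.writefail",
}

# Matches e.g. "CheckpointTiming save: pciPassthru0 took 121238511 us"
TIMING_RE = re.compile(r"CheckpointTiming save: (pciPassthru\d+) took (\d+) us")


def scan_vmware_log(lines):
    """Return (matched signature labels, {device: longest save time in seconds})."""
    found = set()
    save_seconds = {}
    for line in lines:
        # Record every known signature that appears on this line.
        for label, needle in SIGNATURES.items():
            if needle in line:
                found.add(label)
        # Track the longest checkpoint save time seen per passthrough device.
        match = TIMING_RE.search(line)
        if match:
            device = match.group(1)
            seconds = int(match.group(2)) / 1_000_000
            save_seconds[device] = max(save_seconds.get(device, 0.0), seconds)
    return found, save_seconds
```

The function accepts any iterable of lines, so an open file object can be passed directly, for example `scan_vmware_log(open(path, errors="replace"))`, where `path` points to the VM's vmware.log.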
In addition, vmkernel.log shows NVIDIA Xid errors such as the following:
YYYY-MM-DDThh:mm:ss.###Z In(182) vmkernel: cpu3:2098990)NVRM: Xid (PCI:0000:3f:00): XX
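When collecting data for NVIDIA support, it can be useful to summarize which Xid codes were logged for which PCI device. The sketch below assumes the line layout shown above and captures only the token that immediately follows the PCI address; any text after that token is ignored because its format is not shown in this article. The function name `count_xids` is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:3f:00): XX" and captures the PCI
# address and the token that follows it (the Xid code).
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\w+)")


def count_xids(lines):
    """Count Xid occurrences per (PCI address, Xid code) pair."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Passing the lines of vmkernel.log to `count_xids` returns a `Counter` keyed by `(pci_address, xid_code)`, which gives a quick view of how often each code occurred on each device.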
VMware vSphere ESXi
This issue is caused by a combination of an NVIDIA device problem and ESXi misbehavior. ESXi does not handle an NVIDIA Xid error raised during vMotion, so the migration eventually fails. However, the fact that the NVIDIA device raises an Xid error is a problem in itself and should be investigated.
The fix for the ESXi misbehavior described in the [Cause] section is included in ESXi 8.0 U3h and later.
We also recommend engaging NVIDIA support to investigate the cause of the Xid errors.