VMs using Nvidia Grid vGPU hitting Xids and failing to migrate at 68% with an error "Timed out waiting for migration data"

Article ID: 421571

Products

VMware vSphere ESXi

Issue/Introduction

VMs using NVIDIA GRID vGPU intermittently fail to migrate. When this occurs, the vMotion task stalls at 68% and eventually fails with the error "Timed out waiting for migration data".

In vmware.log, you might see the pciPassthru device for the NVIDIA GPU report an error while saving its checkpoint:

YYYY-MM-DDThh:mm:ss.###Z No(00) vcpu-0 - CheckpointTiming save: pciPassthru0 took 121238511 us
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - CPT: error saving group pciPassthru0, 0
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - Progress 0% (none)
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - Progress 101% (none)
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - DUMPER: Ending save. Expected 71 groups, but got 45.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - MigrateWriteHostLog: Writing to log file took 3547 us.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - MigrateSetStateFinished: type=1 new state=MIGRATE_TO_VMX_FINISHED
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - MigrateSetState: Transitioning from state MIGRATE_TO_VMX_CHECKPT (4) to MIGRATE_TO_VMX_FINISHED (6).
YYYY-MM-DDThh:mm:ss.###Z No(00) vcpu-0 - ConfigDB: Setting config.readOnly = "FALSE"
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - Migrate_SetFailureMsgList: switching to new log file.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - Migrate_SetFailureMsgList: Now in new log file.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - Migrate: Caching migration error message list:
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - [msg.vmx.plugin.vmiop.migrate.get.checkpoint.buffer.failed] Failed to get device checkpoint buffer.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - [msg.checkpoint.migration.writefail] Failed to write checkpoint data (offset 472195, size 6375): Failed to resume virtual machine.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - Msg_Post: Error
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - [msg.checkpoint.migration.writefail] Failed to write checkpoint data (offset 472195, size 6375): Failed to resume virtual machine.
YYYY-MM-DDThh:mm:ss.###Z In(05) vcpu-0 - [msg.vmx.plugin.vmiop.migrate.get.checkpoint.buffer.failed] Failed to get device checkpoint buffer.
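
To check a collected vmware.log for the same pattern, a short script can look for the signatures shown in the excerpt above. The following is only a minimal sketch: the function name `scan_vmware_log` and the 60-second threshold for a "slow" checkpoint save are illustrative assumptions, not values documented by VMware. The matched strings are copied from the log excerpt.

```python
import re

# Signature substrings copied from the vmware.log excerpt in this article.
SIGNATURES = (
    "CPT: error saving group pciPassthru",
    "msg.vmx.plugin.vmiop.migrate.get.checkpoint.buffer.failed",
    "msg.checkpoint.migration.writefail",
)

# Matches e.g. "CheckpointTiming save: pciPassthru0 took 121238511 us".
SAVE_TIMING = re.compile(r"CheckpointTiming save: (pciPassthru\d+) took (\d+) us")


def scan_vmware_log(lines, slow_save_seconds=60.0):
    """Return (signature_hits, slow_saves) for an iterable of log lines.

    slow_save_seconds is a hypothetical threshold chosen for illustration.
    """
    hits = []
    slow_saves = []
    for line in lines:
        # Record every known failure signature that appears on this line.
        for signature in SIGNATURES:
            if signature in line:
                hits.append(signature)
        # Convert the reported checkpoint save time from microseconds to seconds.
        match = SAVE_TIMING.search(line)
        if match:
            seconds = int(match.group(2)) / 1_000_000
            if seconds >= slow_save_seconds:
                slow_saves.append((match.group(1), seconds))
    return hits, slow_saves
```

The function accepts any iterable of lines, so an open file object for vmware.log can be passed directly. In the excerpt above, pciPassthru0 took 121238511 us, which is roughly 121 seconds.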

In addition, vmkernel.log shows NVIDIA Xid errors like the following:

YYYY-MM-DDThh:mm:ss.###Z In(182) vmkernel: cpu3:2098990)NVRM: Xid (PCI:0000:3f:00): XX
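
When gathering data for NVIDIA support, it can help to summarize which devices reported which Xid codes. The sketch below assumes the line format shown above, where the Xid code follows the PCI address (the article shows it as the placeholder XX). The function name `count_xids` is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:3f:00): XX" and captures the PCI
# address and the Xid code that follows it.
XID_LINE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\w+)")


def count_xids(lines):
    """Count Xid reports per (pci_address, xid_code) in vmkernel.log lines."""
    counts = Counter()
    for line in lines:
        match = XID_LINE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

Passing an open vmkernel.log file object returns a `Counter` keyed by PCI address and Xid code, which gives a quick view of whether the errors are confined to one GPU.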

Environment

VMware vSphere ESXi

Cause

This issue results from a combination of an NVIDIA device problem and an ESXi defect. ESXi does not handle an NVIDIA Xid error raised during vMotion, so the migration eventually fails. However, the fact that the NVIDIA device raises Xid errors at all is a separate problem and should be investigated.

Resolution

The fix for the ESXi misbehavior described in the Cause section is included in ESXi 8.0 U3h and later.

Also, we recommend engaging NVIDIA support to investigate the cause of Xid errors.