A vMotion of a VM with NVIDIA vGPUs (typically 2 x NVIDIA vGPUs) will frequently get stuck at 20%.
When it does get stuck, the VM becomes completely unresponsive and loses all connectivity (console, SSH, UI), necessitating manual intervention to recover.
The issue is intermittent; the same VM can vMotion successfully at one time and fail at another.
There is no clear pattern, and it can happen even when the affected VM is the only one on the ESX host.
The only resolution is to kill the VM process on the source host by running the below command on the ESX host -
esxcli vm process kill --type=force --world-id=<WorldID>
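To identify the World ID of the stuck VM, the running VM processes can be listed first (esxcli vm process list is a standard command; the exact output layout may vary by ESX release) -
esxcli vm process list
Each running VM is listed with its World ID, which can then be passed to the kill command above.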
The NVIDIA driver in use is 535.216.01, which corresponds to GRID 16.8.
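To confirm the driver version loaded on the host, the following commands can be run from the ESX shell (both are standard commands; the grep filter is just an example pattern) -
nvidia-smi
esxcli software vib list | grep -i nvidia
nvidia-smi reports the loaded NVIDIA driver version, and the vib list shows the installed NVIDIA host driver package.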
The following is observed in the vmware.log for the VM -
YYYY-MM-DDThh:mm:ss.sssZ In(05) vmx - MigrateSetState: Transitioning from state MIGRATE_TO_VMX_PREPARING (2) to MIGRATE_TO_VMX_PRECOPY (3).
YYYY-MM-DDThh:mm:ss.sssZ In(05) vmx - VMIOP: notifying plugin state prepare
YYYY-MM-DDThh:mm:ss.sssZ In(05) vmx - VMIOP: informing the plugin vmiop-display of checkpoint state change: 1
YYYY-MM-DDThh:mm:ss.sssZ In(05) worker-XXXXXXX - MigrateDevCptStreamBufFetcher: Finish fetching zero copy buffer from kernel. (endOfStreams false failureCode 0 isListRetired true)
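As an example of how to locate these entries, the VM's vmware.log can be searched directly on the host (the path below is a placeholder; substitute the actual datastore and VM directory) -
grep -E "MigrateSetState|VMIOP|MigrateDevCptStreamBufFetcher" /vmfs/volumes/<datastore>/<VM name>/vmware.log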
This issue has been observed with vCenter 8.0u3g.
An issue has been identified with the pre-copy notification related to the NVIDIA vGPUs: the NVIDIA driver is not returning control to the VMware stack as it needs to, resulting in the behavior described above.
Upgrade the NVIDIA driver to a version newer than 535.216.01; this is expected to resolve the issue.
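As a general sketch of the driver upgrade, assuming the NVIDIA vGPU host driver is supplied as an offline bundle ZIP (the path and file name below are placeholders; obtain the actual package newer than 535.216.01 from NVIDIA and follow NVIDIA's installation documentation for your release) -
esxcli system maintenanceMode set --enable true
esxcli software vib update -d /vmfs/volumes/<datastore>/<NVIDIA-vGPU-offline-bundle>.zip
esxcli system maintenanceMode set --enable false
A host reboot is generally required before the new driver is loaded.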
If the issue persists after the NVIDIA driver is upgraded to a version newer than 535.216.01, please do the following -
- Open a case with NVIDIA
- Open a case with Broadcom Support and share the NVIDIA case number