VM with NVIDIA vGPU gets stuck during vMotion


Article ID: 413748


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

A vMotion of a VM with NVIDIA vGPUs (typically two) frequently gets stuck at 20%.

When the vMotion gets stuck, the affected VM loses all connectivity (console, SSH, UI) and becomes completely unresponsive. Manual intervention is required to recover it.

The issue is intermittent; the same VM can vMotion successfully at one time and fail at another.

There is no clear pattern, and it can happen even when the affected VM is the only one on the ESX host.

The only way to recover the VM is to kill its process on the source ESX host with the following command:

esxcli vm process kill -t=force -w=<WorldID>
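The World ID used by the kill command can be read from the output of esxcli vm process list. The following is a minimal sketch, not an official procedure. The helper name world_id_for and the VM name my-vgpu-vm are hypothetical. It assumes the usual output layout in which the "World ID:" line precedes the "Display Name:" line within each VM's block.

```shell
# Print the World ID of the VM whose display name equals $1.
# Reads `esxcli vm process list` output on stdin.
# Assumption: within each VM block, "World ID:" appears before "Display Name:".
world_id_for() {
    awk -v name="$1" '
        /^[ \t]*World ID:/ { wid = $NF }            # remember the most recent World ID
        /^[ \t]*Display Name:/ {
            dn = $0
            sub(/^[ \t]*Display Name:[ \t]*/, "", dn)   # strip the label
            sub(/[ \t\r]+$/, "", dn)                    # strip trailing whitespace
            if (dn == name) print wid
        }
    '
}

# Example use on the source ESX host (VM name is a placeholder):
#   esxcli vm process list | world_id_for "my-vgpu-vm"
```

Confirm that the printed World ID belongs to the intended VM before passing it to the kill command, because a forced kill terminates the VM immediately.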

The NVIDIA driver in use is version 535.216.01, which corresponds to GRID 16.8.

The following entries are observed in the vmware.log of the VM:

YYYY-MM-DDThh:mm:ss.sssZ In(05) vmx - MigrateSetState: Transitioning from state MIGRATE_TO_VMX_PREPARING (2) to MIGRATE_TO_VMX_PRECOPY (3).  
YYYY-MM-DDThh:mm:ss.sssZ In(05) vmx - VMIOP: notifying plugin state prepare  
YYYY-MM-DDThh:mm:ss.sssZ In(05) vmx - VMIOP: informing the plugin vmiop-display of checkpoint state change: 1  
YYYY-MM-DDThh:mm:ss.sssZ In(05) worker-XXXXXXX - MigrateDevCptStreamBufFetcher: Finish fetching zero copy buffer from kernel. (endOfStreams false failureCode 0 isListRetired true)
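To check whether a given vmware.log shows the same pattern, a rough heuristic is to confirm that the last recorded migration state is MIGRATE_TO_VMX_PRECOPY and that the vmiop-display checkpoint notification is present. The sketch below assumes the log lines have the form shown above. The function name matches_precopy_hang is hypothetical. A healthy migration also passes through this state, so a match is only meaningful when the log has stopped progressing.

```shell
# Succeed (exit 0) if the vmware.log at path $1 matches the pattern above:
# the last MigrateSetState transition targets MIGRATE_TO_VMX_PRECOPY and the
# vmiop-display checkpoint notification was logged.
# Heuristic only; the message formats are taken from the excerpt in this article.
matches_precopy_hang() {
    log="$1"
    # Extract the target state of every transition and keep the last one.
    last=$(sed -n 's/.*MigrateSetState: Transitioning from state .* to \([A-Z_]*\) (.*/\1/p' "$log" | tail -n 1)
    [ "$last" = "MIGRATE_TO_VMX_PRECOPY" ] &&
        grep -q 'VMIOP: informing the plugin vmiop-display of checkpoint state change' "$log"
}

# Example use (path is a placeholder):
#   matches_precopy_hang /vmfs/volumes/<datastore>/<vm>/vmware.log && echo "matches pattern"
```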

 

 

Environment

vCenter 8.0u3g

 

Cause

An issue has been identified with the pre-copy notification related to the NVIDIA vGPUs.

The NVIDIA driver does not return control to the VMware stack as required, which causes the vMotion to hang.

 

Resolution

Upgrade the NVIDIA driver to a version newer than 535.216.01.

This is expected to resolve the issue.
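To confirm that a host is running a driver newer than 535.216.01, the installed version can be compared numerically, component by component. This is a minimal sketch. The helper name version_newer is hypothetical, and the commented usage assumes that nvidia-smi is available on the ESX host through the installed NVIDIA vGPU manager.

```shell
# Succeed (exit 0) if dotted version $1 is strictly newer than $2.
# Components are compared numerically, left to right; missing components count as 0.
version_newer() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = x[i] + 0
            yi = y[i] + 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 1   # versions are equal, so $1 is not newer
    }'
}

# Example use on the ESX host (assumes nvidia-smi is present):
#   installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   version_newer "$installed" 535.216.01 && echo "driver is newer than 535.216.01"
```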

Additional Information

If the issue persists after the NVIDIA driver is upgraded to a version newer than 535.216.01, do the following:

- Open a case with NVIDIA.

- Open a case with Broadcom Support and share the NVIDIA case number.