Symptoms:
VMware vCenter Server 7.0.x
VMware vCenter Server 8.0.x
VMware ESXi 7.0
VMware ESXi 8.0
vMotion of a vGPU-enabled VM fails if the NVIDIA GPU ECC mode differs between the source and destination ESXi hosts.
To confirm whether GPU ECC mode is enabled or disabled on the ESXi hosts:
1. Run the command "nvidia-smi -q" on both hosts and check the ECC Mode setting in the output.
Both ESXi hosts must have the same ECC memory setting for vMotion to succeed.
2. If you want to change the ECC status to OFF for all GPUs on your host machine or vGPUs assigned to the VM, run this command:
nvidia-smi -e 0
3. If you want to change the ECC status to ON for all GPUs on your host machine or vGPUs assigned to the VM, run this command:
nvidia-smi -e 1
4. Reboot the ESXi host to make this change effective.
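The check in step 1 and the reboot in step 4 can be partly scripted. The following is a minimal sketch, assuming a POSIX shell with awk is available on the host and that "nvidia-smi -q" prints an "ECC Mode" heading followed by indented "Current" and "Pending" lines. The function names ecc_current_modes and ecc_reboot_pending are hypothetical helpers, not part of nvidia-smi.

```shell
# Hypothetical helpers (not part of nvidia-smi). Both read `nvidia-smi -q`
# output on stdin. Assumed layout: an "ECC Mode" heading followed by
# indented "Current : <value>" and "Pending : <value>" lines.

# Print the current ECC mode of each GPU, one value per line.
ecc_current_modes() {
    awk '
        /^ *E[Cc][Cc] Mode *$/ { in_ecc = 1; next }
        in_ecc && /Current/ { sub(/^[^:]*: */, ""); print; in_ecc = 0 }
    '
}

# Exit 0 if any GPU has a pending ECC mode that differs from its current
# mode, meaning a host reboot is still needed; exit 1 otherwise.
ecc_reboot_pending() {
    awk '
        /^ *E[Cc][Cc] Mode *$/ { in_ecc = 1; next }
        in_ecc && /Current/ { sub(/^[^:]*: */, ""); cur = $0; next }
        in_ecc && /Pending/ { sub(/^[^:]*: */, ""); if ($0 != cur) differ = 1; in_ecc = 0 }
        END { exit (differ ? 0 : 1) }
    '
}

# Usage on each ESXi host:
#   nvidia-smi -q | ecc_current_modes
#   nvidia-smi -q | ecc_reboot_pending && echo "Reboot required"
```

Running the first helper on the source and destination hosts and comparing the printed values shows whether the two hosts agree before a vMotion is attempted.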
NVIDIA reference documentation:
https://docs.nvidia.com/vgpu/13.0/grid-vgpu-user-guide/index.html#gpumodeswitch
Section: "2.13. Disabling and Enabling ECC Memory"
This issue was observed with driver version:
NVD-VMware_ESXi_7.0.2_Driver - 525.125.03-1OEM.702.0.0.17630552
The problem was no longer seen after upgrading the driver to version:
NVD-VMware_ESXi_8.0.0_Driver - 525.125.03-1OEM.800.1.0.20613240
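To see which NVIDIA driver VIB a host is running, the installed VIB list can be filtered. This is a minimal sketch, assuming "esxcli software vib list" prints the VIB name in the first column and the version in the second. The function name nvd_vib_versions is a hypothetical helper.

```shell
# Hypothetical helper: read `esxcli software vib list` output on stdin and
# print "name version" for VIBs whose name starts with NVD or NVIDIA
# (assumes the name is column 1 and the version is column 2).
nvd_vib_versions() {
    awk 'toupper($1) ~ /^(NVD|NVIDIA)/ { print $1, $2 }'
}

# Usage on an ESXi host:
#   esxcli software vib list | nvd_vib_versions
```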