Symptoms:
VMware vCenter Server 7.0.x
VMware vCenter Server 8.0.x
vMotion of a vGPU-enabled VM fails if the NVIDIA GPU ECC mode differs between the source and destination ESXi hosts.
To confirm whether GPU ECC mode is enabled or disabled on the ESXi hosts:
1. Run the command "nvidia-smi" on both hosts. The picture below shows sample output.
The value circled in red is the ECC mode:
* 0 = ENABLED
* Off = DISABLED
The ECC mode must be the same on both ESXi hosts. If it is not, change one host to match the other.
2. Change the ECC mode with the following command on the ESXi host, specifying either ENABLED or DISABLED:
# nvidia-smi --ecc-config=ENABLED|DISABLED
For example, to disable ECC mode:
# nvidia-smi --ecc-config=DISABLED
3. Reboot the ESXi host for the change to take effect.
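The comparison in step 1 can also be done without reading the table by eye. The following is a minimal sketch, not a supported tool. It assumes the nvidia-smi build on each host accepts the query option "--query-gpu=ecc.mode.current --format=csv,noheader" and prints one line per GPU, such as "Enabled" or "Disabled". The function names normalize_ecc and compare_ecc are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Sketch: compare the GPU ECC modes reported by two ESXi hosts.
# Assumption: each argument holds the text printed on one host by
#   nvidia-smi --query-gpu=ecc.mode.current --format=csv,noheader
# (one line per GPU, for example "Enabled" or "Disabled").

# normalize_ecc TEXT
# Strip whitespace, drop empty lines, and reduce the input to the
# sorted set of distinct ECC modes reported by the host.
normalize_ecc() {
    printf '%s\n' "$1" | tr -d ' \t\r' | sed '/^$/d' | sort -u
}

# compare_ecc SRC_TEXT DST_TEXT
# Print "match" and return 0 when both hosts report the same non-empty
# set of ECC modes; otherwise print "mismatch" and return 1.
compare_ecc() {
    src=$(normalize_ecc "$1")
    dst=$(normalize_ecc "$2")
    if [ -n "$src" ] && [ "$src" = "$dst" ]; then
        echo "match"
        return 0
    fi
    echo "mismatch"
    return 1
}
```

To use it, run the query on each host, save the output to a file (for example src.txt and dst.txt, names chosen here only for illustration), and then call compare_ecc "$(cat src.txt)" "$(cat dst.txt)". A result of "mismatch" means one host needs its ECC mode changed as described in step 2, followed by the reboot in step 3.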
This issue was noticed when using driver version:
NVD-VMware_ESXi_7.0.2_Driver - 525.125.03-1OEM.702.0.0.17630552
We did not see the problem after upgrading the driver to version:
NVD-VMware_ESXi_8.0.0_Driver - 525.125.03-1OEM.800.1.0.20613240