ESXi hosts may suddenly begin reporting 0B GPU memory for NVIDIA Tesla M10 cards in the vSphere Client, while `nvidia-smi` continues to show the correct 8GB memory allocation per device. This prevents new VMs with vGPU profiles from being provisioned, though existing VMs continue to run normally.
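The driver-side view can be confirmed by querying the framebuffer size per GPU in CSV form and flagging any device that reads as 0 MiB. A minimal sketch: the `nvidia-smi` query flags are real, but the captured output below is a hypothetical healthy M10 (four physical GPUs with 8192 MiB each) so the parsing step can be shown end to end.

```shell
# On the ESXi host this capture would come from:
#   nvidia-smi --query-gpu=index,name,memory.total --format=csv > /tmp/smi.csv
# Hypothetical healthy capture substituted here:
cat > /tmp/smi.csv <<'EOF'
index, name, memory.total [MiB]
0, Tesla M10, 8192 MiB
1, Tesla M10, 8192 MiB
2, Tesla M10, 8192 MiB
3, Tesla M10, 8192 MiB
EOF
# Flag any GPU whose reported framebuffer is 0 MiB (prints nothing when healthy)
awk -F', ' 'NR > 1 && $3 + 0 == 0 { print "GPU " $1 " reports 0 MiB" }' /tmp/smi.csv
```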
The issue stems from excessive NVIDIA NVRM logging interfering with GPU memory size queries. Evidence appears in vmkernel.log as large numbers of repetitive NVRM entries:
grep nvrm-nvlog /var/log/vmkernel.log
2024-10-31T21:09:06.580Z cpu23:xxxxxxxx)nvrm-nvlog: (...a string of letters follow here...)
2024-10-31T21:09:06.580Z cpu23:xxxxxxxx)nvrm-nvlog: (...a string of letters follow here...)
...
(hundreds of similar lines follow here)
To work around the issue, disable the NVRM log dump via the NVIDIA module parameter below, then reboot the host for the change to take effect:

esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RMDumpNvLog=0"
General NVIDIA GRID GPU analysis:
lspci | grep VGA
0000:e2:00.0 VGA compatible controller: NVIDIA Corporation Tesla M10 [vmgfx2]
0000:e3:00.0 VGA compatible controller: NVIDIA Corporation Tesla M10 [vmgfx3]
esxcli software vib list | grep -i nvd
NVD-VMware_ESXi_X.X.X_Driver
nvdgpumgmtdaemon
nvidia-smi                          # driver-level view of each physical GPU
esxcli graphics device list         # graphics devices as seen by the host
esxcli graphics device stats list   # per-device statistics
esxcli graphics host get            # host graphics configuration (default graphics type)
esxcli graphics vm list             # VMs currently using graphics devices
ntpq -p                             # NTP peer status, useful when correlating log timestamps
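For a support case or a before/after comparison, the diagnostics above can be captured in one pass. A minimal collection sketch (the output path is an assumption); commands unavailable on a given host are simply recorded as errors rather than aborting the run:

```shell
#!/bin/sh
# Collect the GPU diagnostics above into a single file (path is an assumption)
OUT=/tmp/gpu-diag.txt
: > "$OUT"
for cmd in "nvidia-smi" \
           "esxcli graphics device list" \
           "esxcli graphics device stats list" \
           "esxcli graphics host get" \
           "esxcli graphics vm list" \
           "ntpq -p"; do
    echo "== $cmd ==" >> "$OUT"
    $cmd >> "$OUT" 2>&1 || true   # keep going if a command fails or is missing
done
echo "Diagnostics written to $OUT"
```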