ESXi Host Reports 0B GPU Memory for NVIDIA Tesla M10 Despite Correct nvidia-smi Output

Article ID: 382059

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

ESXi hosts may suddenly begin reporting 0B GPU memory for NVIDIA Tesla M10 cards in the vSphere Client, while `nvidia-smi` continues to show the correct 8GB memory allocation. This prevents new VMs with vGPU profiles from being provisioned, though existing VMs continue to run normally.
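To confirm the mismatch from the ESXi shell (a quick check, assuming SSH access to the host), compare what the host agent reports against what the NVIDIA driver reports:

    # Memory as seen by the ESXi host agent; affected hosts show 0 bytes here
    esxcli graphics device list

    # Memory as seen by the NVIDIA driver; Tesla M10 devices should show 8192 MiB each
    nvidia-smi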

Environment

  • VMware ESXi 7.0 or newer
  • NVIDIA Tesla M10 GPUs
  • NVIDIA GRID vGPU profiles
  • ESXi shows graphics devices with 0B RAM in vSphere Client
  • `nvidia-smi` shows correct 8GB RAM per device
  • Unable to start new VMs with vGPU profiles
  • Error messages include:
    • "No host is compatible with the virtual machine"
    • "Could not initialize plugin 'libnvidia-vgx.so' for vGPU"
    • "Insufficient resources. One or more devices (pciPassthru0) required by VM"

 

Cause

The issue stems from excessive NVIDIA NVRM logging interfering with GPU memory size queries. Evidence can be found in vmkernel.log, which shows large numbers of repetitive NVRM entries:

cat /var/log/vmkernel.log | grep nvrm-nvlog

2024-10-31T21:09:06.580Z cpu23:xxxxxxxx)nvrm-nvlog: (...a string of letters follows here...)
2024-10-31T21:09:06.580Z cpu23:xxxxxxxx)nvrm-nvlog: (...a string of letters follows here...)
2024-10-31T21:09:06.580Z cpu23:xxxxxxxx)nvrm-nvlog: (...a string of letters follows here...)
...
(hundreds of similar lines follow here)
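
To gauge how much of the log these messages are consuming, counting the entries in the live log is a quick check (assuming the default log location, where /var/log/vmkernel.log points to the active log):

    # Count nvrm-nvlog entries in the current vmkernel log
    grep -c nvrm-nvlog /var/log/vmkernel.log

    # Watch whether new entries are still being written
    tail -f /var/log/vmkernel.log | grep nvrm-nvlog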

Resolution

  • As a workaround, disable the excessive logging with the following command:
    • esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RMDumpNvLog=0"
    • Wait 10 seconds, then refresh the host in vCenter
    • If the GPU reports correct values when it is queried, "Disconnect" and then "Reconnect" the host in vCenter (see the verification example after this list)

  • For a permanent resolution, or if the issue persists:
    1. Collect ESXi host diagnostic information per Collecting diagnostic information for ESX/ESXi hosts and vCenter Server using the vSphere Web Client

    2. Contact NVIDIA Support with:
      1. ESXi host support bundle
      2. Output from all commands listed under General NVIDIA GRID GPU analysis below
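
To verify that the workaround took effect (a minimal check, assuming SSH access; output field names can vary by driver release):

    # Confirm the logging parameter is now set on the nvidia module
    esxcli system module parameters list -m nvidia | grep NVreg_RegistryDwords

    # Confirm the host agent reports non-zero memory for each device again
    esxcli graphics device list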

 

 

Additional Information

  • After implementing the workaround, monitor vmkernel.log for recurring NVRM entries
  • Compare the nvidia-smi process list with the esxcli graphics vm list output to identify discrepancies (see the example after this list)
  • An efuse reset (hardware power reset) of the blade slot may temporarily resolve the issue
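
A minimal way to make that comparison from the ESXi shell (assuming SSH access; the two tools label VMs differently, so match entries by VM name where possible):

    # VMs that ESXi believes currently hold a graphics/vGPU allocation
    esxcli graphics vm list

    # VMs/processes the NVIDIA driver sees on each physical GPU
    nvidia-smi

    # On vGPU manager builds that include the vgpu subcommand, list active vGPUs per VM
    nvidia-smi vgpu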

 

General NVIDIA GRID GPU analysis:

  1. Verify GPU visibility in ESXi:
    1. lspci | grep VGA
    2. Expected output should show multiple NVIDIA Tesla M10 entries with [vmgfx] tags
      Example:
          0000:e2:00.0 VGA compatible controller: NVIDIA Corporation Tesla M10 [vmgfx2]
          0000:e3:00.0 VGA compatible controller: NVIDIA Corporation Tesla M10 [vmgfx3]

  2. Check NVIDIA driver installation:
    1. esxcli software vib list | grep -i nvd
    2. Look for two VIBs:
          NVD-VMware_ESXi_X.X.X_Driver
              Note that X.X.X is the version, such as "7.0.2"
          nvdgpumgmtdaemon
    3. Verify that versions match across all hosts in the cluster

  3. Check GPU status and memory using nvidia-smi:
    1. nvidia-smi
          This shows:
      • Driver Version
      • GPU Memory capacity and usage
      • Active processes/VMs using each GPU
      • Temperature and power consumption

  4. Check ESXi graphics configuration:
    1. esxcli graphics device list
    2. esxcli graphics device stats list
    3. esxcli graphics host get
    4. esxcli graphics vm list

      Compare memory values between these commands and the nvidia-smi output (see the example after this list)

  5. Verify time synchronization:
    1. ntpq -p
    2. Check for active NTP servers and proper synchronization
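
For step 4, a minimal sketch of extracting just the memory figures for comparison (assuming the vGPU manager's nvidia-smi supports --query-gpu, as recent GRID releases do; esxcli field names vary by ESXi release, so adjust the grep pattern as needed):

    # Memory size per graphics device as reported by the ESXi host agent
    esxcli graphics device list | grep -i memory

    # Total memory per physical GPU as reported by the NVIDIA driver
    nvidia-smi --query-gpu=pci.bus_id,memory.total --format=csv,noheader

On a healthy host both views report roughly 8GB per Tesla M10 device; on an affected host only the ESXi side drops to 0.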