ESXi Host Reports 0B GPU Memory for NVIDIA Tesla M10 Despite Correct nvidia-smi Output

Article ID: 382059

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

ESXi hosts may suddenly begin reporting 0B GPU memory for NVIDIA Tesla M10 cards in the vSphere Client, while `nvidia-smi` continues to show the correct 8GB memory allocation. This prevents new VMs with vGPU profiles from being provisioned, though existing VMs continue to run normally.
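To confirm the mismatch from the ESXi shell (a quick check, assuming SSH access to the host), compare what the host agent reports against what the NVIDIA driver reports:

    # Memory as seen by the ESXi host agent; affected hosts show 0 bytes here
    esxcli graphics device list

    # Memory as seen by the NVIDIA driver; Tesla M10 devices should show 8192 MiB each
    nvidia-smi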

Environment

  • VMware ESXi 7.0 or newer
  • NVIDIA Tesla M10 GPUs
  • NVIDIA GRID vGPU profiles
  • ESXi shows graphics devices with 0B RAM in vSphere Client
  • `nvidia-smi` shows correct 8GB RAM per device
  • Unable to start new VMs with vGPU profiles
  • Error messages include:
    • "No host is compatible with the virtual machine"
    • "Could not initialize plugin 'libnvidia-vgx.so' for vGPU"
    • "Insufficient resources. One or more devices (pciPassthru0) required by VM"

 

Cause

The issue stems from excessive NVIDIA NVRM logging interfering with GPU memory size queries. Evidence can be found in vmkernel.log, which shows large numbers of repetitive NVRM entries:

cat /var/log/vmkernel.log | grep nvrm-nvlog

2024-10-31T21:09:06.580Z cpu23:xxxxxxxx)nvrm-nvlog: (...a string of letters follows here...)
2024-10-31T21:09:06.580Z cpu23:xxxxxxxx)nvrm-nvlog: (...a string of letters follows here...)
2024-10-31T21:09:06.580Z cpu23:xxxxxxxx)nvrm-nvlog: (...a string of letters follows here...)
...
(hundreds of similar lines follow here)
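
To gauge how much of the log these messages are consuming, counting the entries in the live log is a quick check (assuming the default log location, where /var/log/vmkernel.log points to the active log):

    # Count nvrm-nvlog entries in the current vmkernel log
    grep -c nvrm-nvlog /var/log/vmkernel.log

    # Watch whether new entries are still being written
    tail -f /var/log/vmkernel.log | grep nvrm-nvlog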

Resolution

  • As a workaround, disable the excessive logging with the following command:
    • esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RMDumpNvLog=0"
    • Wait 10 seconds, then refresh the host in vCenter
    • If the GPU reports correct values when it is queried, "Disconnect" and then "Reconnect" the host in vCenter (see the verification example after this list)

  • For a permanent resolution, or if the issue persists:
    1. Collect ESXi host diagnostic information per Collecting diagnostic information for ESX/ESXi hosts and vCenter Server using the vSphere Web Client

    2. Contact NVIDIA Support with:
      1. ESXi host support bundle
      2. Output from all commands listed under General NVIDIA GRID GPU analysis below
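
To verify that the workaround took effect (a minimal check, assuming SSH access; output field names can vary by driver release):

    # Confirm the logging parameter is now set on the nvidia module
    esxcli system module parameters list -m nvidia | grep NVreg_RegistryDwords

    # Confirm the host agent reports non-zero memory for each device again
    esxcli graphics device list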

 

 

Additional Information

  • After implementing the workaround, monitor vmkernel.log for recurring NVRM entries
  • Compare the nvidia-smi process list with the esxcli graphics vm list output to identify discrepancies (see the example after this list)
  • An efuse reset (hardware power reset) of the blade slot may temporarily resolve the issue
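
A minimal way to make that comparison from the ESXi shell (assuming SSH access; the two tools label VMs differently, so match entries by VM name where possible):

    # VMs that ESXi believes currently hold a graphics/vGPU allocation
    esxcli graphics vm list

    # VMs/processes the NVIDIA driver sees on each physical GPU
    nvidia-smi

    # On vGPU manager builds that include the vgpu subcommand, list active vGPUs per VM
    nvidia-smi vgpu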

 

General NVIDIA GRID GPU analysis:

  1. Verify GPU visibility in ESXi:
    1. lspci | grep VGA
    2. Expected output should show multiple NVIDIA Tesla M10 entries with [vmgfx] tags
      Example:
          0000:e2:00.0 VGA compatible controller: NVIDIA Corporation Tesla M10 [vmgfx2]
          0000:e3:00.0 VGA compatible controller: NVIDIA Corporation Tesla M10 [vmgfx3]

  2. Check NVIDIA driver installation:
    1. esxcli software vib list | grep -i nvd
    2. Look for two VIBs:
          NVD-VMware_ESXi_X.X.X_Driver
              Note that X.X.X is the version, such as "7.0.2"
          nvdgpumgmtdaemon
    3. Verify that versions match across all hosts in the cluster

  3. Check GPU status and memory using nvidia-smi:
    1. nvidia-smi
          This shows:
      • Driver Version
      • GPU Memory capacity and usage
      • Active processes/VMs using each GPU
      • Temperature and power consumption

  4. Check ESXi graphics configuration:
    1. esxcli graphics device list
    2. esxcli graphics device stats list
    3. esxcli graphics host get
    4. esxcli graphics vm list

      Compare memory values between these commands and the nvidia-smi output (see the example after this list)

  5. Verify time synchronization:
    1. ntpq -p
    2. Check for active NTP servers and proper synchronization
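
For step 4, a minimal sketch of extracting just the memory figures for comparison (assuming the vGPU manager's nvidia-smi supports --query-gpu, as recent GRID releases do; esxcli field names vary by ESXi release, so adjust the grep pattern as needed):

    # Memory size per graphics device as reported by the ESXi host agent
    esxcli graphics device list | grep -i memory

    # Total memory per physical GPU as reported by the NVIDIA driver
    nvidia-smi --query-gpu=pci.bus_id,memory.total --format=csv,noheader

On a healthy host both views report roughly 8GB per Tesla M10 device; on an affected host only the ESXi side drops to 0.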