'nvidia-smi' module causes ESXi host to become unresponsive

Article ID: 386012

Products

VMware vSAN

Issue/Introduction

Symptoms:

The ESXi host becomes unresponsive.
The host becomes accessible again after a reboot.

/var/run/log/vmkernel.log on the ESXi host is filled with admission failure events from the 'nvidia-smi' module, as shown below:

YYYY-MM-DDTHH:MM:SS.Z cpu72:274936160)Admission failure in path: host/vim/vimuser/terminal/ssh:nvidia-smi.274936232:uw.274936232
YYYY-MM-DDTHH:MM:SS.Z cpu72:274936160)UserWorld 'nvidia-smi' with cmdline 'unknown'
YYYY-MM-DDTHH:MM:SS.Z cpu72:274936160)uw.274936232 (2403560232) extraMin/extraFromParent: 3059/3059, ssh (672) childEmin/eMinLimit: 203341/204800
YYYY-MM-DDTHH:MM:SS.Z cpu72:274936160)WARNING: LinuxThread: 424: nvidia-smi: Error cloning thread: -28 (bad0081)
YYYY-MM-DDTHH:MM:SS.Z cpu72:274936160)Admission failure in path: host/vim/vimuser/terminal/ssh:nvidia-smi.274936233:uw.274936233
YYYY-MM-DDTHH:MM:SS.Z cpu72:274936160)UserWorld 'nvidia-smi' with cmdline 'unknown'
YYYY-MM-DDTHH:MM:SS.Z cpu72:274936160)uw.274936233 (2403560241) extraMin/extraFromParent: 3059/3059, ssh (672) childEmin/eMinLimit: 203352/204800
YYYY-MM-DDTHH:MM:SS.Z cpu72:274936160)WARNING: LinuxThread: 424: nvidia-smi: Error cloning thread: -28 (bad0081)
.. 
.. 
YYYY-MM-DDTHH:MM:SS.Z cpu16:2105810)Admission failure in path: host/vim/vmvisor/NVIDIAHost:nv-hostengine.274936450:uw.274936450
YYYY-MM-DDTHH:MM:SS.Z cpu16:2105810)UserWorld 'nv-hostengine' with cmdline 'unknown'
YYYY-MM-DDTHH:MM:SS.Z cpu16:2105810)uw.274936450 (2403562149) extraMin/extraFromParent: 13643/13643, NVIDIAHost (39528) childEmin/eMinLimit: 20349/32768
YYYY-MM-DDTHH:MM:SS.Z cpu16:2105810)WARNING: LinuxThread: 424: nv-hostengine: Error cloning thread: -28 (bad0081)
YYYY-MM-DDTHH:MM:SS.Z cpu97:274934361)WARNING: Heap: 3898: Could not allocate 102400 bytes for dynamic heap vsansparse. Request returned Out of memory (ok to retry)
YYYY-MM-DDTHH:MM:SS.Z cpu97:274934361)WARNING: Heap: 4109: Heap_Align(vsansparse, 98776/98776 bytes, 8 align) failed.  caller: 0x420018fb3580
YYYY-MM-DDTHH:MM:SS.Z cpu97:274934361)WARNING: Heap: 3898: Could not allocate 102400 bytes for dynamic heap vsansparse. Request returned Out of memory (ok to retry)
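
To gauge how often each process hits an admission failure, the events can be tallied from a copy of vmkernel.log. The following is a minimal Python sketch, not a supported tool. It assumes the log lines follow the exact format shown above, and the function name count_admission_failures and the command-line usage are hypothetical.

```python
import re
import sys
from collections import Counter

# Matches lines such as:
#   Admission failure in path: host/vim/vimuser/terminal/ssh:nvidia-smi.274936232:uw.274936232
# The text between the last resource group and the first '.' is taken as the
# UserWorld name. This pattern is inferred from the sample lines only.
ADMISSION_RE = re.compile(
    r"Admission failure in path: \S+:(?P<name>[\w-]+)\.\d+:uw\.\d+"
)


def count_admission_failures(lines):
    """Return a Counter mapping UserWorld name to its admission failure count."""
    counts = Counter()
    for line in lines:
        match = ADMISSION_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: python count_failures.py vmkernel.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log:
            for name, total in count_admission_failures(log).most_common():
                print(f"{name}: {total}")
```

A count dominated by 'nvidia-smi' (and possibly 'nv-hostengine') would be consistent with the symptoms described in this article.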

Environment

VMware ESXi 7.x
VMware ESXi 8.x

Cause

A memory leak in the nvidia-smi module causes the ESXi host to become unresponsive.
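
The extraMin and childEmin/eMinLimit figures in the log excerpt under Symptoms hint at how close each resource group is to its limit. The Python sketch below extracts them from a single line. It rests on an assumption inferred from the sample lines rather than from VMware documentation: admission fails when the requested extraMin exceeds the headroom left in the parent group (eMinLimit minus childEmin). The function name reservation_shortfall is hypothetical, and values are kept in whatever unit the log reports.

```python
import re

# Matches lines such as:
#   uw.274936232 (2403560232) extraMin/extraFromParent: 3059/3059, ssh (672) childEmin/eMinLimit: 203341/204800
RESERVATION_RE = re.compile(
    r"extraMin/extraFromParent: (?P<extra_min>\d+)/\d+, "
    r"(?P<group>\S+) \(\d+\) childEmin/eMinLimit: (?P<child_emin>\d+)/(?P<limit>\d+)"
)


def reservation_shortfall(line):
    """Return the parent group, requested amount, headroom, and shortfall.

    Returns None when the line does not contain the reservation figures.
    The interpretation of the fields is an assumption (see the text above).
    """
    match = RESERVATION_RE.search(line)
    if match is None:
        return None
    requested = int(match.group("extra_min"))
    headroom = int(match.group("limit")) - int(match.group("child_emin"))
    return {
        "group": match.group("group"),
        "requested": requested,
        "headroom": headroom,
        # A positive shortfall means the request did not fit in the group.
        "shortfall": requested - headroom,
    }
```

Applied to the sample lines, both the ssh and NVIDIAHost groups show a positive shortfall, which fits a pattern of memory being consumed and not released.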

Resolution

The NVIDIA System Management Interface (nvidia-smi) is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.

Please engage NVIDIA support to investigate the memory leak and nvidia-smi tool behavior with the current driver version.