ESXi Host remediation fails with "Failed to remove Component NVD-VGPU-800"

Article ID: 432714

Updated On:

Products

  • VMware vSphere ESXi

  • VMware vSphere ESX 8.x

Issue/Introduction

  • When you update or remediate an ESXi host using vSphere Lifecycle Manager (vLCM), the task fails with an error similar to:

    • Remediation failed, Failed to remove Component NVD-VGPU-800(<Version>) files may still be in use.

  • This issue typically occurs in environments using NVIDIA AI Enterprise or GRID vGPU drivers.

  • The ESXi log at /var/log/esxupdate.log contains errors similar to:

YYYY-MM-DDT HH:MM:SS [ESX_ACCEPTANCE_ERROR] Failed to remove Component NVD-VGPU-800(<Version_ID>), files may still be in use.

YYYY-MM-DDT HH:MM:SS [vLCM_REMEDIATION_FAILURE] Host: <ESXi_FQDN> Status: FAILED
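
To confirm that a host is hitting this error, the log can be searched from an SSH session. The following is a minimal sketch, not an official procedure: the function name find_vgpu_removal_errors and the LOG_FILE variable are hypothetical, and the search string is taken from the error text quoted above.

```shell
#!/bin/sh
# Sketch: search the esxupdate log for the component removal failure.
# LOG_FILE is a hypothetical override; the default path comes from the article.
LOG_FILE="${LOG_FILE:-/var/log/esxupdate.log}"

# Print matching lines prefixed with their line numbers.
# Exit status is 0 when at least one match is found, non-zero otherwise.
find_vgpu_removal_errors() {
    grep -n -F "Failed to remove Component NVD-VGPU" "$LOG_FILE"
}
```

Calling find_vgpu_removal_errors on the affected host prints any matching entries; no output suggests the remediation failed for a different reason.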

 

Environment

  • VMware vSphere ESXi 8.x

  • NVIDIA vGPU Manager

Cause

The failure is caused by the nvdGpuMgmtDaemon service remaining active during the VIB removal or upgrade process. This service maintains an open handle on the NVIDIA driver files, preventing the ESXi host from removing the existing component.

Resolution

To resolve this issue, manually stop the NVIDIA services on the affected host:

  1. Place the affected ESXi host into Maintenance Mode.

  2. Log in to the host via SSH as root.

  3. Run the following commands to check service status:

    • /etc/init.d/nvdGpuMgmtDaemon status
    • /etc/init.d/xorg status
    • /etc/init.d/nvidia-vgpu status

  4. Stop each service that the output above reports as "Running":

    • /etc/init.d/nvdGpuMgmtDaemon stop
    • /etc/init.d/xorg stop
    • /etc/init.d/nvidia-vgpu stop

  5. Return to the vSphere Client and retry the remediation of the host.

  6. After remediation succeeds, the host reboots and the services restart automatically.
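
Steps 3 and 4 can be combined into a small script, which may be convenient when several hosts are affected. This is a sketch under stated assumptions, not a supported tool: the function names and the INIT_DIR variable are hypothetical, and the script assumes that each init script's status output contains the word "running" when the service is up and "not running" or "stopped" when it is down. Check the actual status output on your host before relying on it.

```shell
#!/bin/sh
# Sketch: stop the NVIDIA-related services named in the steps above,
# but only those whose status output reports them as running.

# Directory holding the init scripts (hypothetical override for dry runs).
INIT_DIR="${INIT_DIR:-/etc/init.d}"

# Service names taken from steps 3 and 4.
NVIDIA_SERVICES="nvdGpuMgmtDaemon xorg nvidia-vgpu"

# Return 0 if the status output indicates the service is running.
# Assumption: the output wording matches the patterns below.
is_running() {
    out=$("$INIT_DIR/$1" status 2>&1)
    case "$(printf '%s' "$out" | tr 'A-Z' 'a-z')" in
        *"not running"*|*stopped*) return 1 ;;
        *running*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check each service and stop it if it is running.
stop_nvidia_services() {
    for svc in $NVIDIA_SERVICES; do
        if [ ! -x "$INIT_DIR/$svc" ]; then
            echo "skip: $svc (no init script found)"
            continue
        fi
        if is_running "$svc"; then
            echo "stopping: $svc"
            "$INIT_DIR/$svc" stop
        else
            echo "already stopped: $svc"
        fi
    done
}
```

With the host in Maintenance Mode, call stop_nvidia_services from the SSH session and then retry the remediation from the vSphere Client. Services whose init script is absent are skipped rather than treated as errors, since not every host has all three installed.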