Virtual Machine Crashes and is Rebooted by vSphere HA with NVIDIA Plugin Error: "GSP plugin task crashed"

Article ID: 438288

Updated On:

Products

VMware vSphere ESXi 8.0

Issue/Introduction

  • Virtual machines utilizing NVIDIA L4 GPUs crash unexpectedly.
  • vSphere High Availability (HA) automatically restarts the affected virtual machines.
  • The following event is observed in vCenter Server at the cluster level:
    "The virtual machine was restarted automatically by vSphere HA on this host. This response may be triggered by a failure of the host on which the virtual machine was originally running, or by an unclean power-off of the virtual machine (e.g., if the VMX process was killed)"
  • Additionally, the memory utilization reported for the NVIDIA L4 GPU shows 0. Utilization is reported correctly again after the ESXi host is rebooted.
  • The affected virtual machine's vmware.log file displays the following errors indicating a GPU System Processor (GSP) plugin task crash and RPC timeouts:
    • Location: /vmfs/volumes/<datastore>/<VM folder>/
    • YYYY-MM-DDT0HH:MM:SSvthread-3459696 - vmiop log: (0x0): GSP plugin task crashed. VM shutdown is required.
      YYYY-MM-DDT0HH:MM:SS (05) vmx 311e3bed-da-971a vigor Reset: Attaching to reset.
      YYYY-MM-DDT0HH:MM:SS (05) vcpu-0 VMIOP: informing the plugin vmiop-display of checkpoint state change: 2
      YYYY-MM-DDT0HH:MM:SS (02) vcpu-0 vmiop_log: (0x0): Timed out, GSP has not started processing message 14
      YYYY-MM-DDT0HH:MM:SS (02) vcpu-0 vmiop_log: (0x0): CPU RPC 14 fw response failed: 0x7
      YYYY-MM-DDT0HH:MM:SS (02) vcpu-0 vmiop_log: (0x0): Migration Buff Reset RPC failed: 0x7
      YYYY-MM-DDT0HH:MM:SS (02) vcpu-0-da-971a vmiop_log: (0x0): stop work failed

  • The ESXi host's vmkernel.log shows sequence-number errors and possible memory corruption reported by the NVIDIA Resource Manager (NVRM):
    • Location: /var/run/log/vmkernel.log
    • YYYY-MM-DDT0HH:MM:SS In(182) vmkernel: cpu77:2100363)NVRM: _issueRpcAndWait: rpcRecvPoll failed with status 0x00000025 for fn 76 sequence 45527850!
      YYYY-MM-DDT0HH:MM:SS In(182) vmkernel: cpu93:2097456)NVRM: GspMsgQueueReceiveStatus: Bad sequence number.  Expected 47333697 got 47333698. Possible memory corruption.
      YYYY-MM-DDT0HH:MM:SS In(182) vmkernel: cpu93:2097456)NVRM: GspMsgQueueReceiveStatus: Bad sequence number.  Expected 47333697 got 47333698. Possible memory corruption.
      YYYY-MM-DDT0HH:MM:SS In(182) vmkernel: cpu93:2097456)NVRM: GspMsgQueueReceiveStatus: Bad sequence number.  Expected 47333697 got 47333698. Possible memory corruption.
      YYYY-MM-DDT0HH:MM:SS In(182) vmkernel: cpu93:2097456)NVRM: GspMsgQueueReceiveStatus: Read failed after 3 retries.
      YYYY-MM-DDT0HH:MM:SS vmkernel: cpu93:2097456)NVRM: nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from _kgspRpcDrainEvents(pGpu, pKernelGsp, NV_VGPU_MSG_FUNCTION_NUM_FUNCTIONS, 0, KGSP_RPC_EVENT_HANDLE$
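
To check whether a copy of vmware.log or vmkernel.log contains the signatures quoted above, a small script along the following lines may help. This is a minimal sketch, not a VMware or NVIDIA tool: the pattern names, the scan_lines and scan_file helpers, and the choice of which messages to match are assumptions made for illustration, and the patterns are taken only from the log excerpts in this article.

```python
import re
import sys
from collections import Counter

# Patterns derived from the log excerpts in this article (names are hypothetical).
SIGNATURES = {
    "gsp_task_crashed": re.compile(r"GSP plugin task crashed"),
    "gsp_rpc_timeout": re.compile(r"Timed out, GSP has not started processing message"),
    "rpc_recv_poll_failed": re.compile(r"rpcRecvPoll failed with status (0x[0-9a-fA-F]+)"),
    "bad_sequence": re.compile(r"Bad sequence number\.\s+Expected (\d+) got (\d+)"),
    "read_failed": re.compile(r"GspMsgQueueReceiveStatus: Read failed after \d+ retries"),
}


def scan_lines(lines):
    """Count signature hits and collect (expected, got) pairs from bad-sequence lines."""
    counts = Counter()
    mismatches = []
    for line in lines:
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if not match:
                continue
            counts[name] += 1
            if name == "bad_sequence":
                mismatches.append((int(match.group(1)), int(match.group(2))))
    return counts, mismatches


def scan_file(path):
    """Scan one log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as handle:
        return scan_lines(handle)


if __name__ == "__main__":
    # Usage (paths are examples): python scan_gsp.py vmware.log vmkernel.log
    for log_path in sys.argv[1:]:
        hit_counts, seq_mismatches = scan_file(log_path)
        print(log_path, dict(hit_counts), seq_mismatches)
```

Hits for gsp_task_crashed in vmware.log together with bad_sequence or read_failed in vmkernel.log around the same time would be consistent with the symptoms described in this article. The script only reports matches and does not confirm the root cause.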

Environment

  • VMware ESXi 8.0.3
  • NVIDIA L4 GPU

Cause

  • This issue occurs due to a fatal error within the NVIDIA GPU System Processor (GSP) plugin.
  • The ESXi host's vmkernel.log indicates possible memory corruption at the hardware or driver level, which causes the GSP plugin task to crash. Because the virtual machine's VMX process relies on this plugin for GPU operations, the task failure forces an unclean power-off of the virtual machine.
  • vSphere HA detects this abrupt shutdown and automatically restarts the virtual machine to restore service.

Resolution

Contact the NVIDIA team for further assistance.