vSphere High Availability (HA) automatically restarts the affected virtual machines.
The following event is observed in vCenter Server at the Cluster level:
"The virtual machine was restarted automatically by vSphere HA on this host. This response may be triggered by a failure of the host on which the virtual machine was originally running, or by an unclean power-off of the virtual machine (e.g., if the VMX process was killed)"
Additionally, memory utilization for the NVIDIA L4 GPU is reported as 0. Utilization is reported correctly again after the ESXi host is rebooted.
The affected virtual machine's vmware.log file displays the following errors indicating a GPU System Processor (GSP) plugin task crash and RPC timeouts:
Location: /vmfs/volumes/datastore/vm folder
YYYY-MM-DDT0HH:MM:SSvthread-3459696 - vmiop log: (0x0): GSP plugin task crashed. VM shutdown is required.
YYYY-MM-DDT0HH:MM:SS (05) vmx 311e3bed-da-971a vigor Reset: Attaching to reset.
YYYY-MM-DDT0HH:MM:SS (05) vcpu-0 VMIOP: informing the plugin vmiop-display of checkpoint state change: 2
YYYY-MM-DDT0HH:MM:SS (02) vcpu-0 vmiop_log: (0x0): Timed out, GSP has not started processing message 14
YYYY-MM-DDT0HH:MM:SS (02) vcpu-0 vmiop_log: (0x0): CPU RPC 14 fw response failed: 0x7
YYYY-MM-DDT0HH:MM:SS (02) vcpu-0 vmiop_log: (0x0): Migration Buff Reset RPC failed: 0x7
YYYY-MM-DDT0HH:MM:SS (02) vcpu-0-da-971a vmiop_log: (0x0): stop work failed
The ESXi host's vmkernel.log displays memory corruption and sequence errors related to the NVIDIA Resource Manager (NVRM):
Location: /var/run/log/vmkernel.log
YYYY-MM-DDT0HH:MM:SS In(182) vmkernel: cpu77:2100363)NVRM: _issueRpcAndWait: rpcRecvPoll failed with status 0x00000025 for fn 76 sequence 45527850!
YYYY-MM-DDT0HH:MM:SS In(182) vmkernel: cpu93:2097456)NVRM: GspMsgQueueReceiveStatus: Bad sequence number. Expected 47333697 got 47333698. Possible memory corruption.
YYYY-MM-DDT0HH:MM:SS In(182) vmkernel: cpu93:2097456)NVRM: GspMsgQueueReceiveStatus: Bad sequence number. Expected 47333697 got 47333698. Possible memory corruption.
YYYY-MM-DDT0HH:MM:SS In(182) vmkernel: cpu93:2097456)NVRM: GspMsgQueueReceiveStatus: Bad sequence number. Expected 47333697 got 47333698. Possible memory corruption.
YYYY-MM-DDT0HH:MM:SS In(182) vmkernel: cpu93:2097456)NVRM: GspMsgQueueReceiveStatus: Read failed after 3 retries.
YYYY-MM-DDT0HH:MM:SS vmkernel: cpu93:2097456)NVRM: nvAssertOkFailedNoLog: Assertion failed: Invalid data passed [NV_ERR_INVALID_DATA] (0x00000025) returned from _kgspRpcDrainEvents(pGpu, pKernelGsp, NV_VGPU_MSG_FUNCTION_NUM_FUNCTIONS, 0, KGSP_RPC_EVENT_HANDLE$
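To check whether a collected copy of vmware.log or vmkernel.log contains the messages shown above, a small script along the following lines may help. This is a minimal sketch and not a VMware or NVIDIA tool: the function names scan_lines and sequence_mismatches and the dictionary keys are hypothetical, and the search strings are simply substrings copied from the excerpts in this article. Other driver versions may word these messages differently.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Substrings copied from the log excerpts in this article.
# The dictionary keys are illustrative labels, not official error names.
GSP_SIGNATURES: Dict[str, str] = {
    "gsp_plugin_crash": "GSP plugin task crashed",
    "gsp_rpc_timeout": "GSP has not started processing message",
    "rpc_recv_poll_failed": "rpcRecvPoll failed with status",
    "bad_sequence_number": "Bad sequence number",
    "read_failed_after_retries": "Read failed after 3 retries",
}

# Matches the "Expected <n> got <m>" part of the NVRM sequence error.
SEQUENCE_RE = re.compile(r"Expected (\d+) got (\d+)")


def scan_lines(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group log lines by the signature substring they contain."""
    hits: Dict[str, List[str]] = {name: [] for name in GSP_SIGNATURES}
    for line in lines:
        for name, needle in GSP_SIGNATURES.items():
            if needle in line:
                hits[name].append(line.rstrip("\n"))
    return hits


def sequence_mismatches(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (expected, got) pairs from 'Bad sequence number' lines."""
    pairs: List[Tuple[int, int]] = []
    for line in lines:
        match = SEQUENCE_RE.search(line)
        if match and "Bad sequence number" in line:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs


def scan_file(path: str) -> Dict[str, List[str]]:
    """Scan a log file copied off the host (path is supplied by the caller)."""
    # errors="replace" keeps the scan going if the log has stray bytes.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return scan_lines(handle)
```

For example, scan_file could be run against a copy of /var/run/log/vmkernel.log taken from the host. Any non-empty list in the result means the corresponding message was found. The script only matches text and does not confirm the root cause on its own.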
Environment
VMware ESXi 8.0.3
NVIDIA L4 GPU
Cause
This issue occurs due to a fatal error within the NVIDIA GPU System Processor (GSP) plugin.
The ESXi host's vmkernel.log indicates potential memory corruption at the hardware or driver level, which causes the GSP plugin task to crash. Because the virtual machine's VMX process relies on this plugin for GPU operations, the task failure forces an unclean power-off of the virtual machine.
vSphere HA detects this abrupt shutdown and automatically restarts the virtual machine to restore service.