The vmware.log file shows that restoration of the vGPU state starts after the vMotion, and a TDR (Timeout Detection and Recovery) occurs a few seconds after the restoration completes:
YYYY-MM-DDTHH:MM:SS.Z In(05) vthread-2163452 - vmiop_log: (0x0): Start restoring vGPU state ...
YYYY-MM-DDTHH:MM:SS.Z In(05) vcpu-0 - vmiop_log: (0x0): Finished restoring vGPU state.
YYYY-MM-DDTHH:MM:SS.Z Er(02) vthread-2163451 - vmiop_log: (0x0): Timeout occurred, reset initiated.
YYYY-MM-DDTHH:MM:SS.Z Er(02) vthread-2163451 - vmiop_log: (0x0): TDR_DUMP:0x52445456 0x006907e8 0x000001cc 0x00000001
YYYY-MM-DDTHH:MM:SS.Z Er(02) vthread-2163451 - vmiop_log: (0x0): TDR_DUMP:0x00989680 0x00000000 0x000001bb 0x0000000f
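The pattern above can be searched for in a collected vmware.log with a short script. The following is a minimal sketch, not a supported tool. It assumes that the timestamps are ISO 8601 values ending in `Z` (the excerpt above shows only placeholders) and that the message texts match the excerpt exactly. The function name `find_tdr_after_restore` and the 30-second window are hypothetical choices for illustration.

```python
import re
from datetime import datetime

# Matches vmiop_log entries shaped like the excerpt above:
#   <timestamp> <level>(<nn>) <thread> - vmiop_log: (0x0): <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+\w+\(\d+\)\s+(?P<thread>\S+)\s+-\s+"
    r"vmiop_log:\s+\(0x[0-9a-fA-F]+\):\s+(?P<msg>.*)$"
)

RESTORE_DONE = "Finished restoring vGPU state"
TDR_TIMEOUT = "Timeout occurred, reset initiated"


def parse_ts(text):
    """Parse an ISO 8601 timestamp; the trailing 'Z' form is an assumption."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def find_tdr_after_restore(lines, window_seconds=30.0):
    """Return (restore_time, timeout_time, delta_seconds) tuples for every
    timeout logged within window_seconds after a finished vGPU restore."""
    hits = []
    last_restore = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        try:
            ts = parse_ts(match.group("ts"))
        except ValueError:
            # Skip entries whose timestamp is not in the assumed format.
            continue
        msg = match.group("msg")
        if msg.startswith(RESTORE_DONE):
            last_restore = ts
        elif msg.startswith(TDR_TIMEOUT) and last_restore is not None:
            delta = (ts - last_restore).total_seconds()
            if 0 <= delta <= window_seconds:
                hits.append((last_restore, ts, delta))
    return hits
```

To use it, pass the lines of the file, for example `find_tdr_after_restore(open("vmware.log", encoding="utf-8", errors="replace"))`. An empty result means only that this exact message pair was not found within the window, not that no TDR occurred.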
VMware vSphere ESXi 8.x
TDR (Timeout Detection and Recovery) is a feature in which the Windows OS forcibly resets the graphics driver when the GPU does not respond within a set timeout period. This forced reset can cause the driver and related applications to hang or crash.
For details about TDR, refer to the following Microsoft document:
WDDM Support for Timeout Detection and Recovery (TDR) - Windows drivers | Microsoft Learn
Contact NVIDIA for further investigation of the root cause of this TDR.