Manual Crash Triggered by NMI on Virtual machine hosted on vCenter/ESXi

Article ID: 389912

Products

VMware vSphere ESX 8.x

VMware vSphere ESXi 8.0

Issue/Introduction

A manual crash was triggered in the guest operating system of a virtual machine by a Non-Maskable Interrupt (NMI). Dump analysis from the OS vendor (Microsoft) reports bugcheck 0x80 (NMI_HARDWARE_FAILURE):

0: kd> !di
Kernel Version      : Windows 10 Kernel Version 14393 MP (6 procs) Free x64
Product             : Server, suite: TerminalServer SingleUserTS
Edition build lab   : 14393.7693.amd64fre.rs1_release.241212-1815
System Manufacturer : VMware, Inc.
System Product Name : VMware Virtual Platform
Processor           : Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz
Bugcheck Info       : NMI Bugcheck 80
Dump Type           : Kernel Summary (Kernel address space is available - user address space may not be available)
Build Revision      : 14393.7699

NMI_HARDWARE_FAILURE (80)
This is typically due to a hardware malfunction.  The hardware supplier should
be called.
Arguments:
Arg1: 00000000004f4454, 'TDO'
Arg2: 0000000000000000, Status Byte
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
VIRTUAL_MACHINE:  VMware
FAULTING_THREAD:  fffff8021b1c5980
PROCESS_NAME:  System
STACK_TEXT: 
fffff802`1d02fc08 fffff802`1b65aa4e     : 00000000`00000080 00000000`004f4454 00000000`00000000 00000000`00000000 : nt!KeBugCheckEx [d:\rs1\minkernel\ntos\ke\amd64\procstat.asm @ 127]
fffff802`1d02fc10 fffff802`1b034920     : ffff810c`6aa82968 fffff802`1b672e30 fffff802`1b672e30 00000000`00000001 : hal!HalBugCheckSystem+0x7e [d:\rs1\minkernel\hals\lib\whea\mca.c @ 3171]
fffff802`1d02fc50 fffff802`1b65ba1e     : fffff802`000006c0 fffff802`1b1c3630 fffff802`1d02fd30 fffff802`1aec41b8 : nt!WheaReportHwError+0x258 [d:\rs1\minkernel\ntos\whea\whea.c @ 810]
fffff802`1d02fcb0 fffff802`1aec3ad6     : 00000000`00000001 00000000`00000000 000000a8`1f6b2685 fffff802`1aec3f88 : hal!HalHandleNMI+0xfe [d:\rs1\minkernel\hals\lib\whea\nmi.c @ 396]
fffff802`1d02fce0 fffff802`1af6e242     : 00000000`00000001 fffff802`1d02fef0 00000000`00000000 00000000`00000000 : nt!KiProcessNMI+0x106 [d:\rs1\minkernel\ntos\ke\amd64\misc.c @ 298]
fffff802`1d02fd30 fffff802`1af6e043     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KxNmiInterrupt+0x82 [d:\rs1\minkernel\ntos\ke\amd64\trap.asm @ 697]
fffff802`1d02fe70 fffff802`1b011cd9     : 00000000`00000046 ffff810c`6ac460f0 00000000`00000000 00000000`00000000 : nt!KiNmiInterrupt+0x1c3 [d:\rs1\minkernel\ntos\ke\amd64\trap.asm @ 657]
fffff802`1d01c200 fffff802`1ae66c8e     : fffff802`1b011cf0 00000000`00000000 ffff810c`6ac46010 00000000`000000ae : nt!PpmIdleGuestExecute+0x15 [d:\rs1\minkernel\ntos\po\ppmhv.c @ 697]
fffff802`1d01c240 fffff802`1ae65e1a     : fffff804`0fa65f40 012d92f8`012d92f8 00000000`00000000 00000000`00000000 : nt!PpmIdleExecuteTransition+0xcbe [d:\rs1\minkernel\ntos\po\ppmidle.c @ 3925]
fffff802`1d01c4c0 fffff802`1af6611c     : 00000000`00000000 fffff802`1b149180 fffff802`1b1c5980 ffff810c`79728040 : nt!PoIdle+0x33a [d:\rs1\minkernel\ntos\po\ppmidle.c @ 1046]
fffff802`1d01c620 00000000`00000000     : fffff802`1d01d000 fffff802`1d016000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x2c [d:\rs1\minkernel\ntos\ke\amd64\idle.asm @ 110]
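The 'TDO' shown next to Arg1 is the ASCII rendering of the argument's bytes in memory (little-endian) order. The minimal Python sketch below reproduces that rendering so the value can be matched against other logs. The function name decode_bugcheck_tag is a hypothetical helper introduced for illustration, not part of any debugger or VMware tool.

```python
def decode_bugcheck_tag(arg: int) -> str:
    """Render a 64-bit bugcheck argument as little-endian ASCII text.

    Hypothetical helper: trailing zero bytes are dropped so that only
    the printable tag remains.
    """
    raw = arg.to_bytes(8, "little").rstrip(b"\x00")
    return raw.decode("ascii")


# Arg1 from the dump above: 00000000004f4454
print(decode_bugcheck_tag(0x00000000004F4454))
```

For the Arg1 value in this dump, the bytes 0x54, 0x44, 0x4f decode to the same three characters the debugger prints.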

 

Environment

VMware vSphere ESX 8.x

VMware vSphere ESX 7.x

Cause

In this scenario, a virtual machine (VM) is found in a suspended state, and the corresponding task does not provide enough information to identify who initiated it. Users may state that they did not start any suspend task, which suggests that the suspension was not the result of user intervention.

The affected VM did not enter the suspended state on its own. It was suspended, and its guest was crashed via NMI, as part of a manual intervention intended to address performance issues.

The following snippets from hostd.log and vmware.log show the request that led to the crash, followed by the guest bugcheck and the VM suspend.

/var/run/log/hostd.log

2025-02-18T10:45:37.090Z info hostd[2162299] [Originator@6876 sub=Vimsvc.CgiServiceManager opID=CGI-server-8b44] Validating CGI ticket for URL '/cgi-bin/vm-support.cgi?listmanifests=1'
2025-02-18T10:45:51.170Z info hostd[2162204] [Originator@6876 sub=Vimsvc.CgiServiceManager opID=CGI-server-8b5b] Validating CGI ticket for URL '/cgi-bin/vm-support.cgi?manifests=VirtualMachines:CoreDumpHung HungVM:Send_NMI_To_Guest HungVM:Suspend_VM'
2025-02-18T11:11:45.908Z info hostd[2162334] [Originator@6876 sub=Vimsvc.CgiServiceManager opID=CGI-server-9088] Validating CGI ticket for URL '/cgi-bin/vm-support.cgi?manifests=VirtualMachines:CoreDumpHung'
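The second hostd.log entry above is the key one: its manifests parameter lists HungVM:Send_NMI_To_Guest and HungVM:Suspend_VM. As a rough sketch, assuming hostd.log lines follow the format shown above, the Python below pulls the HungVM entries out of such a line. The function name hungvm_actions and the regular expression are illustrative assumptions, not part of any VMware tooling.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the quoted URL in a "Validating CGI ticket" hostd.log entry.
URL_RE = re.compile(r"Validating CGI ticket for URL '([^']*)'")


def hungvm_actions(log_line):
    """Return the HungVM manifest names requested in one hostd.log line."""
    match = URL_RE.search(log_line)
    if not match:
        return []
    query = parse_qs(urlsplit(match.group(1)).query)
    actions = []
    for value in query.get("manifests", []):
        # Manifests are space-separated "Group:Name" pairs.
        for manifest in value.split():
            group, _, name = manifest.partition(":")
            if group == "HungVM":
                actions.append(name)
    return actions


line = (
    "2025-02-18T10:45:51.170Z info hostd[2162204] [Originator@6876 "
    "sub=Vimsvc.CgiServiceManager opID=CGI-server-8b5b] Validating CGI ticket "
    "for URL '/cgi-bin/vm-support.cgi?manifests=VirtualMachines:CoreDumpHung "
    "HungVM:Send_NMI_To_Guest HungVM:Suspend_VM'"
)
print(hungvm_actions(line))
```

An empty result for a line means that request did not include any HungVM manifest, as with the first and third entries above.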

/vmfs/volumes/<datastore>/<VMName>/vmware.log

2025-02-18T10:46:51.499Z Wa(03) vcpu-0 - WinBSOD: Synthetic MSR[0x40000100] 0x80
2025-02-18T10:46:51.499Z Wa(03)+ vcpu-0 -
2025-02-18T10:46:51.499Z Wa(03) vcpu-0 - WinBSOD: Synthetic MSR[0x40000101] 0x4f4454
2025-02-18T10:46:51.499Z Wa(03)+ vcpu-0 -
2025-02-18T10:46:51.499Z Wa(03) vcpu-0 - WinBSOD: Synthetic MSR[0x40000102] 0x0
2025-02-18T10:46:51.499Z Wa(03)+ vcpu-0 -
2025-02-18T10:46:51.499Z Wa(03) vcpu-0 - WinBSOD: Synthetic MSR[0x40000103] 0x0
2025-02-18T10:46:51.499Z Wa(03)+ vcpu-0 -
2025-02-18T10:46:51.499Z Wa(03) vcpu-0 - WinBSOD: Synthetic MSR[0x40000104] 0x0
2025-02-18T10:46:51.499Z Wa(03)+ vcpu-0 -


2025-02-18T10:46:51.521Z In(05) vmx - SUSPEND: Start suspend (flags=0)
2025-02-18T10:46:51.535Z In(05) vcpu-0 - Progress -1% (msg.checkpoint.saveStatus)
2025-02-18T10:46:51.535Z In(05) vcpu-0 - Checkpointed in VMware ESX, 7.0.3, build-22348816, Linux Host
2025-02-18T10:46:51.679Z In(05) vcpu-0 - Progress 0% (none)
2025-02-18T10:46:51.679Z In(05) vcpu-0 - MainMem: Writing full memory image, '/vmfs/volumes/XXXXXXXX-XXXXXXXX-XXX-XXXXXXXXXX/VMName/VMName-XXXXXXXX.vmem'.
2025-02-18T10:46:53.011Z In(05) vcpu-0 - Progress 1% (none)
2025-02-18T10:46:54.394Z In(05) vcpu-0 - Progress 2% (none)
2025-02-18T10:46:55.768Z In(05) vcpu-0 - Progress 3% (none)
2025-02-18T10:46:57.143Z In(05) vcpu-0 - Progress 4% (none)
2025-02-18T10:49:14.637Z In(05) vcpu-0 - SUSPEND: Completed suspend: 'Operation completed successfully' (0)

2025-02-18T10:49:14.637Z In(05) vmx - Stopping VCPU threads...
2025-02-18T10:49:14.637Z In(05) vcpu-0 - VMMon_WaitForExit: vcpu-0: worldID=2103623
2025-02-18T10:49:14.637Z In(05) vcpu-1 - VMMon_WaitForExit: vcpu-1: worldID=2103627
2025-02-18T10:49:14.637Z In(05) vcpu-3 - VMMon_WaitForExit: vcpu-3: worldID=2103629
2025-02-18T10:49:14.637Z In(05) vcpu-2 - VMMon_WaitForExit: vcpu-2: worldID=2103628
2025-02-18T10:49:14.637Z In(05) vcpu-4 - VMMon_WaitForExit: vcpu-4: worldID=2103630
2025-02-18T10:49:14.637Z In(05) vmx - MonitorVMMCoreRequest: ********************************************
2025-02-18T10:49:14.637Z In(05) vmx - MonitorVMMCoreRequest: Sync core dump requested; not a real fault
2025-02-18T10:49:14.637Z In(05) vmx - MonitorVMMCoreRequest: ********************************************
2025-02-18T10:49:14.637Z In(05) vcpu-5 - VMMon_WaitForExit: vcpu-5: worldID=2103631
2025-02-18T10:49:14.637Z Wa(03) vmx - Dumping vmx core per request
2025-02-18T10:49:15.658Z Wa(03) vmx - A core file is available in "/vmfs/volumes/XXXXXXXX-XXXXXXXX-XXX-XXXXXXXXXX/VMName-XXXXXXX/vmx-zdump.000"
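The WinBSOD lines above carry the same values as the guest dump: MSR 0x40000100 holds 0x80 (the bugcheck code) and MSR 0x40000101 holds 0x4f4454 (Arg1). In the Hyper-V guest interface these five registers are the guest crash parameter registers (P0 to P4). The Python sketch below extracts them from vmware.log lines, assuming the line format shown above. The name parse_guest_bugcheck is hypothetical, and reading P0 as the bugcheck code is an assumption that matches the values in this article.

```python
import re

# Matches "WinBSOD: Synthetic MSR[0x40000100] 0x80" style entries.
MSR_RE = re.compile(r"WinBSOD: Synthetic MSR\[(0x[0-9a-fA-F]+)\] (0x[0-9a-fA-F]+)")
CRASH_P0 = 0x40000100  # first of the five crash parameter registers


def parse_guest_bugcheck(lines):
    """Return (bugcheck_code, [arg1..arg4]) from vmware.log lines, or None."""
    regs = {}
    for line in lines:
        match = MSR_RE.search(line)
        if match:
            index = int(match.group(1), 16) - CRASH_P0
            regs[index] = int(match.group(2), 16)
    if 0 not in regs:
        return None
    # Arguments that were not logged are reported as None rather than guessed.
    return regs[0], [regs.get(i) for i in range(1, 5)]


sample = [
    "2025-02-18T10:46:51.499Z Wa(03) vcpu-0 - WinBSOD: Synthetic MSR[0x40000100] 0x80",
    "2025-02-18T10:46:51.499Z Wa(03)+ vcpu-0 -",
    "2025-02-18T10:46:51.499Z Wa(03) vcpu-0 - WinBSOD: Synthetic MSR[0x40000101] 0x4f4454",
    "2025-02-18T10:46:51.499Z Wa(03) vcpu-0 - WinBSOD: Synthetic MSR[0x40000102] 0x0",
    "2025-02-18T10:46:51.499Z Wa(03) vcpu-0 - WinBSOD: Synthetic MSR[0x40000103] 0x0",
    "2025-02-18T10:46:51.499Z Wa(03) vcpu-0 - WinBSOD: Synthetic MSR[0x40000104] 0x0",
]
code, args = parse_guest_bugcheck(sample)
print(hex(code), [hex(a) for a in args])
```

Comparing this result with the bugcheck code and arguments in the guest dump confirms that both describe the same crash event.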

 

 

Resolution

A user initiated a log bundle generation with the HungVM option selected, which sent an NMI to the guest and caused the VM to enter a suspended state. The traces indicate that there are no underlying issues with either ESXi or vCenter. The host reboots were initiated by user intervention, indicating intentional restarts rather than spontaneous failures.
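The timestamps in the log excerpts under Cause are consistent with this sequence. The short Python sketch below lines up the four relevant entries and prints each one's offset from the vm-support request. The event labels and the helper names parse_ts and offsets_from_first are illustrative assumptions; the timestamps are copied from the excerpts.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse the UTC timestamp format used by hostd.log and vmware.log."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


events = [
    ("hostd.log: vm-support request with HungVM manifests", "2025-02-18T10:45:51.170Z"),
    ("vmware.log: WinBSOD, guest bugcheck 0x80", "2025-02-18T10:46:51.499Z"),
    ("vmware.log: SUSPEND start", "2025-02-18T10:46:51.521Z"),
    ("vmware.log: SUSPEND completed", "2025-02-18T10:49:14.637Z"),
]


def offsets_from_first(entries):
    """Return (label, seconds since the first entry) for each entry."""
    start = parse_ts(entries[0][1])
    return [(label, (parse_ts(ts) - start).total_seconds()) for label, ts in entries]


for label, seconds in offsets_from_first(events):
    print(f"+{seconds:8.3f}s  {label}")
```

In these excerpts the guest bugcheck and the suspend both follow the HungVM request, and none of the entries shown precedes it.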

Additional Information