Windows or Linux virtual machines running on AMD Zen2/Zen3 CPUs may panic and shut down due to a double fault exception.
On Linux systems, the output of dmesg in the virtual machine will contain a backtrace similar to the following:
[XXXXXX.XXXXXX] PANIC: double fault, error_code: 0x0
[XXXXXX.XXXXXX] Call Trace:
[XXXXXX.XXXXXX] <#DF>
[XXXXXX.XXXXXX] ? df_debug+0x1d/0x36
[XXXXXX.XXXXXX] ? do_double_fault+0xe5/0x180
[XXXXXX.XXXXXX] ? double_fault+0x1e/0x30
[XXXXXX.XXXXXX] ? acpi_processor_thermal_init.cold.6+0x66/0x66
[XXXXXX.XXXXXX] ? native_safe_halt+0xe/0x20
[XXXXXX.XXXXXX] </#DF>
[XXXXXX.XXXXXX] acpi_idle_do_entry+0x93/0xa0
[XXXXXX.XXXXXX] acpi_idle_enter+0x5f/0xd0
[XXXXXX.XXXXXX] cpuidle_enter_state+0x86/0x470
[XXXXXX.XXXXXX] cpuidle_enter+0x2c/0x40
[XXXXXX.XXXXXX] do_idle+0x26f/0x2d0
[XXXXXX.XXXXXX] cpu_startup_entry+0x6f/0x80
[XXXXXX.XXXXXX] start_secondary+0x187/0x1d0
[XXXXXX.XXXXXX] secondary_startup_64_no_verify+0xd1/0xdb
Another example:
[XXXXXX.XXXXXX] traps: PANIC: double fault, error_code: 0x0
[XXXXXX.XXXXXX] double fault: 0000 [#1] PREEMPT SMP NOPTI
[XXXXXX.XXXXXX] CPU: XX PID: 0 Comm: swapper/12 Kdump: loaded Not tainted 6.4.0-150600.23.38-default #1
[XXXXXX.XXXXXX] Hardware name: VMware, Inc. VMware7/440BX Desktop Reference Platform, BIOS VMW7
[XXXXXX.XXXXXX] RIP: 0010:error_entry+0x1a/0x150
...
[XXXXXX.XXXXXX] Call Trace:
[XXXXXX.XXXXXX] <#DF>
[XXXXXX.XXXXXX] ? __die_body+0x1a/0x60
[XXXXXX.XXXXXX] ? die+0x38/0x60
[XXXXXX.XXXXXX] ? exc_double_fault+0x175/0x190
[XXXXXX.XXXXXX] ? asm_exc_double_fault+0x1f/0x30
[XXXXXX.XXXXXX] ? early_xen_iret_patch+0xc/0xc
[XXXXXX.XXXXXX] ? asm_exc_page_fault+0x9/0x30
[XXXXXX.XXXXXX] ? error_entry+0x1a/0x150
[XXXXXX.XXXXXX] </#DF>
To check whether a crash matches this issue, search the saved dmesg logs with the following command:
grep "double fault" /var/crash/*dmesg* -A20
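When several crash dumps have accumulated, it can help to first list which saved dmesg files contain the signature at all. The following is a minimal sketch, not a vendor tool: the function name find_double_fault_logs is illustrative, and searching subdirectories of /var/crash is an assumption about how the guest stores its crash dumps.

```shell
# Sketch: list saved dmesg files that contain the double fault signature.
# find_double_fault_logs is a hypothetical helper name, not part of any product.
find_double_fault_logs() {
    # Directory to search; /var/crash is the location used in the command above.
    crashdir="${1:-/var/crash}"

    # Look at every file whose name contains "dmesg", including files in
    # per-crash subdirectories (an assumption about the dump layout).
    # grep -l prints only the names of files that contain the pattern.
    find "$crashdir" -type f -name '*dmesg*' \
        -exec grep -l "double fault" {} + 2>/dev/null
}
```

Calling find_double_fault_logs with no argument searches /var/crash and prints one path per matching file. Each reported file can then be inspected with the grep -A20 command shown above to compare its call trace with the examples.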
On Windows systems, the stack trace for the dump file will be similar to the following:
nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiDoubleFaultAbort+0x2bd
hal!HalProcessorIdle+0xf
nt!PpmIdleDefaultExecute+0x1b
nt!PpmIdleExecuteTransition+0x6bc
nt!PoIdle+0x33f
nt!KiIdleLoop+0x2c
Windows or Linux virtual machines running on ESXi 7.0 U3 or later releases on hosts with AMD Zen2/Zen3 CPUs.
The cause of this is currently unknown and under investigation.
This is a known issue and currently there is no resolution.
To work around the issue, reboot the VM to recover.
Broadcom Engineering and AMD are actively investigating to identify a workaround and/or fix.
If the issue can be reproduced, document the steps and workload, and enable VM debug logging: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/8-0/vsphere-virtual-machine-administration-guide-8-0/configuring-virtual-machine-optionsvsphere-vm-admin/configuring-virtual-machine-advanced-optionsvsphere-vm-admin/configure-debugging-and-statisticsvsphere-vm-admin.html
Select "Record Debugging Information" from the drop-down menu. Note that this can have a performance impact on the VM.
/var/crash directory of the VM.
uname -a > uname-<vmname>.txt
sysctl kernel.kptr_restrict (Remember the value X, most likely X=2)
sysctl -w kernel.kptr_restrict=0
cat /proc/kallsyms > kallsyms-<vmname>.txt
cat /proc/iomem > iomem-<vmname>.txt
sysctl -w kernel.kptr_restrict=X (Restore the original value from above; if X=2, the command is "sysctl -w kernel.kptr_restrict=2")
C:\Windows\Memory.dmp
C:\Windows\Minidump\
procexp64.exe as Administrator
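For Linux guests, the uname, kallsyms and iomem commands listed above can be wrapped in a small function so that the original kernel.kptr_restrict value is recorded and restored without having to remember it. This is a sketch under stated assumptions rather than a supported tool: the function name collect_guest_info and its two arguments are illustrative, and it must be run as root inside the guest.

```shell
# Sketch: collect the Linux guest details listed above (run as root in the VM).
# collect_guest_info and its arguments are hypothetical names for illustration.
collect_guest_info() {
    vmname="$1"          # label used in the output file names
    outdir="${2:-.}"     # directory that receives the output files

    uname -a > "$outdir/uname-$vmname.txt"

    # Record the current setting so it can be restored afterwards.
    orig=$(sysctl -n kernel.kptr_restrict) || return 1

    # Expose real kernel addresses while the symbol table is copied.
    sysctl -w kernel.kptr_restrict=0 > /dev/null || return 1
    cat /proc/kallsyms > "$outdir/kallsyms-$vmname.txt"
    cat /proc/iomem > "$outdir/iomem-$vmname.txt"

    # Put the original value back, even if one of the copies failed.
    sysctl -w "kernel.kptr_restrict=$orig" > /dev/null
}
```

A call such as collect_guest_info myvm /tmp writes the three files to /tmp, where myvm stands in for the VM name. Restoring from the recorded value, rather than typing it back by hand, avoids leaving kernel addresses exposed if the original setting is forgotten.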