Windows or Linux virtual machines running on AMD Zen2/Zen3 CPUs crash due to a double fault

Products

VMware vSphere ESXi

Issue/Introduction

Issue:

Windows or Linux virtual machines running on AMD Zen2/Zen3 CPUs may panic and shut down due to a double fault exception.

Symptoms:

On Linux systems, the output of dmesg in the virtual machine will contain a backtrace similar to the following:

[XXXXXX.XXXXXX] PANIC: double fault, error_code: 0x0
[XXXXXX.XXXXXX] Call Trace:
[XXXXXX.XXXXXX] <#DF>
[XXXXXX.XXXXXX] ? df_debug+0x1d/0x36
[XXXXXX.XXXXXX] ? do_double_fault+0xe5/0x180
[XXXXXX.XXXXXX] ? double_fault+0x1e/0x30
[XXXXXX.XXXXXX] ? acpi_processor_thermal_init.cold.6+0x66/0x66
[XXXXXX.XXXXXX] ? native_safe_halt+0xe/0x20
[XXXXXX.XXXXXX] </#DF>
[XXXXXX.XXXXXX] acpi_idle_do_entry+0x93/0xa0
[XXXXXX.XXXXXX] acpi_idle_enter+0x5f/0xd0
[XXXXXX.XXXXXX] cpuidle_enter_state+0x86/0x470
[XXXXXX.XXXXXX] cpuidle_enter+0x2c/0x40
[XXXXXX.XXXXXX] do_idle+0x26f/0x2d0
[XXXXXX.XXXXXX] cpu_startup_entry+0x6f/0x80
[XXXXXX.XXXXXX] start_secondary+0x187/0x1d0
[XXXXXX.XXXXXX] secondary_startup_64_no_verify+0xd1/0xdb

Another example:

[XXXXXX.XXXXXX] traps: PANIC: double fault, error_code: 0x0
[XXXXXX.XXXXXX] double fault: 0000 [#1] PREEMPT SMP NOPTI
[XXXXXX.XXXXXX] CPU: XX PID: 0 Comm: swapper/12 Kdump: loaded Not tainted 6.4.0-150600.23.38-default #1
[XXXXXX.XXXXXX] Hardware name: VMware, Inc. VMware7/440BX Desktop Reference Platform, BIOS VMW7
[XXXXXX.XXXXXX] RIP: 0010:error_entry+0x1a/0x150

...
[XXXXXX.XXXXXX] Call Trace:
[XXXXXX.XXXXXX] <#DF>
[XXXXXX.XXXXXX] ? __die_body+0x1a/0x60
[XXXXXX.XXXXXX] ? die+0x38/0x60
[XXXXXX.XXXXXX] ? exc_double_fault+0x175/0x190
[XXXXXX.XXXXXX] ? asm_exc_double_fault+0x1f/0x30
[XXXXXX.XXXXXX] ? early_xen_iret_patch+0xc/0xc
[XXXXXX.XXXXXX] ? asm_exc_page_fault+0x9/0x30
[XXXXXX.XXXXXX] ? error_entry+0x1a/0x150
[XXXXXX.XXXXXX] </#DF>

To determine whether a crash matches this issue, search the saved dmesg logs with the following command:

grep "double fault" /var/crash/*dmesg* -A20
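On some distributions, kdump writes each crash into its own subdirectory under /var/crash, in which case the glob above may not match anything. The following is a minimal sketch, not an official procedure, that searches subdirectories as well. It assumes the saved logs have "dmesg" in their file name, as the command above does; the function name find_double_fault_logs is illustrative only.

```shell
#!/bin/sh
# Sketch: list saved dmesg files that contain a double fault panic.
# Assumption: crash logs live somewhere under the given directory
# (default /var/crash) and have "dmesg" in their file name.
find_double_fault_logs() {
    crash_dir="${1:-/var/crash}"
    # find descends into per-crash subdirectories; grep -l prints only
    # the names of files that contain the search string.
    find "$crash_dir" -type f -name '*dmesg*' \
        -exec grep -l 'double fault' {} + 2>/dev/null
}
```

Each file it lists can then be inspected with the grep command above to view the full backtrace.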

On Windows systems, the stack trace for the dump file will be similar to the following:

nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiDoubleFaultAbort+0x2bd
hal!HalProcessorIdle+0xf
nt!PpmIdleDefaultExecute+0x1b
nt!PpmIdleExecuteTransition+0x6bc
nt!PoIdle+0x33f
nt!KiIdleLoop+0x2c

Environment

Windows or Linux virtual machines running on ESXi 7.0 U3 or later releases with AMD Zen2/Zen3 CPUs.

Cause

The cause of this is currently unknown and under investigation.

Resolution

This is a known issue and currently there is no resolution.

To work around the issue, reboot the VM to recover.

Broadcom Engineering and AMD are actively investigating to identify a workaround and/or fix.

If the issue can be reproduced, please document the steps and workload, and enable VM debug logging as described here: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/8-0/vsphere-virtual-machine-administration-guide-8-0/configuring-virtual-machine-optionsvsphere-vm-admin/configuring-virtual-machine-advanced-optionsvsphere-vm-admin/configure-debugging-and-statisticsvsphere-vm-admin.html

Select "Record Debugging Information" from the drop-down menu. Note that this can have a performance impact on VMs.

Additional Information

If you encounter this issue, please collect the diagnostic data outlined below and provide it to us.

For Linux systems:

1. Collect all of the data in the /var/crash directory of the VM.

2. Locate the ESXi host running the VM which crashed.

3. Locate all VMs which were running on this host at the time of the crash and have not been rebooted since then. You should also include VMs which were vMotioned after the crash and are still running on other hosts.

4. For each VM from Step 3, including the original VM which crashed, run the following commands (replace <vmname> accordingly) and save the output:

uname -a > uname-<vmname>.txt

sysctl kernel.kptr_restrict (Note the value X; most likely X=2)

sysctl -w kernel.kptr_restrict=0

cat /proc/kallsyms > kallsyms-<vmname>.txt

cat /proc/iomem > iomem-<vmname>.txt

sysctl -w kernel.kptr_restrict=X  (Restore the original value noted above; for example, if X=2, the command is "sysctl -w kernel.kptr_restrict=2")

5. Collect a log bundle from the host where the VM was running, as well as the output of the above commands for each VM identified in Step 3.
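The Step 4 commands can be wrapped in a small script so that the original kernel.kptr_restrict value is captured and restored automatically. This is a sketch under stated assumptions rather than a supported tool: the function name collect_vm_data, its arguments, and the PROC_DIR variable (which exists only so the function can be exercised against test files) are illustrative. It must be run as root.

```shell
#!/bin/sh
# Sketch of Step 4: collect kernel symbol and memory map data for one VM.
# Usage (as root): collect_vm_data <vmname> [output_dir]
# PROC_DIR is a hypothetical override for testing; it defaults to /proc.
collect_vm_data() {
    vmname="$1"
    outdir="${2:-.}"
    proc_dir="${PROC_DIR:-/proc}"

    uname -a > "$outdir/uname-$vmname.txt"

    # Remember the current setting so it can be restored afterwards.
    orig=$(sysctl -n kernel.kptr_restrict)

    # Expose real kernel addresses while the data is captured.
    sysctl -w kernel.kptr_restrict=0 > /dev/null
    cat "$proc_dir/kallsyms" > "$outdir/kallsyms-$vmname.txt"
    cat "$proc_dir/iomem" > "$outdir/iomem-$vmname.txt"

    # Restore the original value noted above.
    sysctl -w kernel.kptr_restrict="$orig" > /dev/null
}
```

For example, running "collect_vm_data myvm /tmp" as root would write uname-myvm.txt, kallsyms-myvm.txt and iomem-myvm.txt to /tmp, where "myvm" is a placeholder for the VM name.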

For Windows systems:

1. Collect any memory dumps from the VM, typically located in the following locations:

C:\Windows\Memory.dmp
C:\Windows\Minidump\

2. Locate all VMs which were running on the same host as the VM at the time of the crash and have not been rebooted since then. You should also include VMs which were vMotioned after the crash and are still running on other hosts. Follow the procedure below for each of these VMs.

3. Download Process Explorer from Microsoft Sysinternals: https://learn.microsoft.com/en-us/sysinternals/downloads/process-explorer

4. Open procexp64.exe as Administrator

5. Adjust the view by clicking on View -> Show lower pane (or Ctrl+L), then View -> Lower pane view -> DLLs (or Ctrl+D), then click on System.

6. Show Base address by clicking on View -> Select Columns -> DLL -> Base Address.

7. Save current view: File -> Save (or Ctrl+S) to <VM_Name>_System.txt.

8. Collect a log bundle from the host, the above file(s) created with Process Explorer, and the dump files.