vSphere Fault Tolerance protected Linux VMs on HWv20 and later with the VMCI DMA datagrams feature may lock up or panic on ESXi 8.0/8.0U1/8.0U2

Article ID: 313465


Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

On ESXi 8.0/8.0U1/8.0U2, a Linux Virtual Machine may suffer soft lockups or a kernel panic if both of the following are true:

  • The Virtual Machine is protected by vSphere Fault Tolerance (FT).
  • The Virtual Machine is on HWv20 or later, and its Linux kernel has the VMCI DMA datagrams feature.

The kernel panic happens on some supported distros when kernel.softlockup_panic is set to 1.

Logs show messages like:
[  948.008390] watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
...
[  976.011365] watchdog: BUG: soft lockup - CPU#0 stuck for 21s!
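A guest log can be scanned for these messages with a short script. The following is a minimal sketch, assuming the kernel log has already been saved as text (for example from dmesg output); the function name find_soft_lockups and the regular expression are illustrative and cover only the message format shown above.

```python
import re

# Matches lines such as:
# [  948.008390] watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
)


def find_soft_lockups(log_text):
    """Return (timestamp, cpu, stuck_seconds) for each soft lockup line."""
    hits = []
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append(
                (
                    float(match.group("ts")),
                    int(match.group("cpu")),
                    int(match.group("secs")),
                )
            )
    return hits


# The two log lines quoted in this article, used as sample input.
sample = (
    "[  948.008390] watchdog: BUG: soft lockup - CPU#0 stuck for 22s!\n"
    "[  976.011365] watchdog: BUG: soft lockup - CPU#0 stuck for 21s!\n"
)

for ts, cpu, secs in find_soft_lockups(sample):
    print(f"t={ts:.6f}s cpu={cpu} stuck={secs}s")
```

Repeated hits on the same CPU a few tens of seconds apart, as in the sample, are consistent with the busy loop described under Cause.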


To determine the virtual hardware version of a Virtual Machine, select the Virtual Machine in the vSphere Client and click the Summary tab. The Compatibility field displays the virtual hardware version. For instance, "ESXi 8.0 and later (Virtual Machine version 20)" in the field means the Virtual Machine is on HWv20.
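When many Virtual Machines need to be checked, the version number can be pulled out of the Compatibility text programmatically. This is a minimal sketch, assuming the string has been copied or exported from the vSphere Client in the form quoted above; parse_hw_version is a hypothetical helper, and accepting the shorter "VM version" wording is an assumption about how the field may be rendered.

```python
import re


def parse_hw_version(compatibility):
    """Extract the virtual hardware version from a Compatibility string.

    Returns the version as an int, or None if no version is found.
    """
    match = re.search(
        r"\(\s*(?:VM|Virtual Machine) version (\d+)\s*\)",
        compatibility,
        re.IGNORECASE,
    )
    return int(match.group(1)) if match else None


text = "ESXi 8.0 and later (Virtual Machine version 20)"
version = parse_hw_version(text)
print(version, "-> HWv20 or later" if version is not None and version >= 20 else "")
```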

Note:

  • Non-FT Virtual Machines are not impacted.
  • FT Virtual Machines older than HWv20, or without the VMCI DMA datagrams feature, are also not impacted.
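The conditions above combine into a single check. The sketch below only restates the rule from this article; is_impacted and its parameters are hypothetical names, and the caller is assumed to already know whether the guest kernel includes the VMCI DMA datagrams feature.

```python
def is_impacted(ft_protected, hw_version, guest_has_dma_datagrams):
    """Return True only when all three conditions from the article hold."""
    return bool(ft_protected and hw_version >= 20 and guest_has_dma_datagrams)


# FT-protected HWv20 VM whose kernel has the feature: impacted.
print(is_impacted(True, 20, True))
# Non-FT VM: not impacted, regardless of the other two conditions.
print(is_impacted(False, 20, True))
# FT VM older than HWv20: not impacted.
print(is_impacted(True, 19, True))
```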

The VMCI DMA datagrams feature has been supported since upstream Linux 5.18. Linux distros supporting this feature include:


RHEL: 8.7/9.1 and later

Ubuntu: 22.04.1 LTS (kernel 5.15.0-41.44 onwards) and 22.10

SUSE SLES: SLES15 SP3 (kernel 5.3.18-150300.59.93.1 onwards) and SLES15 SP4

Photon: Photon 4.0


Environment

VMware vSphere ESXi 8.0
VMware vSphere ESXi 8.0.1
VMware vSphere ESXi 8.0.2

Cause

The VMCI DMA datagrams feature is incompatible with Fault Tolerance. With Fault Tolerance turned on, the Linux kernel vmw_vmci driver for the VMCI device ends up in a busy loop, causing the Virtual Machine to enter a soft lockup if the VMCI device is used.

Resolution

Currently there is no resolution to the issue. This will be fixed in a future release.

Workaround:

After verifying that the FT Virtual Machine is impacted by checking its virtual hardware version and guest OS version, follow the steps below to reconfigure the FT Virtual Machine:

  1. Power off the FT Virtual Machine.

  2. Add the attribute/value vmci.dmaDatagramSupport = FALSE to the Virtual Machine through the vSphere Client. Refer to Set Advanced Virtual Machine Attributes for information on setting advanced attributes for a Virtual Machine.

  3. Power on the FT Virtual Machine.
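After reconfiguring, it can be useful to confirm that the attribute is actually present. The sketch below is one way to do that, assuming the Virtual Machine's advanced attributes are available as text in key = "value" form (as in a .vmx file); parse_advanced_settings and workaround_applied are hypothetical helpers, and the key is matched exactly as written in step 2.

```python
def parse_advanced_settings(text):
    """Parse lines of the form: key = "value" into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key = value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def workaround_applied(text):
    """Return True if vmci.dmaDatagramSupport is present and set to FALSE."""
    value = parse_advanced_settings(text).get("vmci.dmaDatagramSupport")
    return value is not None and value.upper() == "FALSE"


# Illustrative input only; real configurations contain many more keys.
example = 'virtualHW.version = "20"\nvmci.dmaDatagramSupport = "FALSE"\n'
print(workaround_applied(example))
```

The check is read-only by design; the change itself is still made through the vSphere Client as described in the steps above.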