Virtual Machine Unresponsiveness Following NVMe Command Aborts and Controller Recovery in ESXi 8.0.3

Article ID: 436611

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • Virtual machines running on NVMe-backed datastores become unresponsive.

  • The guest OS becomes unresponsive and logs "watchdog: BUG: soft lockup" errors in the kernel ring buffer.

Environment

VMware vSphere ESXi 8.x

VMware vSphere ESXi 9.x

Cause

The unresponsiveness is caused by I/O stalls and aborts that occur when an NVMe controller cannot process I/O and reports to ESXi that it is entering recovery mode.

/var/log/vmkwarning.log reports:
WARNING: NVMEIO:4346 Controller ### in state 8 or in recovery mode, bail out.

During this state transition, outstanding I/O operations are blocked or delayed beyond the guest operating system's internal watchdog thresholds.

ESXi may report:

  • Driver aborts of I/O in /var/log/vmkwarning.log: WARNING: lpfc: lpfc_xmit_admin_cmd:1204: vmhbaX 1202 NVMe_Abort.
  • 0x371 (Command Aborted by Host): The ESXi host aborted a pending command that timed out, in order to prevent an I/O hang.
  • 0xc (Keep Alive Timeout): The Keep Alive timer expired, indicating that the NVMe controller did not respond in time, which triggers a recovery/reset.
  • VSCSI warnings indicating that WaitForCIF is issuing resets and ignoring double resets.

Resolution

To resolve this issue:

  • Identify the specific NVMe controller and path experiencing the aborts (for example, vmhba#, or vmhba# together with CTLR ##).

  • Engage your storage vendor to investigate and remediate the controller issues.
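
To help with the identification step, the following is a minimal Python sketch, not a VMware-provided tool, that tallies the two vmkwarning.log messages quoted under Cause by controller and by adapter. The regular expressions are assumptions derived only from the message templates shown in this article; real log lines may differ in detail, and the function names summarize and summarize_file are hypothetical. It can be run against a copy of the log on any system with Python 3.

```python
import re
from collections import Counter

# Patterns derived from the two message templates quoted in this article.
# This is an assumption: real lines may carry extra prefixes (timestamps,
# CPU/world IDs), so the patterns search anywhere in the line.
RECOVERY_RE = re.compile(
    r"Controller\s+(\S+)\s+in state\s+(\d+)\s+or in recovery mode"
)
ABORT_RE = re.compile(r"(vmhba\w+)\s+\d+\s+NVMe_Abort")


def summarize(lines):
    """Count recovery-mode messages per controller and aborts per adapter."""
    controllers = Counter()
    adapters = Counter()
    for line in lines:
        match = RECOVERY_RE.search(line)
        if match:
            controllers[match.group(1)] += 1
            continue
        match = ABORT_RE.search(line)
        if match:
            adapters[match.group(1)] += 1
    return controllers, adapters


def summarize_file(path):
    """Run summarize() over a log file, for example a copy of vmkwarning.log."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return summarize(handle)
```

Calling summarize_file("/var/log/vmkwarning.log") returns two counters: one keyed by controller number and one keyed by vmhba name. The entries with the highest counts are reasonable candidates to hand to the storage vendor along with the raw log lines.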