
Troubleshooting ESXi host lockup where console and network are unresponsive but hardware logs are clean.


Article ID: 414342


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

 

  • The ESXi host locked up and stopped logging, and it shows as "Not Responding" in vCenter around the time of the event.
  • There is a gap in vmkernel logging during the time the host was unresponsive.
  • The host does not respond to the F2 key on the out-of-band remote console; this is the clearest indicator that the entire ESXi kernel was frozen and not processing interrupts.
  • You may or may not see a PSOD.
  • Just before the crash, vmkernel.log contains many entries similar to the following (a command for searching the logs is shown after this list):
    YYYY-MM-DDTHH:MM:SS.###Z In(182) vmkernel: cpu44:2106112)NetPort: 708: Failed to acquire port non-exclusive lock 0x400000f[Failure].
    YYYY-MM-DDTHH:MM:SS.###Z In(182) vmkernel: cpu4:2106082)NetPort: 708: Failed to acquire port non-exclusive lock 0x4000012[Failure].
    YYYY-MM-DDTHH:MM:SS.###Z In(182) vmkernel: cpu58:2106112)NetPort: 708: Failed to acquire port non-exclusive lock 0x400000f[Failure].

     

    Additional symptoms reported:
  • The ESXi host stalled, triggering a vSphere HA event that restarted its virtual machines on other hosts in the cluster.
  • The host locked up but did not produce a purple screen; once the host was power-cycled, it started normally again.
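
To confirm whether a host logged these messages, the live and rotated vmkernel logs can be searched from the ESXi Shell. This is a minimal sketch; the rotated-log path assumes the default location under /var/run/log and may differ if the host uses a custom scratch or syslog configuration.

    # Search the current vmkernel log for the NetPort lock failures
    grep "Failed to acquire port non-exclusive lock" /var/log/vmkernel.log
    # Also search the rotated (compressed) vmkernel logs
    zcat /var/run/log/vmkernel.*.gz | grep "Failed to acquire port non-exclusive lock"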

Environment

VMware vSphere ESXi 8.X

VMware vSphere ESXi 7.X

Cause

ESXi kernel deadlock caused by a buggy or incompatible NIC driver/firmware combination

  • NetPort: 708: Failed to acquire port non-exclusive lock 0x40000XX[Failure]. This error indicates that processes were failing to acquire the kernel locks needed to manage networking resources (virtual switch ports, and so on). The problem escalates under high load and is a known precursor to an ESXi host lockup or PSOD, pointing to a severe race condition or a buggy driver holding the lock indefinitely.

 

 

Resolution

  1. If the issue reoccurs, check whether logging under /var/log is up to date by listing the logs by modification time (see the log-freshness example after this list):
    ls -lthra /var/log
  2. Alternatively, compare the timestamp of the last vmkernel.log entry with the host's current time:
    tail /var/log/vmkernel.log
    date
  3. If the logs are not up to date, engage the hardware vendor to assist with sending an NMI to the ESXi host. This should be done before rebooting the host (a check to confirm the host has a coredump target configured is shown after this list). Read more at Using hardware NMI facilities to troubleshoot unresponsive hosts.
  4. Once the dump has been collected, open a case with Broadcom and provide the dump. 
  5. Before forcing an NMI or opening a case for deeper analysis, validate the current host configuration against the supported matrix. Incompatible drivers or firmware can often be the root cause of kernel deadlocks, especially related to the network stack.
  6. Verify that the ESXi version, NIC firmware version, and NIC driver version are a compatible combination by checking the Broadcom Compatibility Guide (commands to collect the driver and firmware versions are shown after this list): https://compatibilityguide.broadcom.com/
  7. Power cycle the server.
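
For steps 1 and 2, a quick way to confirm whether logging had stalled is to compare the host's current time with the newest entries under /var/log from the ESXi Shell. This is a minimal sketch; vmkernel.log is normally a symlink into /var/run/log (or the configured scratch location).

    # List the logs by modification time; the most recently updated files appear last
    ls -lthra /var/log
    # Compare the timestamp of the latest vmkernel entries (UTC) with the current time
    date -u
    tail -n 5 /var/log/vmkernel.log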
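
Before the hardware vendor triggers the NMI (steps 3 and 4), it is worth confirming that the host has a coredump target configured so the resulting purple screen actually produces a dump that can be attached to the Broadcom case. A sketch of the checks, using the standard esxcli coredump namespace:

    # Check whether a coredump partition is configured and active
    esxcli system coredump partition get
    # Check whether a coredump file is configured and active
    esxcli system coredump file get
    esxcli system coredump file list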
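
For steps 5 and 6, the driver and firmware versions to compare against the Broadcom Compatibility Guide can be collected from the host itself. A minimal sketch, assuming vmnic0 is the uplink of interest (repeat for each NIC); the driver name used in the last command is whatever the first command reports:

    # List the physical NICs and the driver each one uses
    esxcli network nic list
    # Show driver version and firmware version for a specific NIC (replace vmnic0 as needed)
    esxcli network nic get -n vmnic0
    # Show the installed driver VIB and its version (replace <driver-name> with the driver reported above)
    esxcli software vib list | grep -i <driver-name>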