ESXi host shows "Not Responding" in vCenter due to hostd unresponsiveness and DCBD errors

Article ID: 433257

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

An ESXi host enters a "Not Responding" state in vCenter Server. This occurs when the hostd service, which handles host-level management and communication with vCenter, hangs because an underlying interaction with hardware fails to complete.

Symptoms:

  • ESXi host status in vCenter Server is Not Responding.

  • Management agents may be unreachable via SSH or local console.

  • /var/log/vmkwarning.log:

    YYYY-MM-DDTHH:MM:SSZ vmkalert: ALERT: hostd detected to be non-responsive
    
  • /var/log/syslog.log:

    dcbd: [error] set_hw_pg: Failed status(195887136)
    vmkbacktrace: Creating trace file /var/run/log/hostd-probed-#######.dmp
    
  • The set_hw_pg: Failed error is logged by the Data Center Bridging Daemon (DCBD) when it cannot apply its priority group (PG) configuration to the NIC hardware, which typically means the NIC driver or firmware did not complete the request.
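
The two log signatures above can be counted with a short script. This is a minimal sketch, not a vendor-provided tool: the function names count_matches and check_signatures and the two override variables are hypothetical, and the log paths and message strings are taken from the symptoms listed above.

```shell
#!/bin/sh
# Sketch only: count the log signatures described in this article.
# The defaults are the log paths named above; set the variables to scan
# copies of the logs (for example, from an extracted support bundle).
VMKWARNING_LOG="${VMKWARNING_LOG:-/var/log/vmkwarning.log}"
SYSLOG_LOG="${SYSLOG_LOG:-/var/log/syslog.log}"

# count_matches PATTERN FILE: print how many lines of FILE contain PATTERN.
count_matches() {
    # -F matches the pattern as a fixed string; -c prints the line count.
    # grep exits non-zero on zero matches or a missing file, so ignore that.
    grep -F -c -- "$1" "$2" 2>/dev/null || true
}

# check_signatures: report both counts, printing 0 when a log is missing.
check_signatures() {
    hostd_hits=$(count_matches "hostd detected to be non-responsive" "$VMKWARNING_LOG")
    dcbd_hits=$(count_matches "set_hw_pg: Failed" "$SYSLOG_LOG")
    echo "hostd non-responsive alerts: ${hostd_hits:-0}"
    echo "dcbd set_hw_pg failures: ${dcbd_hits:-0}"
}
```

After the functions are defined, running check_signatures prints both counts. Non-zero values for both are consistent with the symptoms described here, but they do not by themselves confirm the cause.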


Environment

  • VMware ESXi 8.x

Cause

The issue is triggered when the Network Interface Card (NIC) driver hangs or crashes, typically because of a mismatch between the driver and the NIC firmware. When the DCBD service attempts to interact with the NIC hardware to manage priority flow control or bandwidth allocation and the hardware fails to respond, the management agents (specifically hostd) can hang while waiting for the hardware I/O to complete.

Resolution

To restore management connectivity and prevent recurrence, follow these steps:

1. Temporary Workaround: Disable the DCBD Service

If the host is accessible via SSH or ESXi Shell, stopping the DCBD service can mitigate the unresponsiveness without requiring a reboot in some scenarios.

  • Log in as root via SSH.

  • Stop the DCBD service:

    /etc/init.d/dcbd stop
    
  • Monitor host connectivity in vCenter. Note: If hostd is completely hung, a host reboot may be required to clear the hung processes.
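
The stop command above can be wrapped in a small guard so that a missing init script or a failed stop is reported instead of passing silently. This is a sketch under assumptions: the function name stop_dcbd and the DCBD_INIT override are hypothetical, and the exit-code handling assumes the init script returns non-zero on failure, which this article does not confirm.

```shell
#!/bin/sh
# Sketch only: guarded wrapper around the stop command from this article.
# DCBD_INIT defaults to the init script path given above; the override is a
# hypothetical convenience for dry runs against a stub script.
DCBD_INIT="${DCBD_INIT:-/etc/init.d/dcbd}"

stop_dcbd() {
    # Refuse to continue if the init script is absent or not executable.
    if [ ! -x "$DCBD_INIT" ]; then
        echo "dcbd init script not found at $DCBD_INIT" >&2
        return 1
    fi
    if "$DCBD_INIT" stop; then
        echo "dcbd stop requested; check the host state in vCenter"
    else
        # Assumption: a non-zero exit code from the script means the stop failed.
        rc=$?
        echo "dcbd stop failed with exit code $rc" >&2
        return "$rc"
    fi
}
```

Calling stop_dcbd as root on the affected host issues the same stop request as the manual step and changes nothing else on the host.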

2. Verify Physical and Driver Compatibility

  • Identify the current NIC driver and firmware versions using:

    esxcli network nic get -n vmnicX
    
  • Cross-reference the versions with the VMware Compatibility Guide (VCG).

  • Special Note for ESXi 8.0: Modern high-speed drivers (e.g., nmlx5_core) are highly sensitive to firmware revisions. Ensure the installed driver and firmware versions match a combination listed in the VCG.
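
To make the comparison against the VCG easier, the driver and firmware fields can be pulled out of the command output. This is a minimal sketch: the function name nic_versions is hypothetical, and it assumes the output contains lines labelled "Driver", "Version", and "Firmware Version", as is typical for this command. Adjust the labels if the output on your build differs.

```shell
#!/bin/sh
# Sketch only: summarize driver and firmware details for one NIC.
# Reads the output of `esxcli network nic get -n vmnicX` on stdin.
# Assumption: the fields are labelled "Driver", "Version", "Firmware Version".
nic_versions() {
    awk '
        {
            line = $0
            sub(/^[ \t]+/, "", line)       # drop leading indentation
            sep = index(line, ": ")        # split on the first "label: value"
            if (sep == 0) next             # skip section headers and blanks
            key = substr(line, 1, sep - 1)
            val = substr(line, sep + 2)
            if (key == "Driver") driver = val
            else if (key == "Version") version = val
            else if (key == "Firmware Version") firmware = val
        }
        END {
            printf "driver=%s version=%s firmware=%s\n", driver, version, firmware
        }
    '
}
```

For example, esxcli network nic get -n vmnic0 | nic_versions prints one summary line for vmnic0 (an example adapter name), which can then be checked against the VCG entry for that adapter.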

3. Remediation with Hardware Vendor

  • Update the NIC firmware and driver to a supported, matching pair as recommended by the hardware vendor (e.g., Dell, HPE, Cisco).

  • If the issue persists after the updates, run the vendor's hardware diagnostics to rule out a failing PCIe device or converged network adapter (CNA).

Additional Information