How to determine ESXi hostd unresponsiveness and data to be captured

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

This document covers only troubleshooting ESXi Server hostd unresponsiveness and the data that needs to be gathered for further analysis.

The ESXi Server and its Virtual Machines cannot be managed because the server is in a "Not Responding" or "Disconnected" state.
Several other issues can also lead to a "Not Responding" or "Disconnected" state of the ESXi Server. See KB - Troubleshooting an ESXi host in a "not responding" state

Environment

VMware vSphere ESXi 6.x

VMware vSphere ESXi 7.x

VMware vSphere ESXi 8.x

Cause

Resolution

1. Detect the non-responsive hostd:
  1. Check for the "hostd detected to be non-responsive" alert message in the vmkernel* logs.
  2. Check the hostd-probe* logs for timeout messages, and check whether the hostd log has stopped updating.
    You will see entries similar to the following:
    yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] Logging uses fast path: true
    yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] The bora/lib logs WILL be handled by VmaCore
    yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] Initialized channel manager
    yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] Current working directory: /var/log/vmware
    yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=FairScheduler] Priority level 4 is now active.
    yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=FairScheduler] Priority level 8 is now active.
    yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=FairScheduler] Priority level 16 is now active.
    yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] Syscommand enabled: true
    yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] ReaperManager Initialized
    yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] Current process ID: 247108
    yyyy-mm-ddThh:mm:ss.369Z warning hostd-probe[9179840] [Originator@6876 sub=Default] Timeout: N7Vmacore16TimeoutExceptionE(Operation timed out)
    --> [context]zKq7AUoCAgAAAItAagAMaG9zdGQtcHJvYmUAAL/FLWxpYnZtYWNvcmUuc28AADJaEgAP0A0AYPkVAWpFDWxpYnZtb21pLnNvAAFsSw0BwIUPAjmsxmxpYnZpbS10eXBlcy5zbwADXkEAaG9zdGQtcHJvYmUAA68yAARniwFsaWJjLnNvLjYAA2k2AA==[/context]
    hostd detected to be non-responsive
  3. Running the command vim-cmd /vmsvc/getallvms status may not return any output.
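
The two log checks above can be sketched as a small shell helper. This is a minimal sketch rather than a supported tool: the function names count_matches and check_hostd_logs are hypothetical, and the log files are passed in as arguments because their exact names and locations on a given host are an assumption. It looks for the alert string in the vmkernel log and for the TimeoutException text shown in the sample hostd-probe entries.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, log paths are supplied by the caller.

# Print the number of lines in file $1 that contain the fixed string $2.
count_matches() {
    # -c prints a count, -F treats the pattern as a fixed string.
    # "|| true" keeps a zero-match or missing-file result from aborting the caller.
    grep -cF -- "$2" "$1" 2>/dev/null || true
}

# Report whether the given logs suggest hostd is non-responsive.
# $1 = vmkernel log file, $2 = hostd-probe log file
# Returns 1 when at least one indicator is found, 0 otherwise.
check_hostd_logs() {
    vmk_hits=$(count_matches "$1" "hostd detected to be non-responsive")
    probe_hits=$(count_matches "$2" "TimeoutException")
    echo "vmkernel alerts: ${vmk_hits:-0}"
    echo "hostd-probe timeouts: ${probe_hits:-0}"
    if [ "${vmk_hits:-0}" -gt 0 ] || [ "${probe_hits:-0}" -gt 0 ]; then
        echo "hostd appears non-responsive"
        return 1
    fi
    echo "no non-responsive indicators found"
    return 0
}
```

For example, check_hostd_logs /var/log/vmkernel.log /var/log/hostd-probe.log, assuming those are the log locations on the host. A clean result from this sketch does not rule out a hung hostd; it only means neither string was found in the files given.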
NOTE: Restarting the management agents may impact any tasks that are running on the ESXi host at the time of the restart

Check for any storage issues before restarting the host daemon (hostd) service or running services.sh.

Refer to Restarting the Management agents in ESXi (1003490)

B. Stop hostd using service or hostd command. For more information, see:
- hostd is not responding and cannot be killed by the kill -9 command (1007261)
- Service mgmt-vmware restart may not restart hostd in ESX/ESXi (1005566)
C. Alternatively, shut down the VMs from the command line if you have shell access to the host through PuTTY or the DCUI and can’t access the VMs directly for some reason. See Unable to Power off a Virtual Machine in an ESXi host


        Command to check whether a VM is running on an ESXi host and to get its World ID: # localcli vm process list

        Command to shut down a VM: # localcli vm process kill -t soft -w <worldID>

*Using 'soft', as above, is the most graceful shutdown. If that doesn't work, use 'hard' instead to perform an immediate shutdown. The option 'force' should be used as a last resort.
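
The soft, hard, force escalation described above can be sketched as a loop. This is a minimal sketch under stated assumptions: LOCALCLI and WAIT_SECS are hypothetical variables added for illustration, the 30-second wait is an arbitrary choice, and the check assumes the process list prints one line of the form "World ID: <number>" per running VM.

```shell
#!/bin/sh
# Sketch only: LOCALCLI and WAIT_SECS are hypothetical knobs, not product settings.
LOCALCLI="${LOCALCLI:-localcli}"
WAIT_SECS="${WAIT_SECS:-30}"

# Return 0 if world ID $1 still appears in the VM process list.
# Assumes the list contains lines of the form "World ID: <number>".
vm_is_running() {
    $LOCALCLI vm process list | grep -Eq "World ID: $1[[:space:]]*\$"
}

# Try each kill type in order (soft, hard, force) until the VM is gone.
# Returns 0 once the VM no longer appears in the list, 1 if it survives all three.
kill_vm_escalating() {
    world_id="$1"
    for kill_type in soft hard force; do
        # Stop escalating as soon as the VM has gone away.
        vm_is_running "$world_id" || return 0
        echo "Trying $kill_type kill of world $world_id"
        $LOCALCLI vm process kill -t "$kill_type" -w "$world_id" || true
        # Give the VM time to shut down before checking again.
        sleep "$WAIT_SECS"
    done
    ! vm_is_running "$world_id"
}
```

For example, kill_vm_escalating <worldID> with the World ID taken from the process list. Escalating automatically removes the chance to investigate between attempts, so running the three commands by hand may be preferable on a production host.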

NOTE: It is important that any underlying storage issue is fixed for hostd service to respond properly.

Reboot only as a last resort. DO NOT attempt to reboot if you are using VMware vSAN or hyperconverged infrastructure servers such as Nutanix or Cisco HyperFlex. Before rebooting, follow these steps to collect the logs that will be needed if the cause is to be investigated further.

        Create hostd dump from memory by running this on the host:   vmkbacktrace -n hostd -c -w

        Check that it is there with the output of this command:    ls -alrth /var/core/hostd*

                  *looks like: -rwx------    1 root     root       32.8M Aug 15 05:10 /var/core/hostd-worker-zdump.001

        Connect to the host with WinSCP, Filezilla, etc., and download the file.
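
The check that the dump file exists can be sketched as a helper that can gate the download step. This is a minimal sketch: the function name list_hostd_dumps is hypothetical, and the directory defaults to /var/core as in the ls command above.

```shell
#!/bin/sh
# Sketch only: list_hostd_dumps is a hypothetical helper name.

# Print the hostd dump files found in directory $1 (default /var/core), one per line.
# Returns 0 when at least one file is found, 1 when none are.
list_hostd_dumps() {
    core_dir="${1:-/var/core}"
    found=1
    for f in "$core_dir"/hostd*; do
        # An unmatched glob stays literal, so confirm the file really exists.
        [ -f "$f" ] || continue
        echo "$f"
        found=0
    done
    return "$found"
}
```

For example, list_hostd_dumps || echo "no hostd dump found, rerun the dump command" before starting the download.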

Reboot the host, confirm that it reconnects to vCenter and looks healthy, then power on the VMs or migrate them back to the ESXi host as needed.

1. To collect hostd live core (hostd-worker-zdump.*) run this command.

vmkbacktrace -n hostd -c -w

2. To collect a vmkernel-zdump without affecting the running VMs, run these commands:

localcli --plugin-dir /usr/lib/vmware/esxcli/int/ debug livedump perform
esxcfg-dumppart -C -D active

The core dump will then be gathered. This process can take some time, as it does during a PSOD. When the process is completed, you will be returned to the command prompt.

Once the core dump has been collected and the process is finished, gather a vm-support bundle to collect the logging, system state and livecore for root cause analysis.

Open a Broadcom Support case with the captured data.

Additional Information

For more information, see:

For more information on avoiding common storage related issues, see