How to determine ESXi hostd unresponsiveness and data to be captured

Article ID: 340041


Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

This document covers only how to troubleshoot ESXi Server hostd unresponsiveness and which data to gather for further analysis.

  • The ESXi Server and its Virtual Machines cannot be managed because the server is in a "Not Responding" or "Disconnected" state.
  • Several other issues can also lead to a "Not Responding" or "Disconnected" state of the ESXi Server. See the KB article Troubleshooting
    an ESXi host in a "not responding" state.

Environment

VMware vSphere ESXi 7.x

VMware vSphere ESXi 8.x

Cause


 

Resolution

    1. Detect the non-responsive hostd:
      1. Check for the "hostd detected to be non-responsive" alert message in the vmkernel* logs.

      2. Check the hostd-probe* logs for timeout messages, and check whether the hostd log has stopped being updated.
        You will see entries similar to the following:
        yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] Logging uses fast path: true
        yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] The bora/lib logs WILL be handled by VmaCore
        yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] Initialized channel manager
        yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] Current working directory: /var/log/vmware
        yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=FairScheduler] Priority level 4 is now active.
        yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=FairScheduler] Priority level 8 is now active.
        yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=FairScheduler] Priority level 16 is now active.
        yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] Syscommand enabled: true
        yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] ReaperManager Initialized
        yyyy-mm-ddThh:mm:ss.369Z info hostd-probe[9179840] [Originator@6876 sub=Default] Current process ID: 247108
        yyyy-mm-ddThh:mm:ss.369Z warning hostd-probe[9179840] [Originator@6876 sub=Default] Timeout: N7Vmacore16TimeoutExceptionE(Operation timed out)
        --> [context]zKq7AUoCAgAAAItAagAMaG9zdGQtcHJvYmUAAL/FLWxpYnZtYWNvcmUuc28AADJaEgAP0A0AYPkVAWpFDWxpYnZtb21pLnNvAAFsSw0BwIUPAjmsxmxpYnZpbS10eXBlcy5zbwADXkEAaG9zdGQtcHJvYmUAA68yAARniwFsaWJjLnNvLjYAA2k2AA==[/context]
        hostd detected to be non-responsive

      3. Running the command vim-cmd /vmsvc/getallvms may not return any output.
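
    The log checks above can be sketched as a small shell helper. This is a minimal sketch, not an official tool: the default log paths (/var/log/vmkernel.log and /var/log/hostd-probe.log) and the function names are assumptions for illustration, and the search strings are taken from the sample entries shown above.

```shell
# Sketch only: default log paths are assumptions; adjust them for your host.
VMKERNEL_LOG="${VMKERNEL_LOG:-/var/log/vmkernel.log}"
HOSTD_PROBE_LOG="${HOSTD_PROBE_LOG:-/var/log/hostd-probe.log}"

# Print how many lines of file $2 contain the fixed string $1 (0 if unreadable).
count_matches() {
    pattern="$1"
    file="$2"
    if [ -r "$file" ]; then
        # grep -c prints 0 and exits non-zero when nothing matches; ignore that status.
        grep -c -F -- "$pattern" "$file" || true
    else
        echo 0
    fi
}

# Return 0 if either log shows a sign of hostd unresponsiveness, 1 otherwise.
hostd_looks_unresponsive() {
    alerts=$(count_matches "hostd detected to be non-responsive" "$VMKERNEL_LOG")
    timeouts=$(count_matches "TimeoutException" "$HOSTD_PROBE_LOG")
    echo "vmkernel alerts: $alerts, hostd-probe timeouts: $timeouts"
    [ "$alerts" -gt 0 ] || [ "$timeouts" -gt 0 ]
}
```

    Calling hostd_looks_unresponsive prints both counts and sets the exit status, so it can be used in an if statement. A zero count does not prove that hostd is healthy; it only means these two signatures were not found in the files checked.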


    NOTE: Restarting the management agents may impact any tasks that are running on the ESXi host at the time of the restart.

    Check for any storage issues before restarting the host daemon (hostd) service or running services.sh.

    Refer to Restarting the Management agents in ESXi.

          B. Stop hostd using the service or hostd command. For more information, see:

          C. Alternatively, use the VM shutdown method if you have command-line access to the host through PuTTY or the DCUI shell and cannot access the VMs directly for some reason. See Unable to Power off a Virtual Machine in an ESXi host.

            Command to see if a VM is running on an ESXi host and get the World ID: # localcli vm process list

            Command to shut down a VM: # localcli vm process kill -t soft -w <worldID>

        *Using 'soft', as above, is the most graceful shutdown. If that doesn't work, use 'hard' instead to perform an immediate shutdown. The option 'force' should be used as a last resort.
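
    The escalation from 'soft' to 'hard' described above can be sketched as follows. This is a minimal sketch under stated assumptions: it assumes that localcli vm process list prints a "World ID: <id>" line for each running VM, and the function names and the WAIT_SECS grace period are hypothetical. The sketch deliberately stops before 'force' and leaves that decision to the operator.

```shell
# Sketch only: WAIT_SECS is a hypothetical grace period between attempts.
WAIT_SECS="${WAIT_SECS:-30}"

# Return 0 if the given World ID still appears in the process list.
# Assumes the list output contains a line of the form "World ID: <id>".
vm_world_running() {
    localcli vm process list | grep -Eq "World ID: $1[[:space:]]*\$"
}

# Try 'soft', then 'hard'. Return 0 once the VM is gone, 1 if it is still running.
stop_vm_world() {
    world_id="$1"
    for kill_type in soft hard; do
        vm_world_running "$world_id" || return 0
        echo "Attempting '$kill_type' kill of world $world_id"
        localcli vm process kill -t "$kill_type" -w "$world_id" ||
            echo "kill command reported an error" >&2
        sleep "$WAIT_SECS"
    done
    if vm_world_running "$world_id"; then
        echo "World $world_id is still running; consider '-t force' as a last resort" >&2
        return 1
    fi
}
```

    For example, stop_vm_world 1234 would try a soft kill of World ID 1234, wait, and try a hard kill only if the VM is still listed.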

    NOTE: Any underlying storage issue must be fixed for the hostd service to respond properly.
  • Reboot only as a last resort. DO NOT attempt to reboot if you are using VMware vSAN or Hyperconverged Infrastructure Servers such as Nutanix, Cisco HyperFlex, etc. Before rebooting, follow these steps to capture the logs that will be needed for further analysis of the cause.

            Create a hostd dump from memory by running this on the host: vmkbacktrace -n hostd -c -w

            Confirm that the dump exists by checking the output of this command: ls -alrth /var/core/hostd*

                      *The output looks like: rwx------    1 root     root       32.8M Aug 15 05:10 /var/core/hostd-worker-zdump.001

            Connect to the host with WinSCP, FileZilla, etc., and download the file.

      Reboot the host, ensure that it connects to vCenter and looks healthy, and power on VMs or migrate VMs back to the ESXi host as needed.
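
    The dump-and-verify steps above can be combined into one helper so that a missing dump is noticed before the reboot. This is a minimal sketch: the function name and the CORE_DIR variable are hypothetical, and it assumes the dump is written to /var/core with a name starting with hostd, as in the sample output above.

```shell
# Sketch only: CORE_DIR defaults to the directory shown in the steps above.
CORE_DIR="${CORE_DIR:-/var/core}"

# Create a hostd live core, then list any hostd dump files found.
# Returns 0 if at least one dump file exists, 1 otherwise.
capture_hostd_core() {
    vmkbacktrace -n hostd -c -w || return 1
    found=1
    for f in "$CORE_DIR"/hostd*; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$f" ] || continue
        ls -alh "$f"
        found=0
    done
    if [ "$found" -ne 0 ]; then
        echo "No hostd dump found in $CORE_DIR" >&2
    fi
    return "$found"
}
```

    Rebooting only after capture_hostd_core returns success, and after the file has been downloaded, helps ensure the dump is kept for later analysis.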

    1. To collect a hostd live core (hostd-worker-zdump.*), run this command:

      vmkbacktrace -n hostd -c -w

    2. To collect a vmkernel-zdump without affecting the running VMs, run these commands:

      localcli --plugin-dir /usr/lib/vmware/esxcli/int/ debug livedump perform
      esxcfg-dumppart -C -D active

The core dump will then be gathered. This process can take some time, as it does during a PSOD. When the process is completed, you will be returned to the command prompt.

Once the core dump has been collected and the process is finished, gather a vm-support bundle to collect the logging, system state and livecore for root cause analysis.
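
The collection order described above (hostd live core, then vmkernel live dump, then the support bundle) can be sketched as a single function. This is a minimal sketch: the function name is hypothetical, vm-support is run with no options, and the sketch stops at the first command that reports a failure so that an incomplete capture is noticed.

```shell
# Sketch only: runs the collection commands from this article in order.
collect_hostd_diagnostics() {
    echo "Step 1: hostd live core"
    vmkbacktrace -n hostd -c -w || return 1

    echo "Step 2: vmkernel live dump"
    localcli --plugin-dir /usr/lib/vmware/esxcli/int/ debug livedump perform || return 1
    esxcfg-dumppart -C -D active || return 1

    # The bundle is gathered last so that it can include the dumps created above.
    echo "Step 3: support bundle"
    vm-support || return 1
}
```

Running the bundle step last follows the order given in this article, where the vm-support bundle is gathered once the core dump has been collected.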

Open a Broadcom Support case and attach the captured data.



Additional Information