Hostd service crashing without core dumps on stateless hosts

Article ID: 437517

Products

VMware vSphere ESXi

Issue/Introduction

  • A recent storage outage left the datastores completely unavailable for a period of time
  • The ESXi host is marked "not responding" in vCenter for an extended period; it may briefly show as connected, but vCenter still raises "Host power state" alerts for it
  • The ESXi host UI is available and hostd is reported as running
  • Powered-on VMs run normally, and some tasks, such as creating a VM, may succeed
  • When hostd is restarted, it starts correctly but goes into "not running" status after about 20 seconds, and hostd-probe generates a dump
  • The same powered-off VM is registered on every ESXi host (as seen with vim-cmd vmsvc/getallvms)
  • That VM has reconfigure or unregister tasks pending against it with timestamps from hours earlier (seen with the commands from KB Collecting information about tasks in VMware ESXi)
  • The hostd log has messages similar to the following for the VM that is registered on multiple hosts:

<TIMESTAMP> Wa(164) Hostd[2099058]: [Originator@6876 sub=IoTracker] In thread 2099049, open("/vmfs/volumes/########-########-####-############/<VM_NAME>/<VM_NAME>.vmx.lck") took over 135141 sec.

  • Manually attempting to unregister the VM hangs or times out
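
The IoTracker warning shown above can be searched for directly. The following is a minimal sketch, not an official tool: the function name slow_lck_opens is hypothetical, and the log location /var/log/hostd.log is an assumption that should be adjusted if logs are redirected (as is common on stateless hosts). It prints the reported wait time in seconds followed by the lock file path for each matching line.

```shell
#!/bin/sh
# Hypothetical helper: list slow open() calls on .vmx.lck files that hostd's
# IoTracker reported in the log file given as $1.
# Output format: "<seconds> <path to .vmx.lck>"
slow_lck_opens() {
    # Keep only IoTracker lines, then extract the path and the wait time.
    grep 'sub=IoTracker' "$1" |
        sed -n 's/.*open("\(.*\.vmx\.lck\)") took over \([0-9][0-9]*\) sec.*/\2 \1/p'
}

# Example call (log path is an assumption):
#   slow_lck_opens /var/log/hostd.log
```

A wait time of many thousands of seconds on the same .vmx.lck path across several hosts is consistent with the multi-registered VM described in this article.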

Environment

vSphere 8.0 U3

Cause

During the storage outage, the VM attempted to fail over, and DRS tried to place it on different hosts multiple times, as in the following example:

<TIMESTAMP> error vpxd[09695] [Originator@6876 sub=VmProv opID=CdrsLoadBalancer-########-########-01-01] Local-VC Host Migrate failed at vpx.vmprov.PrepareSource for poweredOn VM <VM_NAME>' (vm-####, ds:///vmfs/volumes/########-########-####-############/<VM_NAME>/<VM_NAME>.vmx) on host-#### (#.#.#.#) in pool resgroup-#### with ds ds:///vmfs/volumes/########-########-####-############/ to host-#### (#.#.#.#) in pool resgroup-####with ds ds:///vmfs/volumes/########-########-####-############/ with migId 394391#####4980737 with fault vim.fault.QuestionPending:  as Operation: Local-VC_DRS_NonMM_ComputevMotion
-->    text = "msg.hbacommon.locklost:The lock protecting '<VM_NAME>.vmdk' has been lost, possibly due to underlying storage issues.

Because of the storage issues, the migrations failed. Once storage became available again, the pending register tasks succeeded, and each host registered the VM.

vCenter then attempts to retrieve the VM's state, which the hosts cannot provide, and this causes hostd to become inconsistent.

vpxd then triggers tasks to unregister the VM from one or more hosts:

<TIMESTAMP> info vpxd[09735] [Originator@6876 sub=InvtVm opID=HB-host-####@17830-#########] Unregister discovered VM (vm-####, ds:///vmfs/volumes/########-########-####-############/<VM_NAME>/<VM_NAME>.vmx) based on URL. Current host : <HOSTNAME1>, Host where it was found: <HOSTNAME2>

These unregister tasks are triggered faster than hostd can process them.
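
To get a rough idea of how often vpxd issued these requests, the "Unregister discovered VM" entries can be counted per VM. This is a sketch under assumptions: count_unregister_requests is a hypothetical helper name, and the log path in the example call is the usual vpxd log location on the vCenter appliance, which may differ in a given environment.

```shell
#!/bin/sh
# Hypothetical helper: count "Unregister discovered VM" entries per VM
# managed object ID (vm-####) in the vpxd log file given as $1.
# Output format: "<count> <vm-id>", one line per VM.
count_unregister_requests() {
    grep 'Unregister discovered VM' "$1" |
        sed -n 's/.*Unregister discovered VM (\(vm-[^,]*\),.*/\1/p' |
        sort | uniq -c |
        awk '{ print $1, $2 }'   # normalize the padding that uniq -c adds
}

# Example call (log path is an assumption):
#   count_unregister_requests /var/log/vmware/vpxd/vpxd.log
```

A single VM ID with a high count in a short time window matches the pattern described in this section.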

Resolution

Clean up the multi-registered VM from vCenter and from all ESXi hosts. Note that this unregisters the VM from the ESXi hosts and removes it from the vCenter inventory; it does not delete the VM from disk:

  1. Take a snapshot of vCenter in accordance with KB Snapshot Best practices for vCenter Server Virtual Machines
  2. Follow KB Manually removing stale or orphaned virtual machines from vCenter Server 7.x and 8.x to remove the VM from the vCenter inventory
  3. Before starting vpxd again as part of the above KB, run the following command in the vCenter database to mark the ESXi hosts as disconnected. This ensures that the ESXi hosts do not push the VM entry to vCenter again:

    update vpx_host set enabled = 0;
  4. Start the vpxd service
  5. Reboot all ESXi hosts. This cleans up hostd and, because the hosts are stateless, the problematic VM is automatically unregistered
  6. Verify that the VM is gone from the ESXi hosts themselves, then reconnect the hosts to vCenter as needed
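
The verification in the last step can be sketched as a small loop over the hosts. Everything named here is an assumption for illustration: the helper names vm_registered and check_hosts are hypothetical, SSH access as root must be enabled on each host, and the match relies on the .vmx file being named after the VM, as in the log examples in this article.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the `vim-cmd vmsvc/getallvms` output read
# from stdin contains a VM whose .vmx file is named "$1.vmx".
vm_registered() {
    grep -F -q "/$1.vmx"
}

# Hypothetical helper: check_hosts <vm-name> <host>...
# Queries each host over SSH and reports whether the VM is still registered.
check_hosts() {
    vm_name=$1
    shift
    for host in "$@"; do
        # A failed query is reported separately so it is not mistaken for "gone".
        if ! listing=$(ssh "root@$host" vim-cmd vmsvc/getallvms); then
            echo "$host: could not query host"
            continue
        fi
        if printf '%s\n' "$listing" | vm_registered "$vm_name"; then
            echo "$host: $vm_name is still registered"
        else
            echo "$host: $vm_name not found"
        fi
    done
}

# Example call (names are placeholders):
#   check_hosts my-vm esx01.example.com esx02.example.com
```

Reconnect a host to vCenter only after it reports that the VM is not found.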

Additional Information

For stateful hosts, the VM entry must be manually unregistered after the reboot and before the host is reconnected to vCenter.

Get the ID of the VM at the host level:

vim-cmd vmsvc/getallvms

Unregister VM:

vim-cmd vmsvc/unregister <ID>
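
The two commands above can be combined so that the ID does not have to be copied by hand. This is a minimal sketch: vmid_for_vmx is a hypothetical helper name, and it assumes the .vmx file is named after the VM. It reads the getallvms listing from stdin and prints the Vmid of any matching entry, skipping the header and any wrapped annotation lines.

```shell
#!/bin/sh
# Hypothetical helper: print the Vmid of every entry in the
# `vim-cmd vmsvc/getallvms` output (stdin) whose File column
# contains "/$1.vmx".
vmid_for_vmx() {
    # Only lines starting with a numeric Vmid are real VM rows.
    awk -v pat="/$1.vmx" '$1 ~ /^[0-9]+$/ && index($0, pat) { print $1 }'
}

# Example use on the ESXi host (VM name is a placeholder):
#   id=$(vim-cmd vmsvc/getallvms | vmid_for_vmx "my-vm")
#   [ -n "$id" ] && vim-cmd vmsvc/unregister "$id"
```

Check the printed ID against the full getallvms listing before unregistering, since a name prefix shared by several VMs could match more than one row.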