ESXi hosts are not responding in vCenter due to failing to allocate objectCacheHeap
Article ID: 314219

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • ESXi hosts are not responding in vCenter due to failing to allocate objectCacheHeap.

  • After restarting the hostd process, the host becomes unresponsive again.

Environment

VMware vSphere ESXi 7.x

Cause

  • An Object Cache (OC) exists on each filesystem (such as VMFS, NFS, or DevFS) and stores cache data for all opened objects.

  • As long as the file or volume is being accessed, the corresponding OC object remains referenced. Once all references to the file or volume are closed, the related OC object is supposed to be flushed from the cache. However, in this case the OC entries fail to be flushed, so the objectCacheHeap gradually fills until allocations fail.

Resolution

This issue is resolved in VMware ESXi 7.0 Update 3i, build 20842708.

Additional Information

  • The number of objectCacheHeap allocation failures can be checked using vsish.
  • You can follow the example below; the heap address 0x430b66800000 will differ on each host.
/system/heaps/objectCache-0x430b66800000/> cat stats
Heap stats {
   Name:objectCache
   owning module id:0
   dynamically growable:1
   physical contiguity: 1 -> Any Physical Contiguity
   lower memory PA limit:0
   upper memory PA limit:-1
   may use reserved memory:0
   memory pool:228
   # of ranges allocated:1
   dlmalloc overhead:1032
   current heap size:34099624
   initial heap size:131072
   current bytes allocated:34074808
   current bytes available:24816
   current bytes releasable:288
   percent free of current size:0
   percent releasable of current size:0
   maximum heap size:34099624
   maximum bytes available:24816
   percent free of max size:0
   lowest percent free of max size ever encountered:0
   # of failure messages:0
   number of succeeded allocations:927979155
   number of failed allocations:680826 <-- Here
   number of freed allocations:927913628
   average size of an allocation:98304
   number of requests we try to satisfy per heap growth:2
   number of heap growth operations:53068
   number of heap shrink operations:20687
   Size of the physical pages backing this heap.:4096
}
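For monitoring, the failed-allocation counter can also be extracted from the stats output programmatically. The sketch below is an assumption, not part of the original article: the vsish command in the comment uses the example heap address from above (it differs per host), and the sample line parsed here is copied from the stats output shown above.

```shell
# On a live ESXi host the stats output would come from vsish, e.g.:
#   vsish -e cat /system/heaps/objectCache-0x430b66800000/stats
# (the heap address differs per host). Here a sample line of that
# output is parsed to pull out the failed-allocation count:
stats='number of failed allocations:680826'
failed=$(printf '%s\n' "$stats" | sed -n 's/.*number of failed allocations:\([0-9]*\).*/\1/p')
echo "failed allocations: $failed"
```

A value that keeps growing between checks indicates the heap is still under pressure.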


Workaround:

  • If the volumes related to the issue can be identified, keeping the affected datastore open is an effective workaround.
  • Below is an example from ESXi - /var/log/vmkernel.log where 'TEST' is the name of an affected datastore.
            
    $ grep -B3 -i evict vmkernel.log
    2022-10-06T22:09:37.898Z cpu67:2097774)Res3: 2572: Failed to lock cluster 17 (typeID 6) after 10 tries, aborting: caller 0x4200086850f4 vol TEST
    2022-10-06T22:09:37.898Z cpu67:2097774)WARNING: Vol3: 2848: 'TEST': Failed to clear journal address since JBC could not be Locked. This could result in leak of journal block at <type 6 addr 33554449>.
    2022-10-06T22:09:37.898Z cpu67:2097774)WARNING: Vol3: 2916: 'TEST': Failed to clear journal address in on-disk HB. This could result in leak of journal block at <type 6 addr 33554449>.
    2022-10-06T22:09:37.898Z cpu67:2097774)Vol3: 4191: Error closing the volume: Failure. Eviction fails.

    Using an SSH session, change directory to /vmfs/volumes/TEST. The SSH session accessing this directory then keeps the volume open and works around the issue.
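    The same reference can be held from a background process so it survives the interactive SSH session. This is a hedged sketch, not from the original article: the datastore path assumes 'TEST' as in the log example, and the sleep duration is arbitrary.

    ```shell
    # Hypothetical path: substitute the affected datastore's name for TEST.
    DS="/vmfs/volumes/TEST"
    # A backgrounded shell whose working directory is the datastore keeps a
    # reference to the volume open, so its OC entry stays referenced:
    ( cd "$DS" && exec sleep 86400 ) &
    echo "holder pid: $!"
    # Release the reference later with: kill <holder pid>
    ```

    Remember to kill the holder process once the host is patched, so the volume can be closed normally.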