ESXi hosts are not responding in vCenter due to failing to allocate objectCacheHeap
Article ID: 314219

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • ESXi hosts are not responding in vCenter due to failing to allocate objectCacheHeap.

  • After restarting the hostd process, the host becomes unresponsive again.

Environment

VMware vSphere ESXi 7.x

Cause

  • An Object Cache (OC) exists on each filesystem (such as VMFS, NFS, or DevFS) and stores cache data for all opened objects.

  • As long as the file or volume is being accessed, the corresponding OC object remains referenced. Once all references to the file or volume are closed, the related OC object is supposed to be flushed from the cache. However, in this case the OC entries fail to be flushed, so the objectCacheHeap gradually fills until allocations fail.

Resolution

This issue is resolved in VMware ESXi 7.0 Update 3i, build 20842708.

Additional Information

  • The number of objectCacheHeap allocation failures can be checked using vsish.
  • You can follow the example below; the heap address 0x430b66800000 will differ on each host.
/system/heaps/objectCache-0x430b66800000/> cat stats
Heap stats {
   Name:objectCache
   owning module id:0
   dynamically growable:1
   physical contiguity: 1 -> Any Physical Contiguity
   lower memory PA limit:0
   upper memory PA limit:-1
   may use reserved memory:0
   memory pool:228
   # of ranges allocated:1
   dlmalloc overhead:1032
   current heap size:34099624
   initial heap size:131072
   current bytes allocated:34074808
   current bytes available:24816
   current bytes releasable:288
   percent free of current size:0
   percent releasable of current size:0
   maximum heap size:34099624
   maximum bytes available:24816
   percent free of max size:0
   lowest percent free of max size ever encountered:0
   # of failure messages:0
   number of succeeded allocations:927979155
   number of failed allocations:680826 <-- Here
   number of freed allocations:927913628
   average size of an allocation:98304
   number of requests we try to satisfy per heap growth:2
   number of heap growth operations:53068
   number of heap shrink operations:20687
   Size of the physical pages backing this heap.:4096
}
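For monitoring, the failed-allocation counter can also be extracted from the stats output programmatically. The sketch below is an assumption, not part of the original article: the vsish command in the comment uses the example heap address from above (it differs per host), and the sample line parsed here is copied from the stats output shown above.

```shell
# On a live ESXi host the stats output would come from vsish, e.g.:
#   vsish -e cat /system/heaps/objectCache-0x430b66800000/stats
# (the heap address differs per host). Here a sample line of that
# output is parsed to pull out the failed-allocation count:
stats='number of failed allocations:680826'
failed=$(printf '%s\n' "$stats" | sed -n 's/.*number of failed allocations:\([0-9]*\).*/\1/p')
echo "failed allocations: $failed"
```

A value that keeps growing between checks indicates the heap is still under pressure.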


Workaround:

  • If the volumes related to the issue can be identified, keeping the affected datastore open is an effective workaround.
  • Below is an example from ESXi - /var/log/vmkernel.log where 'TEST' is the name of an affected datastore.
            
    $ grep -B3 -i evict vmkernel.log
    2022-10-06T22:09:37.898Z cpu67:2097774)Res3: 2572: Failed to lock cluster 17 (typeID 6) after 10 tries, aborting: caller 0x4200086850f4 vol TEST
    2022-10-06T22:09:37.898Z cpu67:2097774)WARNING: Vol3: 2848: 'TEST': Failed to clear journal address since JBC could not be Locked. This could result in leak of journal block at <type 6 addr 33554449>.
    2022-10-06T22:09:37.898Z cpu67:2097774)WARNING: Vol3: 2916: 'TEST': Failed to clear journal address in on-disk HB. This could result in leak of journal block at <type 6 addr 33554449>.
    2022-10-06T22:09:37.898Z cpu67:2097774)Vol3: 4191: Error closing the volume: Failure. Eviction fails.

    Using an SSH session, change directory to /vmfs/volumes/TEST. The SSH session accessing this directory then keeps the volume open and works around the issue.
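    The same reference can be held from a background process so it survives the interactive SSH session. This is a hedged sketch, not from the original article: the datastore path assumes 'TEST' as in the log example, and the sleep duration is arbitrary.

    ```shell
    # Hypothetical path: substitute the affected datastore's name for TEST.
    DS="/vmfs/volumes/TEST"
    # A backgrounded shell whose working directory is the datastore keeps a
    # reference to the volume open, so its OC entry stays referenced:
    ( cd "$DS" && exec sleep 86400 ) &
    echo "holder pid: $!"
    # Release the reference later with: kill <holder pid>
    ```

    Remember to kill the holder process once the host is patched, so the volume can be closed normally.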