vSAN health service crashes and generates many core.vsanvcmgmtd-wor.** core files, filling up /storage/core

Article ID: 390978


Updated On:

Products

VMware vCenter Server

Issue/Introduction

  • The vCenter VAMI UI shows a critical error, stating:

File system /storage/core has run out of storage space. Increase the size of disk /storage/core.

  • When reviewing the /storage/core partition (e.g. by running 'ls -ahl /storage/core'), a large number of core.vsanvcmgmtd-wor.** files can be found, consuming the majority of the space in the partition (see the verification sketch after this list).
  • The vSAN health management log, vsanvcmgmtd.log, contains backtraces like this one:
    YYYY-MM-DDThh:mm:ss.199Z verbose vsanvcmgmtd[47698] [vSAN@6876 sub=PyBackedMO opId=76aa2907]  Enter vim.cluster.VsanPerformanceManager.queryVsanPerf, Pending: 77
    YYYY-MM-DDThh:mm:ss.231Z panic vsanvcmgmtd[47733] [vSAN@6876 sub=Default opId=76a9de1c]
    -->
    --> Panic: 8541433253169: vim.cluster.VsanVcDiskManagementSystem.queryDiskMappings cannot be completed in 2700 seconds by thread 47677
    --> Backtrace:
    --> [backtrace begin] product: vsanvcmgmtd, version: 7.0.3, build: build-24322028, tag: vsanvcmgmtd, cpu: x86_64, os: linux, buildType: release
    --> backtrace[00] libvmacore.so[0x0037DB8B]
    --> backtrace[01] libvmacore.so[0x002C79C5]: Vmacore::System::Stacktrace::CaptureFullWork(unsigned int)
    --> backtrace[02] libvmacore.so[0x002D6C5B]: Vmacore::System::SystemFactory::CreateBacktrace(Vmacore::Ref<Vmacore::System::Backtrace>&)
    --> backtrace[03] libvmacore.so[0x00370CD7]
    --> backtrace[04] libvmacore.so[0x00370DF3]: Vmacore::PanicExit(char const*)
    --> backtrace[05] libPyCppVmomi.so[0x0003663F]
    --> backtrace[06] libvmacore.so[0x0023B390]
    --> backtrace[07] libvmacore.so[0x00234A37]
    --> backtrace[08] libvmacore.so[0x00239F75]
    --> backtrace[09] libvmacore.so[0x003765C0]
    --> backtrace[10] libpthread.so.0[0x00007F87]
    --> backtrace[11] libc.so.6[0x000F36BF]
    --> backtrace[12] (no module)
    --> [backtrace end]
  • Additionally, the log contains "vSAN health service query ESXi node info" tasks that take a long time to complete (e.g. 2581929 ms):
    YYYY-MM-DDThh:mm:ss.xxZ info vsanvcmgmtd[32768] [vSAN@6876 sub=vmomi.soapStub[1] opId=b55c8e45] SOAP request returned HTTP failure; <SSL(<io_obj p:0x00007f30e8196078, h:315, <TCP '127.0.0.1 : 43220'>, <TCP '127.0.0.1 : 80'>>), /sdk>, method: fetchVsanSharedSecret; code: 500(Internal Server Error); fault: (vmodl.fault.SystemError) {
    -->    faultCause = (vmodl.MethodFault) null,
    -->    faultMessage = <unset>,
    -->    reason = ""
    -->    msg = "Received SOAP response fault from [<SSL(<io_obj p:0x00007f30e8196078, h:315, <TCP '127.0.0.1 : 43220'>, <TCP '127.0.0.1 : 80'>>), /sdk>]: fetchVsanSharedSecret
    --> A general system error occurred: "
    --> }
    YYYY-MM-DDThh:mm:ss.xxZ warning vsanvcmgmtd[50162] [vSAN@6876 sub=HostMgr.host-xxxx opId=b55b1f31] Caught exception while recover session. EX: N5Vmomi5Fault11SystemError9ExceptionE(Fault cause: vmodl.fault.SystemError
    --> )
    YYYY-MM-DDThh:mm:ss.xxZ error vsanvcmgmtd[50162] [vSAN@6876 sub=Py2CppStub opId=b55b1f31]  EExit host-xxxx::vim.cluster.VsanPerformanceManager.queryNodeInformation (2581929 ms)
    YYYY-MM-DDThh:mm:ss.xxZ warning vsanvcmgmtd[50162] [vSAN@6876 sub=Py2CppStub opId=b55b1f31] Exception while invoking VMOMI method 'host-5058::vim.cluster.VsanPerformanceManager.queryNodeInformation': N7Vmacore4Http13HttpExceptionE(HTTP error response: Service Unavailable)
  • Furthermore, the vSAN health service log, vmware-vsan-health-service.log, contains errors similar to this one:

YYYY-MM-DDThh:mm:ss.xxZ ERROR vsan-mgmt[42112] [VsanVcPerformanceManagerImpl::PerHostThreadMain opID=b55b2616] hostname.domain.xxx: Exception: 503 Service Unavailable
Traceback (most recent call last):
  File "bora/vsan/perfsvc/vpxd/vpxdPyMo/VsanVcPerformanceManagerImpl.py", line 136, in PerHostThreadMain
  File "/usr/lib/vmware/site-packages/pyVmomi/VmomiSupport.py", line 595, in <lambda>
    self.f(*(self.args + (obj,) + args), **kwargs)
  File "/usr/lib/vmware/site-packages/pyVmomi/VmomiSupport.py", line 385, in _InvokeMethod
    return self._stub.InvokeMethod(self, info, args)
http.client.HTTPException: 503 Service Unavailable

  • Reviewing the vmkernel.log of the ESXi host mentioned in vmware-vsan-health-service.log, you see memory admission failures similar to the following:
    YYYY-MM-DDThh:mm:ss.xxZ cpu33:238592230)MemSched: 14642: uw.238461124 (1706901938) extraMin/extraFromParent: 256/256, vsanperfsvc (56437) childEmin/eMinLimit: 38908/38912
    YYYY-MM-DDThh:mm:ss.xxZ cpu33:238592230)MemSched: 14635: Admission failure in path: vsanperfsvc/python.238461124/uw.238461124
    YYYY-MM-DDThh:mm:ss.xxZ cpu33:238592230)MemSched: 14635: Admission failure in path: vsanperfsvc/python.238461124/uw.238461124
    YYYY-MM-DDThh:mm:ss.xxZ cpu33:238592230)MemSched: 14635: Admission failure in path: vsanperfsvc/python.238461124/uw.238461124
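
To confirm the symptoms before taking action, you can check the partition usage and core file accumulation on the vCenter Server Appliance and the admission failures on the affected ESXi host. The following is a minimal verification sketch, assuming shell access to both systems; log file paths may vary by version:

    # On the vCenter Server Appliance:
    df -h /storage/core                                  # confirm the partition is (nearly) full
    ls -lhS /storage/core | head                         # largest files first; expect core.vsanvcmgmtd-wor.* files
    ls /storage/core | grep -c 'core.vsanvcmgmtd-wor'    # count the vsanvcmgmtd worker core files
    grep -c 'Panic:' /var/log/vmware/vsan-health/vsanvcmgmtd.log    # count panics in the vSAN health management log

    # On the affected ESXi host:
    grep 'Admission failure' /var/log/vmkernel.log       # memory admission failures under vsanperfsvc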


Environment

VMware vCenter 7.x

VMware vSAN 7.x

Resolution

To resolve this issue, place the affected ESXi host into maintenance mode to evacuate its virtual machines, then reboot the host.

If DRS is not configured in fully automated or partially automated mode, you must migrate the virtual machines manually before the host can enter maintenance mode.
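
These steps are normally performed from the vSphere Client. As a minimal command-line sketch, the same can be done from the host's shell, assuming SSH access and that the running virtual machines have already been evacuated; the vSAN data evacuation mode shown below is illustrative, so choose the one appropriate for your cluster:

    # Put the host into maintenance mode ("Ensure accessibility" vSAN mode shown as an example)
    esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility

    # Reboot the host; a reason string is required and is recorded in the logs
    esxcli system shutdown reboot --reason "Restart vsanperfsvc after memory admission failures"

    # After the host has come back up, exit maintenance mode
    esxcli system maintenanceMode set --enable false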