vCenter reported "Host cannot communicate with other hosts" although there was never a network partition


Article ID: 331497


Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:
  • vCenter reported the alarm "One host cannot communicate with other hosts", and after a while the error disappeared from the vSAN health plugin.
  • From the vCenter server, in vmware-vsan-health-service-991.log, you may see that vCenter could not fetch the vmknic information used for vSAN/unicast on host <HOSTNAME> during the unicast check (an example search for this warning is shown after the log line below):
2019-01-02T04:58:38.509Z WARNING vsan-health[100e4ede-0e4b-11e9] [VsanVcClusterHealthSystemImpl::_GetConsistentConfigTest] Vmknic not present on host <hostname>.cityofmarion.lan, skip testing unicast
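As a quick check, this warning can be searched for on the vCenter Server Appliance. The path below is the typical default location for the vSAN health service logs and is an assumption; the rotated file name (here -991) will differ per environment:

grep "Vmknic not present" /var/log/vmware/vsan-health/vmware-vsan-health-service*.log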
  • From the vCenter health report file: we see that the health plugin reported vmknic issues on host-1808; however, it also reported that the "Cluster partition" state was green/good (a host-side check of the vmknic configuration is shown after the report excerpt below).
2019-01-02T04:58:40.55Z INFO vsan-health[healthThread-CloudHealthSender-22884] [VsanHealthSummaryLogUtil::PrintHealthResult] Cluster COM vxRail Overall Health : red
   Group network health : red
      Test hostdisconnected health : green
      Test hostconnectivity health : red
         HostsWithCommunicationIssues: Host
        (Host-88320),
      Test clusterpartition health : green
      Test vsanvmknic health : red
        HostsWithNoVsanVmknicPresent: Host
        (Host-88320),
      Test matchingsubnet health : yellow
        VsanIpSubnetConfigurations: Host IpSubnet(S)
        (Host-88316, 10.3.1.0/24), (Host-106228, 10.3.1.0/24), (Host-88318, 10.3.1.0/24), (Host-106233, 10.3.1.0/24), (Host-88312, 10.3.1.0/24), (Host-88320, ''),
        (Host-106239, 10.3.1.0/24), (Host-88314, 10.3.1.0/24),
      Test smallping health : green
      Test largeping health : green
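To confirm on the host side whether a vSAN vmknic is actually configured (the report above shows an empty subnet for Host-88320), list the vSAN network configuration directly on the affected ESXi host. In this scenario the vmknic is expected to still be present and correctly tagged; the health plugin only failed to read it, consistent with the memory condition described below:

esxcli vsan network list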
  • You may not see any host being partitioned in clomd.log, hostd.log, or vpxa.log. clomd may never report any state change or node-drop count.
  • From vsanmgmt.0 you may see that "cmmds-tool" was reporting "Cannot allocate memory" (the failing command can be re-run manually, as shown after the log excerpt below):
2019-01-02T04:58:38Z VSANMGMTSVC: ERROR vsanperfsvc[117160c8-0e4b-11e9] [VsanStretchedClusterSystemImpl::GetStretchedClusterInfoFromCmmds] Failed to get stretched cluster info from cmmds: Running cmd ['/bin/cmmds-tool', 'find', '--format=python', '-t', 'NODE'] with error /bin/cmmds-tool: error while loading shared libraries: libpthread.so.0: failed to map segment from shared object: Cannot allocate memory
2019-01-02T04:58:38Z VSANMGMTSVC: ERROR vsanperfsvc[117160c8-0e4b-11e9] [VsanStretchedClusterSystemImpl::GetStretchedClusterInfoFromCmmds] Running cmd ['/bin/cmmds-tool', 'find', '--format=python', '-t', 'NODE'] with error /bin/cmmds-tool: error while loading shared libraries: libpthread.so.0: failed to map segment from shared object: Cannot allocate memory Traceback (most recent call last): File "/build/mts/release/bora-10390117/bora/build/esxvsan/release/vsan/usr/lib/vmware/vsan/perfsvc/VsanStretchedClusterSystemImpl.py", line 349, in GetStretchedClusterInfoFromCmmds File "/usr/lib/vmware/hostd/hmo/VsanInternalSystem.py", line 69, in _cmmds_find raise Exception('Running cmd %s with error %s' % (cmd, err)) Exception: Running cmd ['/bin/cmmds-tool', 'find', '--format=python', '-t', 'NODE'] with error /bin/cmmds-tool: error while loading shared libraries: libpthread.so.0: failed to map segment from share
2019-01-02T04:58:38Z VSANMGMTSVC: d object: Cannot allocate memory
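While the host is in this state, the same failure can typically be reproduced by running the command from the log in an ESXi shell on the affected host. On a healthy host this should return the CMMDS NODE entries (one per cluster member); under this memory pressure it instead fails to load its shared libraries:

/bin/cmmds-tool find --format=python -t NODE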
  • vmkwarning.log reports cmmds-tool failing with "Admission check failed for memory resource" (an example search is shown after the log line below):
2019-01-02T04:59:59.847Z cpu16:64944763)WARNING: User: 4530: cmmds-tool: Error in initial cartel setup: Failed late cartel initialization: Admission check failed for memory resource
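On the affected host, these admission-check failures can be located with a simple search of the VMkernel warning log (the default /var/log location is assumed):

grep "Admission check failed for memory resource" /var/log/vmkwarning.log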
  • From vsansystem.log: the cmmds module is in an inaccessible/busy state. This is due to the out-of-memory condition reported in the events above.
  • Inspecting vsansystem.log further, we see that, because of the memory exhaustion, the membership count could not be retrieved.
  • As a result, the node count was reported as "nodeCount: 0", which raised the network health alarm in vCenter. This is a false-positive alert; there was no actual network partition (a way to verify the real membership is shown after the log excerpt below).
2019-01-02T04:58:32.462Z info vsansystem[A2E0BEB700] [Originator@6876 sub=Libs opID=0d939440-0e4b-11e9] ModuleImpl Refresh: Module became inaccessible: cmmds
2019-01-02T04:58:32.464Z error vsansystem[A2E0BEB700] [Originator@6876 sub=VsanSystemProvider opID=0d939440-0e4b-11e9] Error querying host status: Unable to load module /usr/lib/vmware/vmkmod/cmmds: Busy
2019-01-02T04:58:33.292Z info vsansystem[A2E001A700] [Originator@6876 sub=VsanSystemProvider opID=CMMDSAccessUpdate-a6b0] Timer fired: posting access generation update
2019-01-02T04:58:33.292Z info vsansystem[A2E001A700] [Originator@6876 sub=VsanSystemProvider opID=CMMDSAccessUpdate-a6b0] Posting VSAN access generation update (genNo: '1143', batched: 20, now: 1899377715332, lastUpdate: 1899316714310)
2019-01-02T04:58:33.293Z info vsansystem[A2E01E0700] [Originator@6876 sub=Libs opID=CMMDSAccessUpdate-a6b0] ModuleImpl Refresh: Module became inaccessible: cmmds
2019-01-02T04:58:33.294Z error vsansystem[A2E01E0700] [Originator@6876 sub=VsanSystemProvider opID=CMMDSAccessUpdate-a6b0] Error retrieving membership list: Unable to load module /usr/lib/vmware/vmkmod/cmmds: Busy
2019-01-02T04:58:33.294Z info vsansystem[A2E01E0700] [Originator@6876 sub=VsanSystemProvider opID=CMMDSAccessUpdate-a6b0] Complete, nodeCount: 0, runtime info: (vim.vsan.host.VsanRuntimeInfo) {
--> membershipList = <unset>,
--> diskIssues = <unset>,
--> accessGenNo = 1143
--> }
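To confirm that the cluster is not actually partitioned while the alarm is active, the CMMDS membership can be queried directly from any ESXi host in the cluster; fields such as "Sub-Cluster Member Count" and "Sub-Cluster Member UUIDs" should still list every host:

esxcli vsan cluster get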


  • From vsanmgmt.0: memory pressure was being reported for vsanperfsvc well before the issue window and during it (an example search is shown after the log excerpt below).

2019-01-02T03:35:38Z VSANMGMTSVC: INFO vsanperfsvc[MainThread] [statsdaemon::_logDaemonMemoryStats] Daemon memory stats: eMin=152.132MB, eMinPeak=155.648MB, rMinPeak=157.820MB MEMORY PRESSURE
2019-01-02T03:35:38Z VSANMGMTSVC: INFO vsanperfsvc[MainThread] [statsdaemon::_logDaemonMemoryStats] 'python.59873341' memory stats: eMin=152.132MB, eMinPeak=155.284MB, rMinPeak=155.916MB MEMORY PRESSURE
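These memory-pressure messages can be found quickly on the host (the default /var/log location is assumed; rotated copies such as vsanmgmt.0 may need to be decompressed first):

grep "MEMORY PRESSURE" /var/log/vsanmgmt.log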


Environment

VMware vSAN 6.6.x

Cause

The partition and vSAN health warnings are all caused by the vSAN management daemon running out of memory.

Resolution

The out-of-memory condition in the vSAN management daemon is fixed in 6.5 U2 Patch 03.

Workaround:
There is currently no workaround. This is a false-positive error that may report a network partition even though none exists.
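To check whether a host already contains the fix, compare its installed version and build against the 6.5 U2 Patch 03 release notes (the exact build number should be taken from the release notes):

vmware -vl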