vSAN Health Service - Cluster Health – vSAN daemon liveness check

Products

VMware vSAN

Issue/Introduction

This article explains the Cluster Health – vSAN daemon liveness check (formerly the CLOMD liveness check) in the vSAN Health Service and describes why it might report an error.

CLOMD (Cluster Level Object Manager Daemon) plays a key role in the operation of a vSAN cluster. It runs on every ESXi host and is responsible for creating new objects, initiating repair of existing objects after failures, performing all types of data moves and evacuations (for example, Enter Maintenance Mode and evacuating data when a disk is removed from vSAN), maintaining balance and triggering rebalancing when needed, and implementing policy changes.

It does not actually participate in the data path, but it triggers data path operations and as such is a critical component during a number of management workflows and failure handling scenarios.

Virtual machine power-on and Storage vMotion to vSAN are two less obvious operations that require CLOMD: both create a swap object, and object creation requires CLOMD.

Similarly, starting with vSAN 6.0, memory snapshots are maintained as objects, so taking a snapshot that includes the memory state also requires CLOMD.

EPD (Entry Persistence Daemon) is a user space daemon that runs on every host that is part of the vSAN cluster. The main job of EPD is to make sure there is no component leakage when objects are deleted.

CMMDSD (Cluster Monitoring, Membership, and Directory Service Daemon) is a daemon that persists the directory contents of CMMDS (Cluster Monitoring, Membership, and Directory Service). It loads the CMMDS user world process and provides an interface to CMMDS. CMMDS is responsible for monitoring the links to the cluster and acts as the primary distribution fabric for cluster metadata. It is also responsible for maintaining the state of cluster health and network links. Other modules use this information to determine which nodes are part of the cluster and which interfaces on those nodes are healthy.

As of vSAN 8.0 U2, OSFSD and CMMDSTIMEMACHINED are included in this health check. In versions earlier than 8.0 U2, they do not appear in the health check.

OSFSD (Object Store File System Daemon) is a daemon that runs on the ESXi host and provides a distributed file system for storing and managing virtual machine data. It enables vSAN to offer file services in addition to its primary storage functionality.

CMMDSTIMEMACHINED (Cluster Monitoring, Membership, and Directory Service Time Machine Daemon) is a daemon that runs on the ESXi host and maintains historical metadata records for object versions. It enables the vSAN cluster to recover from metadata inconsistencies by providing a way to roll back metadata changes to a previous version.

Environment

VMware vSAN 6.x

VMware vSAN 7.x

VMware vSAN 8.x

Resolution

Q: What does the “Cluster health – vSAN daemon liveness check (formerly vSAN CLOMD liveness check)” check do?

It checks if CLOMD, EPD, CMMDSD, OSFSD (added in 8.0 U2), and CMMDSTIMEMACHINED (added in 8.0 U2) are alive or not. For CLOMD, it does so by first checking that the service is running on all ESXi hosts, and then contacting the service to retrieve run-time statistics to verify that CLOMD can respond to inquiries. For EPD, CMMDSD, OSFSD, and CMMDSTIMEMACHINED, it checks whether the service is running properly on all ESXi hosts.

Note: This check does not ensure that all of the functions discussed above (for example, object creation and rebalancing) actually work, but it gives a first-level assessment of the health of the CLOMD, EPD, CMMDSD, OSFSD, and CMMDSTIMEMACHINED services.

Q: What does it mean when it is in an error state?

This test performs only a very basic check that the vSAN daemons are still running, so a daemon may have issues even when the check passes. If the check reports an error, one or more of the CLOMD, EPD, CMMDSD, OSFSD, and CMMDSTIMEMACHINED services is not working as expected and must be checked on the affected ESXi host.

A good way to further probe into CLOMD health is to perform a virtual machine creation test (Proactive tests), as this involves object creation that will exercise and test CLOMD thoroughly.
For more information about this issue, refer to the following article: CLOM Daemon Liveness Check

Q: How does one troubleshoot and fix the error state?

For standard clusters, all of the services (CLOMD, EPD, CMMDSD, OSFSD, and CMMDSTIMEMACHINED) should be running on all nodes in the cluster.

For stretched clusters and metadata clusters, the table below shows whether each service is expected to be running on the respective node type:

Daemon	Data node of stretched cluster	Witness node of stretched cluster	Data node of metadata cluster	Metadata node of metadata cluster
CLOMD	Yes	No	Yes	Yes
EPD	Yes	No	Yes	No
CMMDSD	Yes	Yes	Yes	Yes
OSFSD	Yes	No	Yes	No
CMMDSTIMEMACHINED	Yes	No	Yes	No
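
The table above can be encoded as a small lookup, which may help when scripting checks across mixed node types. This is a minimal sketch, not part of the vSAN tooling: the function name daemon_expected and the role labels (stretched-data, stretched-witness, metadata-cluster-data, metadata-cluster-metadata) are hypothetical names chosen for this example.

```shell
#!/bin/sh
# Hypothetical helper that mirrors the table above.
# Role labels are invented for this sketch; they are not vSAN identifiers.
# Prints "Yes" or "No"; returns 2 for an unknown daemon or role.
daemon_expected() {
    daemon=$1
    role=$2
    case "$role" in
        stretched-data|metadata-cluster-data)
            # Data nodes are expected to run every checked daemon.
            case "$daemon" in
                CLOMD|EPD|CMMDSD|OSFSD|CMMDSTIMEMACHINED) echo Yes ;;
                *) return 2 ;;
            esac ;;
        stretched-witness)
            # Witness nodes are expected to run CMMDSD only.
            case "$daemon" in
                CMMDSD) echo Yes ;;
                CLOMD|EPD|OSFSD|CMMDSTIMEMACHINED) echo No ;;
                *) return 2 ;;
            esac ;;
        metadata-cluster-metadata)
            # Metadata nodes are expected to run CLOMD and CMMDSD.
            case "$daemon" in
                CLOMD|CMMDSD) echo Yes ;;
                EPD|OSFSD|CMMDSTIMEMACHINED) echo No ;;
                *) return 2 ;;
            esac ;;
        *) return 2 ;;
    esac
}
```

For example, daemon_expected EPD stretched-witness prints No, which corresponds to a daemon that the health check does not evaluate on that node.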

The status of a daemon that is not checked on an ESXi host is shown as “--”.

If any of the CLOMD, EPD, CMMDSD, OSFSD, and CMMDSTIMEMACHINED services is not running on a particular ESXi host, the status of that service on that host is Abnormal.

For this test to succeed, the health service must be installed on the ESXi host and the CLOMD, EPD, CMMDSD, OSFSD, and CMMDSTIMEMACHINED services must be running. To get the status of these services on the ESXi host, run this command:

/etc/init.d/cmmdsd status && /etc/init.d/epd status && /etc/init.d/clomd status && /etc/init.d/cmmdsTimeMachine status && /etc/init.d/osfsd status && /etc/init.d/vsanmgmtd status
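
Because the command above chains the calls with &&, it may stop at the first call that returns a non-zero exit code and leave the remaining daemons unreported. The following sketch queries each init script independently instead. It assumes that each script's status action exits non-zero when the daemon is not running, which should be confirmed on the host. The function name check_vsan_daemons and the INIT_DIR variable are hypothetical names introduced for this example; the script names are taken from the command above.

```shell
#!/bin/sh
# Directory holding the init scripts; overridable so the sketch can be tried elsewhere.
INIT_DIR="${INIT_DIR:-/etc/init.d}"
# Init script names as used in the command above.
VSAN_DAEMONS="cmmdsd epd clomd cmmdsTimeMachine osfsd vsanmgmtd"

# Query every daemon, even if an earlier one is not running.
# ASSUMPTION: "<script> status" exits non-zero when the daemon is not running.
# Returns 0 only if every script exists and its status call succeeds.
check_vsan_daemons() {
    failed=0
    for d in $VSAN_DAEMONS; do
        if [ ! -x "$INIT_DIR/$d" ]; then
            echo "$d: init script not found"
            failed=1
        elif "$INIT_DIR/$d" status >/dev/null 2>&1; then
            echo "$d: running"
        else
            echo "$d: NOT running"
            failed=1
        fi
    done
    return $failed
}
# Usage: source this file on the host, then run: check_vsan_daemons
```

The function prints one line per daemon, so a single stopped service does not hide the state of the others.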

If any daemon is not running, restart the services by running this command on the ESXi host:

/etc/init.d/cmmdsd restart && /etc/init.d/epd restart && /etc/init.d/clomd restart && /etc/init.d/cmmdsTimeMachine restart && /etc/init.d/osfsd restart && /etc/init.d/vsanmgmtd restart
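
The command above restarts every daemon in sequence. Where restarting only the stopped daemons is preferred, a more conservative variant could look like the sketch below. It relies on the same unverified assumption that the status action exits non-zero for a stopped daemon, and the names restart_stopped_vsan_daemons and INIT_DIR are hypothetical.

```shell
#!/bin/sh
# Directory holding the init scripts; overridable so the sketch can be tried elsewhere.
INIT_DIR="${INIT_DIR:-/etc/init.d}"
# Init script names as used in the restart command above.
VSAN_DAEMONS="cmmdsd epd clomd cmmdsTimeMachine osfsd vsanmgmtd"

# Restart only the daemons whose status call fails, then re-check each one.
# ASSUMPTION: "<script> status" exits non-zero when the daemon is not running.
# Returns 0 if no restarted daemon is still down afterwards.
restart_stopped_vsan_daemons() {
    still_down=0
    for d in $VSAN_DAEMONS; do
        script="$INIT_DIR/$d"
        # Skip scripts that are absent on this host.
        [ -x "$script" ] || continue
        # Leave daemons that already report as running untouched.
        "$script" status >/dev/null 2>&1 && continue
        echo "$d: not running, restarting"
        "$script" restart || echo "$d: restart command failed"
        if ! "$script" status >/dev/null 2>&1; then
            echo "$d: still not running after restart"
            still_down=1
        fi
    done
    return $still_down
}
# Usage: source this file on the host, then run: restart_stopped_vsan_daemons
```

A non-zero return value indicates that at least one daemon did not come back after the restart attempt.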

If the vSAN daemon liveness check still fails after these steps, or if it continues to fail on a regular basis, open a support request with VMware Support. For more information, see Creating and managing Broadcom support cases.