vCLS VMs are not re-created in a vSAN Cluster following a complete shutdown of a vSAN cluster

Article ID: 326304

Updated On: 10-16-2024

Products

VMware Cloud Foundation VMware vCenter Server

Issue/Introduction

Symptoms:

vCLS VMs are not re-created in a vSAN Cluster following a complete shutdown of the vSAN cluster.
This is more likely after an improper shutdown of the vSAN cluster, but it can also occur after a proper shutdown and restart procedure.
An error message is displayed in vSphere Client:

vSphere DRS functionality was impacted due to unhealthy state of vSphere Cluster Services caused by the unavailability of vSphere Cluster Service VMs. vSphere Cluster Service VMs are required to maintain the health of vSphere DRS.

When looking in the EAM MOB [https://<vc_fqdn>/eam/mob] for the cluster, the following information can be found:

Environment

VMware vCenter Server 7.0.x

VMware vCenter Server 8.0.x
VMware Cloud Foundation 4.x
VMware Cloud Foundation 5.x

Cause

When a vSAN Cluster is shut down (properly or improperly), an API call is made to EAM to disable the vCLS Agency on the cluster. In an ideal workflow, when the cluster is back online, it is marked as enabled again so that vCLS VMs can be powered on or new ones can be created, depending on the vCLS slots determined on the cluster.

When this workflow goes awry, the cluster is marked as disabled for the vCLS Agency, and none of the automated workflows mark the cluster as enabled again. As a result, no vCLS VMs are created for the cluster, and DRS remains in a non-healthy state.

The cluster is marked in a disabled state by an entry created for the cluster in the VCDB in the table vpx_ext_data.
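To check for such an entry without opening an interactive session, the lookup can be sketched as a read-only, one-shot query. This is a minimal sketch that assumes it is run in a shell on the vCenter Server Appliance; the function name list_disabled_clusters is hypothetical, and the psql path, user, database, and query are the ones used in the Resolution section. The -c option is the standard psql flag for running a single command and exiting.

```shell
#!/bin/sh
# Path to the vPostgres client on the appliance (taken from the Resolution section).
PSQL=/opt/vmware/vpostgres/current/bin/psql
# Read-only query; it lists entries but changes nothing in the VCDB.
QUERY="select * from vpx_ext_data where data_key like '%DisabledClusters%';"

# Hypothetical helper: run the query once and return to the shell.
list_disabled_clusters() {
    if [ -x "$PSQL" ]; then
        "$PSQL" -U postgres -d VCDB -c "$QUERY"
    else
        echo "psql not found at $PSQL; run this on the vCenter Server Appliance"
    fi
}

list_disabled_clusters
```

If the query returns no rows, the cluster is not marked as disabled in this table and the steps below are unlikely to apply.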

Resolution

WARNING: Take offline snapshots of all vCenter Servers in the SSO domain before running through these steps.
Incorrect changes to the VCDB can cause a catastrophic failure of the vCenter, which we may be unable to recover from.

Log in to the vSphere UI and click on the cluster in question.
From the URL, record the cluster ID. It has the form domain-cxx, where xx is a number.

In this example, the ID of the selected cluster is domain-c132.
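If the URL is long, the ID can also be picked out with a short shell helper. This is an illustrative sketch only: the function name extract_cluster_id, the host vc.example.com, and the exact URL layout are assumptions, and the URL shown by your vSphere Client version may look different.

```shell
#!/bin/sh
# Hypothetical helper: print the first "domain-c<number>" token found in a URL.
extract_cluster_id() {
    printf '%s\n' "$1" | grep -oE 'domain-c[0-9]+' | head -n 1
}

# Sample URL (assumed layout; host and UUID are placeholders).
url='https://vc.example.com/ui/app/cluster;nav=h/urn:vmomi:ClusterComputeResource:domain-c132:00000000-0000-0000-0000-000000000000/summary'

extract_cluster_id "$url"   # prints domain-c132
```

The helper prints nothing when the URL contains no cluster ID, which is a hint that an object other than a cluster is selected.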

Ensure that the Retreat Mode Advanced setting for this cluster is set to True as described in https://knowledge.broadcom.com/external/article?legacyId=80472 .
Connect to the vCenter Server Appliance managing the cluster via SSH.
Connect to the VCDB via the vPostgres shell:

/opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB
Identify the clusters that are marked as disabled:

select * from vpx_ext_data where data_key like '%DisabledClusters%';

Example output:

Delete the entry associated with the cluster ID we are working on, using the surr_key:

delete from vpx_ext_data where surr_key = <surr_key recorded above>;

Example output:

Leave the vPostgres shell:

\q

Restart all services on the vCenter Server and make sure they all come back online:

service-control --stop --all && service-control --start --all
Once all the services are back online, log in to the vSphere UI and confirm that the vCLS VMs are created for the cluster and that the vSphere Cluster Services status is healthy.
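Before logging in to the vSphere UI, the state of the services can be checked from the same SSH session. This is a minimal sketch that assumes service-control is available on the PATH, as it is on a vCenter Server Appliance; the function name services_status is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list running and stopped vCenter services after the restart.
services_status() {
    if command -v service-control >/dev/null 2>&1; then
        service-control --status --all
    else
        echo "service-control not found; run this on the vCenter Server Appliance"
    fi
}

services_status
```

Services that are still listed as stopped may only need more time to start, so it can help to repeat the check after a few minutes.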

Additional Information

vSphere Cluster Services (vCLS) in vSphere 7.0 Update 1 and newer versions (80472)

Manually Shut Down and Restart the vSAN Cluster

Impact/Risks:

WARNING: This process involves making changes to the vCenter Database. Take offline snapshots of all vCenter Servers in the SSO domain before running through the workaround steps.
