vCLS VMs are not re-created in a vSAN Cluster following a complete shutdown of a vSAN cluster

Article ID: 326304

Products

VMware Cloud Foundation
VMware vCenter Server

Issue/Introduction

This article describes how to reset the status of the cluster and re-enable the automatic creation of vCLS VMs.

Symptoms:
  • vCLS VMs are not re-created in a vSAN Cluster following a complete shutdown of the vSAN cluster.

  • This is more likely after an improper shutdown of the vSAN cluster, but it can also occur after a proper shutdown and restart procedure.

  • The vSphere Client displays the following error message:

vSphere DRS functionality was impacted due to unhealthy state of vSphere Cluster Services caused by the unavailability of vSphere Cluster Service VMs. vSphere Cluster Service VMs are required to maintain the health of vSphere DRS.

  • The EAM MOB (https://<vc_fqdn>/eam/mob) shows the following information for the cluster:

[Screenshots: EAM MOB details for the cluster]


Environment

VMware vCenter Server 8.0.x
VMware vCenter Server 7.0.x
VMware Cloud Foundation 4.x
VMware Cloud Foundation 5.x

Cause

When a vSAN cluster is shut down (properly or improperly), an API call is made to EAM to disable the vCLS Agency on the cluster. In the ideal workflow, the cluster is marked as enabled again once it is back online, so that the existing vCLS VMs can be powered on or new ones can be created, depending on the vCLS slots determined for the cluster.

When this workflow goes awry, the cluster remains marked as disabled for the vCLS Agency, and none of the automated workflows mark it as enabled again. As a result, no vCLS VMs are created for the cluster, and DRS remains in an unhealthy state.

The disabled state is recorded as an entry for the cluster in the vpx_ext_data table of the VCDB.

Resolution

WARNING: Please take offline snapshots of all vCenter Servers in the SSO domain before running through these steps.
Incorrect changes to the VCDB can cause a catastrophic failure of the vCenter, which we may not be able to recover from.

  1. Log in to the vSphere UI and click the affected cluster.

  2. Record the cluster ID from the browser URL. It has the form domain-c<number>.

[Screenshot: vSphere Client URL containing the cluster ID]

In the example above, for the selected cluster, the ID is domain-c132.
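
If the URL has been copied to a terminal, the cluster ID can also be extracted there. The following is a minimal sketch, assuming the ID appears in the URL literally as domain-c followed by digits. The function name extract_cluster_id and the sample URL (host name and trailing GUID) are illustrative and not part of the product.

```shell
#!/bin/sh
# Hypothetical helper: print the first "domain-c<number>" token found in a URL.
# Assumes the cluster's managed object ID appears literally in the URL.
extract_cluster_id() {
    # -o prints only the matching text, -E enables extended regular expressions,
    # head -n 1 keeps the first match if the ID appears more than once.
    printf '%s\n' "$1" | grep -oE 'domain-c[0-9]+' | head -n 1
}

# Illustrative URL only; the host name and GUID are made up.
url='https://vc.example.com/ui/app/cluster;nav=h/urn:vmomi:ClusterComputeResource:domain-c132:00000000-0000-0000-0000-000000000000/summary'
extract_cluster_id "$url"
```

The function prints nothing when the URL contains no cluster ID, so an empty result means the wrong object was selected in the vSphere UI.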
 

  3. Ensure that the Retreat Mode advanced setting for this cluster (config.vcls.clusters.domain-c<number>.enabled) is set to True, as described in https://kb.vmware.com/s/article/80472

  4. Connect via SSH to the vCenter Server Appliance that manages the cluster.

  5. Connect to the VCDB via the vPostgres shell:

# /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB

 

  6. Identify the clusters that are marked as disabled:

# select * from vpx_ext_data where data_key like '%DisabledClusters%';

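The output will look something like the following. Record the surr_key of the row that belongs to the cluster ID recorded earlier:

[Screenshot: query output listing the disabled clusters]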
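
The read-only query in this step can also be run without entering the interactive shell, by passing it to psql with the -c option. The following is a minimal sketch under that assumption. The wrapper name vcdb_query and the PSQL override variable are hypothetical; the psql path, user, and database name are taken from the step above.

```shell
#!/bin/sh
# Path to psql as used in this article; override PSQL to point elsewhere,
# for example at "echo" for a dry run that only prints the arguments.
PSQL="${PSQL:-/opt/vmware/vpostgres/current/bin/psql}"

# Hypothetical wrapper: run a single SQL statement against VCDB and return.
vcdb_query() {
    "$PSQL" -U postgres -d VCDB -c "$1"
}

# Example, to be run on the vCenter Server Appliance:
# vcdb_query "select * from vpx_ext_data where data_key like '%DisabledClusters%';"
```

Keeping the statement in a single quoted argument avoids shell expansion of the % and * characters.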

 

  7. Delete the entry associated with the affected cluster ID, using its surr_key:

# delete from vpx_ext_data where surr_key = <surr_key recorded above>;

In our example:

[Screenshot: delete statement run for the example cluster]
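
Because the delete cannot be undone without restoring the snapshot, it can help to validate the key before building the statement. The following is a minimal sketch, assuming surr_key is a plain integer, as the unquoted value in the statement above suggests. The function name build_delete_sql and the key 12345 are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print a DELETE statement for one vpx_ext_data row.
# Assumes surr_key is a plain integer; any other input is rejected.
build_delete_sql() {
    case "$1" in
        ''|*[!0-9]*)
            # Empty input or any non-digit character: refuse to build SQL.
            echo "surr_key must be a non-empty integer" >&2
            return 1
            ;;
    esac
    printf 'delete from vpx_ext_data where surr_key = %s;\n' "$1"
}

# 12345 is a placeholder, not a real key.
build_delete_sql 12345
```

The function only prints the statement; it still has to be reviewed and run in the vPostgres shell by hand.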

 

  8. Leave the vPostgres shell:

\q

 

  9. Restart all services on the vCenter Server and confirm that they all come back online:

# service-control --stop --all && service-control --start --all



Once all the services are back online, log in to the vSphere UI and confirm that the vCLS VMs are created for the cluster and that the vSphere Cluster Services status is healthy.
 


Additional Information

vSphere Cluster Services (vCLS) in vSphere 7.0 Update 1 and newer versions (80472)
https://kb.vmware.com/s/article/80472


Manually Shut Down and Restart the vSAN Cluster
https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vsan-monitoring.doc/GUID-31B4F958-30A9-4BEC-819E-32A18A685688.html


Impact/Risks:

WARNING: This process involves making changes to the vCenter Server database (VCDB).
Please take offline snapshots of all vCenter Servers in the SSO domain before running through the workaround steps.