When restarting a vSAN cluster using the Cluster Shutdown and Startup Wizard (after selecting Turn off vSAN and then Restart), the operation fails at the step:
Re-enable HA on this cluster
with the error:
Wait for HA enable failed
The cluster restart does not complete. After clicking Resume Restart, the operation fails again at the same “Re-enable HA on this cluster” step with the same error.
On the ESXi hosts, the following messages can be observed in /var/run/log/lifecycle.log:
DepotCollection:### Downloading depot index.xml from http://<VC_FQDN>:9084/vum/repository/hostupdate/__micro-depot__vendor-vmw__metadata-13__index__.xml
Downloader:### Opening http://<VC_FQDN>:9084/vum/repository/hostupdate/__micro-depot__vendor-vmw__metadata-13__index__.xml for download
Downloader:### Download failed: <urlopen error [Errno -3] Temporary failure in name resolution>, 9 retry left...
Downloader:### Download failed: <urlopen error [Errno -3] Temporary failure in name resolution>, 8 retry left...
Downloader:### Download failed: <urlopen error [Errno -3] Temporary failure in name resolution>, 7 retry left...
Downloader:### Download failed: <urlopen error [Errno -3] Temporary failure in name resolution>, 6 retry left...
Downloader:### Download failed: <urlopen error [Errno -3] Temporary failure in name resolution>, 5 retry left...
Downloader:### Download failed: <urlopen error [Errno -3] Temporary failure in name resolution>, 4 retry left...
Downloader:### Download failed: <urlopen error [Errno -3] Temporary failure in name resolution>, 3 retry left...
Downloader:### Download failed: <urlopen error [Errno -3] Temporary failure in name resolution>, 2 retry left...
Downloader:### Download failed: <urlopen error [Errno -3] Temporary failure in name resolution>, 1 retry left...
DepotCollection:### Could not download from depot at http://<VC_FQDN>:9084/vum/repository/hostupdate/__micro-depot__vendor-vmw__metadata-13__index__.xml, skipping (('http://<VC_FQDN>:9084/vum/repository/hostupdate/__micro-depot__vendor-vmw__metadata-13__index__.xml', '', '<urlopen error [Errno -3] Temporary failure in name resolution>'))
* VMware vSAN cluster
* vSphere HA enabled
* Cluster Shutdown and Startup Wizard in use
* vCenter Server managed by FQDN (the PNID is an FQDN)
During the Restart operation, the cluster startup wizard attempts to re-enable vSphere HA on the cluster.
As part of this process, the ESXi hosts must resolve the vCenter Server FQDN to connect to vSphere Lifecycle Manager (vLCM) on TCP port 9084 and download the required HA-related metadata.
If DNS services are unavailable at this stage, the ESXi hosts cannot resolve the vCenter Server hostname. The metadata download fails, the “Re-enable HA on this cluster” step fails, and the cluster restart cannot complete.
One common scenario is when the DNS server is hosted within the same vSAN cluster and has not yet been powered on. However, the issue can occur in any environment where DNS is unavailable, regardless of where the DNS server is hosted.
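To confirm this on an affected host, the DNS client configuration and DNS server reachability can be checked from the ESXi Shell. This is a minimal check; <DNS_Server_IP> is a placeholder for one of the servers returned by the first command:
# List the DNS servers configured on this ESXi host
esxcli network ip dns server list
# Test reachability of a configured DNS server
vmkping <DNS_Server_IP>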
Ensure that DNS services are available and functioning before restarting the vSAN cluster.
Verify that each ESXi host can resolve the vCenter Server FQDN, for example:
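The following is a quick check from the ESXi Shell, assuming the nslookup and nc utilities are available on the host (they are included in recent ESXi builds); <VC_FQDN> is the vCenter Server FQDN:
# Confirm the vCenter Server FQDN resolves
nslookup <VC_FQDN>
# Confirm the vLCM depot port on vCenter is reachable
nc -z <VC_FQDN> 9084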
If DNS cannot be made available immediately, configure one of the following workarounds on each ESXi host to provide vCenter Server name resolution. Only one method is required; illustrative examples follow the list.
* Non-persistent workaround: Add an entry directly to /etc/hosts. This change does not persist across ESXi host reboots.
<vCenter_IP_Address> <vCenter_FQDN>
* Persistent workaround: Use the following command to add a host entry that persists across ESXi host reboots:
esxcli network ip hosts add --hostname <VC_FQDN> --ip <VC_IP>
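For illustration only, with the hypothetical values vcenter.example.com and 192.0.2.10 substituted for the placeholders:
# Non-persistent: append an entry to /etc/hosts (lost on reboot)
echo "192.0.2.10 vcenter.example.com" >> /etc/hosts
# Persistent: add the same entry via esxcli
esxcli network ip hosts add --hostname vcenter.example.com --ip 192.0.2.10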
Restart the vCenter Server.
Retry the RESTART operation.
Once name resolution is restored, the ESXi hosts can retrieve the required HA metadata and the cluster restart completes successfully.
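After DNS service is restored, it is good practice to re-confirm resolution from each host and clean up any temporary entry; the hosts file can be reviewed and edited directly:
# Re-check that the vCenter FQDN now resolves through DNS
nslookup <VC_FQDN>
# Review and, if desired, remove the temporary entry
cat /etc/hosts
vi /etc/hosts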
* vSphere HA enablement depends on DNS resolution of the vCenter Server name whenever vCenter was deployed with an FQDN as its PNID.
* Entries in /etc/hosts can serve as a temporary workaround when DNS is unavailable.
* DNS unavailability at cluster startup can cause this failure even when the DNS servers are hosted outside the cluster.