After upgrade of vCenter 7, a lot of clusters have HA in an unconfigured state / failing

Article ID: 367744

Updated On:

Products

VMware vCenter Server

Issue/Introduction

During or after the upgrade of a vCenter 7.0.x, many hosts remain in an unconfigured HA state and are marked as failing. The task list fills up with attempts to update the ESXi hosts. This behavior is specific to large environments (more than 400 hosts) that use vSphere Lifecycle Manager (vLCM).

Environment

vSphere 7.0

vSphere 7.0.3

Cause

The root cause is that the vCenter manages too many hosts: the environment exceeds the documented configuration maximum for vLCM-managed hosts, which is 400.

The sequence of events is as follows:

  1. The vCenter update includes an update to FDM (Fault Domain Manager, the HA agent). That package also contains the host-side part of FDM (vsphere-ha.zip) for deployment on each host.
  2. FDM is restarted, which triggers the deployment of vsphere-ha.zip into the offline depot.
  3. While the zip file itself is deployed the classic way, the update of the hosts (also triggered by the FDM restart) involves a cluster image remediation.
  4. During this remediation, more than 400 hosts query vLCM, which can only handle 400 concurrent requests.
  5. Consequently, the excess hosts get no answer, time out in a retry loop, and eventually fail the task.

Resolution

If you are aware of the issue before the vCenter update:

  • Disable HA on all clusters prior to the update.
  • Perform the update.
  • Afterwards, re-enable HA for the clusters in batches of fewer than 400 hosts at a time.

If the situation arises during or after the update:

  • Disable HA on all clusters.
  • If the task list does not empty within a reasonable time, reboot the vCenter in the usual fashion. This clears the task list.
  • Then re-enable HA for the clusters, but not all at once. Enable them in batches:
    • Select only as many clusters for a batch as hold no more than 400 hosts in total.
    • Wait until HA is enabled for that batch, then run the next batch.
    • This ensures that the vLCM limit of 400 concurrent hosts is not exceeded.
  • Expect a higher level of vMotion activity for a while, because HA was heavily disrupted. Performance might be temporarily impacted by congestion in the clusters, but the stability of the system should not be at risk.
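
The batching step can be planned ahead of time. The following is a minimal sketch in Python, not an official tool: it groups clusters into batches whose combined host count stays at or below the 400-host vLCM limit named in this article. The function name `plan_ha_batches`, the cluster names, and the host counts are hypothetical, and the first-fit-decreasing packing is only one possible way to form batches. The sketch does not talk to vCenter; the per-cluster host counts have to be collected separately.

```python
from typing import Dict, List

# vLCM limit for concurrently handled hosts in vSphere 7, as stated in this article.
VLCM_HOST_LIMIT = 400


def plan_ha_batches(host_counts: Dict[str, int],
                    limit: int = VLCM_HOST_LIMIT) -> List[List[str]]:
    """Group clusters into batches that hold at most `limit` hosts in total.

    `host_counts` maps a cluster name to its number of hosts.
    Uses first-fit decreasing: the largest clusters are placed first and
    smaller ones fill the remaining room in earlier batches.
    """
    batches: List[List[str]] = []   # cluster names per batch
    totals: List[int] = []          # running host total per batch

    ordered = sorted(host_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ordered:
        if count > limit:
            # A single cluster above the limit cannot be placed in any batch.
            raise ValueError(f"cluster {name!r} has {count} hosts, above the limit of {limit}")
        for i, total in enumerate(totals):
            if total + count <= limit:
                batches[i].append(name)
                totals[i] += count
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append([name])
            totals.append(count)
    return batches


# Hypothetical inventory: 10 clusters, 528 hosts in total.
example_inventory = {
    "cl-01": 64, "cl-02": 64, "cl-03": 64, "cl-04": 64, "cl-05": 64,
    "cl-06": 64, "cl-07": 64, "cl-08": 32, "cl-09": 32, "cl-10": 16,
}

if __name__ == "__main__":
    for number, batch in enumerate(plan_ha_batches(example_inventory), start=1):
        hosts = sum(example_inventory[c] for c in batch)
        print(f"batch {number}: {hosts} hosts -> {', '.join(batch)}")
```

For the hypothetical inventory above, this yields two batches (400 and 128 hosts). Re-enable HA for the first batch, wait until it has completed, then continue with the second.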

Finish the update by checking all clusters for hosts that did not return to "HA enabled". Placing those hosts into maintenance mode and taking them back out should bring everything back to normal.
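
The final check can be sketched as a simple filter over per-host HA states. This is a hedged illustration in Python, not part of the documented procedure: it assumes the states were already collected as strings, for example from the vSphere API host property `runtime.dasHostState.state`, and it assumes that `master` and `connectedToMaster` are the values that indicate a working HA agent. The function name and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Assumption: these state values mean the HA agent on the host is working.
HEALTHY_HA_STATES = {"master", "connectedToMaster"}


def hosts_needing_attention(
    host_states: Iterable[Tuple[str, str, Optional[str]]],
) -> Dict[str, List[str]]:
    """Return {cluster: [hosts]} for hosts whose HA state is not healthy.

    Each input item is (cluster name, host name, HA state); a state of
    None stands for a host that reports no HA state at all.
    """
    pending: Dict[str, List[str]] = {}
    for cluster, host, state in host_states:
        if state not in HEALTHY_HA_STATES:
            pending.setdefault(cluster, []).append(host)
    return pending


# Hypothetical sample data for illustration only.
sample_states = [
    ("cl-01", "esx-001", "master"),
    ("cl-01", "esx-002", "connectedToMaster"),
    ("cl-01", "esx-003", "uninitialized"),
    ("cl-02", "esx-101", None),
    ("cl-02", "esx-102", "connectedToMaster"),
]

if __name__ == "__main__":
    for cluster, hosts in hosts_needing_attention(sample_states).items():
        print(f"{cluster}: maintenance mode candidates -> {', '.join(hosts)}")
```

The hosts returned by the filter are the candidates for the maintenance-mode cycle described above.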

Additional Information

The described issue is not a defect. It occurs when an environment exceeds the documented configuration maximum of 400 hosts per vCenter when using vLCM (vSphere 7). The resolution described above is only a workaround in case such a misconfiguration occurs. The proper resolution is to upgrade to vSphere 8, where the limit for vLCM-managed hosts has been raised from 400 to 1000.