VMware Cloud Foundation for Service Providers updating a host using LCM fails since the cluster has initiated an HA Action
book
Article ID: 314616
calendar_today
Updated On:
Products
VMware Cloud Foundation
Issue/Introduction
Symptoms:
Updating a host using VMware Cloud Foundation for Service Providers 2.4.x lifecycle management (LCM) fails with an upgradeStatus of COMPLETED_WITH_FAILURE and an error similar to the following:
"metadata": "\nThe ESXi cluster is not healthy (green).\nUpgrade failed. Auto-recovery attempt failed as well. Manual intervention needed.\nCheck for errors in the lcm log files located on server 127.0.0.1 under /home/vrack/lcm/logs/\nLCM will bring the domain back online once problms found above steps are fixed manually. Please retry the upgrade once the upgrade is available again."
In the log /home/vrack/lcm/logs/lcm.log, the cluster status is not green due to a cluster config issue similar to the following:
2017-05-22 23:28:47.269 [ThreadPoolTaskExecutor-3] DEBUG [com.vmware.evo.sddc.lcm.primitive.impl.esx.ClusteredEsxUpdateClient] upgradeId=92b92aa8-####-####-####-########069a,resourceType=ESX_HOST,resourceId=0513e5ea-4ac1-4664-ba64-125d6aea975d,bundleId=3c13036f-####-####-####-########b93,bundleElementId=e1fd15e1-####-####-####-########2a6 Cluster config status: yellow 2017-05-22 23:28:47.269 [ThreadPoolTaskExecutor-3] ERROR [com.vmware.evo.sddc.lcm.primitive.impl.esx.ClusteredEsxUpdateClient] upgradeId=92b92aa8-####-####-####-########069a,resourceType=ESX_HOST,resourceId=0513e5ea-####-####-####-########75d,bundleId=3c13036f-####-####-####-########b93,bundleElementId=e1fd15e1-####-####-####-########2a6 Cluster status is not green; update cannot proceed. 2017-05-22 23:28:47.269 [ThreadPoolTaskExecutor-3] ERROR [com.vmware.evo.sddc.lcm.primitive.impl.esx.ClusteredEsxUpdateClient] upgradeId=92b92aa8-####-####-####-########69a,resourceType=ESX_HOST,resourceId=0513e5ea-####-####-####-########75d,bundleId=3c13036f-####-####-####-########b93,bundleElementId=e1fd15e1-####-####-####-########2a6 Cluster Config Issue: vSphere HA failover operation in progress in cluster ProdCluster1 in datacenter Datacenter1: 0 VMs being restarted, 2 VMs waiting for a retry, 0 VMs waiting for resources, 4 inaccessible Virtual SAN VMs
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment as well as the HA notification. The following is an additional example:
Cluster config Issue: vSphere HA initiated a failover action on 1 virtual machines in cluster LabCluster on datacenter DC01.
In the vSphere Web Client, the same notification can be found by selecting the referenced data center and navigating to Issues under the Monitor tab. The message cannot be removed.
Environment
VMware Cloud Foundation 2.x
Resolution
Disable HA at the cluster level - wait for the HA agent to be uninstalled from all the hosts in the cluster, and then re-enable HA on the cluster.