Error: "Cannot complete the configuration of the vSphere HA agent on the host. Applying HA VIB's on the cluster encountered a failure" when enabling vSphere HA on a cluster with a large number or size of datastores

Article ID: 324583

Products

VMware vCenter Server

Issue/Introduction

When you attempt to enable vSphere High Availability (HA) on a cluster, the UI may display:

"Cannot complete the configuration of the vSphere HA agent on the host. Applying HA VIB's on the cluster encountered a failure"

Meanwhile, the vpxd logs often show a timeout error such as:

[YYYY-MM-DDTHH:MM:SS] error vpxd[48856] [Originator@6876 sub=DAS opID=########-#######-####-####-##:########-##-##] Timed out while waiting to monitor the progress of task: ########-####-####-####-############:com.vmware.esx.settings.clusters.software.ha_internal

[YYYY-MM-DDTHH:MM:SS] error vpxd[48856] [Originator@6876 sub=DAS opID=########-#######-####-####-##:########-##-##] Apply HA task failed with error N5Vmomi5Fault11SystemError9ExceptionE(Fault cause: vmodl.fault.SystemError
--> )
--> [context]zKq7AVECAQAAAG0mVQETdnB4ZAAA9tg3bGlidm1hY29yZS5zbwAAjXgsAAtsLQAT6TIBwaFvdnB4ZAABhaNvARxlxgFrd8YBTYHGgenLYAGBKs1gAYFY3GABgbsJYAGBhrNgAQCnSSMANZ8jALRkNwKHfwBsaWJwdGhyZWFkLnNvLjAAAy82D2xpYmMuc28uNgA=[/context]
[YYYY-MM-DDTHH:MM:SS] error vpxd[48856] [Originator@6876 sub=DAS opID=########-#######-####-####-##:########-##-##] ApplyHA result is null

[YYYY-MM-DDTHH:MM:SS] info vpxd[48856] [Originator@6876 sub=Default opID=########-#######-####-####-##:########-##-##] [VpxLRO] -- ERROR task-14896662 -- Cluster-01 -- DasConfig.ConfigureCluster: vim.fault.DasConfigFault:
--> Result:
--> (vim.fault.DasConfigFault) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    reason = "ApplyHAVibsOnClusterFailed",
-->    output = <unset>,
-->    event = <unset>
-->    msg = ""
--> }
--> Args:
-->
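
To check quickly whether a vpxd log contains this failure signature, the matching can be sketched in Python. This is a minimal sketch rather than a supported tool: the function name `find_ha_timeout_lines` is hypothetical, and the two marker strings are copied from the log excerpt above.

```python
# Marker strings copied from the vpxd log excerpt above.
MARKERS = (
    "Timed out while waiting to monitor the progress of task",
    "ApplyHAVibsOnClusterFailed",
)


def find_ha_timeout_lines(lines):
    """Return (line_number, text) pairs for lines containing any marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in MARKERS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

The function accepts any iterable of lines, so an open log file object can be passed directly.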

In /var/log/vmware/vmware-updatemgr/vum-server/vmware-vum-server.log, the ClusterApplyHATask completes successfully but takes longer than expected:

[YYYY-MM-DDTHH:MM:SS] info vmware-vum-server[42393] [Originator@6876 sub=ClusterApplyHATask] [Task, 457] Task:com.vmware.vcIntegrity.lifecycle.ClusterApplyHATask ID:########-####-####-####-############. Task Created

[YYYY-MM-DDTHH:MM:SS] info vmware-vum-server[01166] [Originator@6876 sub=ClusterApplyHATask] [Task, 457] Task:com.vmware.vcIntegrity.lifecycle.ClusterApplyHATask ID:########-####-####-####-############. Task State updated to SUCCEEDED

On the ESXi host, /var/run/log/esxupdate.log or /var/run/log/lifecycle.log shows large time gaps between entries.

Example:

[YYYY-MM-DDTHH:MM:SS] lifecycle: 35769095: runcommand:186 INFO runcommand called with: args = '['/sbin/smbiosDump']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.

Gap of 10 mins

[YYYY-MM-DDTHH:MM:SS] lifecycle: 35769095: upgrade_precheck:2160 INFO Image size: 270 MB, Maximum size: 4084 MB

Gap of 9 mins

[YYYY-MM-DDTHH:MM:SS] lifecycle: 35769095: upgrade_precheck:2222 INFO Locker currently have 190036056 bytes in package folder, and 109571997696 bytes free. Incoming image has 168691854 bytes of locker payloads, estimate to take 208946765 bytes of space.
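
Gaps like these can be located automatically by comparing the timestamps of consecutive log entries. The following is a minimal sketch under stated assumptions: the function name `find_gaps` and the 5-minute threshold are hypothetical, and the timestamp format is assumed to be ISO 8601 (for example `2024-01-01T12:00:00`), since the real values are masked in the excerpt above.

```python
import re
from datetime import datetime, timedelta

# Assumption: each entry carries an ISO 8601 timestamp such as
# 2024-01-01T12:00:00 (the article masks the real values).
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (gap, previous_line, next_line) for gaps >= threshold."""
    gaps = []
    previous_time = None
    previous_line = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # wrapped continuation lines carry no timestamp
        current_time = datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%S")
        if previous_time is not None and current_time - previous_time >= threshold:
            gaps.append((current_time - previous_time, previous_line, line))
        previous_time, previous_line = current_time, line
    return gaps
```

Each result pairs the gap length with the entries on either side of it, which makes it easier to look up the matching time window in other logs.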

During these time gaps, /var/run/log/syslog.log contains entries like the following, which show the datastores being queried:

[YYYY-MM-DDTHH:MM:SS] ConfigStore[35782741]: SlowRefresh: path /vmfs/volumes/<Datastore_UUID> total blocks 16492405981184 used blocks 13870742634496forceRefresh

[YYYY-MM-DDTHH:MM:SS] ConfigStore[35782741]: SlowRefresh: path /vmfs/volumes/<Datastore_UUID> total blocks 128580583424 used blocks 19013828608forceRefresh = 0

Note: The above log messages are seen in clusters managed by vLCM images.
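
To see how often each datastore is refreshed during a gap, the SlowRefresh entries can be counted per path. This is a minimal sketch: the function name `count_slow_refreshes` is hypothetical, and the pattern is inferred only from the two syslog entries quoted above, so it may need adjusting for other builds.

```python
import re
from collections import Counter

# Pattern based on the SlowRefresh syslog entries quoted above.
SLOW_REFRESH = re.compile(
    r"SlowRefresh: path (?P<path>\S+) total blocks (?P<total>\d+) "
    r"used blocks (?P<used>\d+)"
)


def count_slow_refreshes(lines):
    """Count SlowRefresh entries per datastore path."""
    counts = Counter()
    for line in lines:
        match = SLOW_REFRESH.search(line)
        if match:
            counts[match.group("path")] += 1
    return counts
```

A high count for the same path within one gap is consistent with the repeated filesystem queries described under Cause.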

Environment

VMware vCenter Server 7.x
VMware vCenter Server 8.x

Cause

During HA configuration, the host filesystems are queried repeatedly. In environments with many or large datastores, these queries can take a long time. vpxd enforces a 15-minute timeout on the HA apply task, so slow filesystem queries can cause the task to time out and the HA configuration to fail.

Note: The counter for this timeout value starts after the task is started in the vum-server and not in vpxd.
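
As a worked example of the arithmetic, the two gaps in the lifecycle.log excerpt (10 and 9 minutes) already add up to more than the 15-minute default. The sketch below assumes the task takes at least as long as the sum of the observed gaps; the function name `exceeds_timeout` and the constant names are hypothetical.

```python
DEFAULT_TIMEOUT_SECS = 15 * 60   # 15-minute default stated above = 900 s
RAISED_TIMEOUT_SECS = 1800       # value used in the Resolution section


def exceeds_timeout(gap_minutes, timeout_secs):
    """Return True if the summed gaps (in minutes) exceed the timeout."""
    total_secs = sum(gap_minutes) * 60
    return total_secs > timeout_secs


# Gaps from the lifecycle.log example: 10 minutes and 9 minutes.
example_gaps = [10, 9]
print(exceeds_timeout(example_gaps, DEFAULT_TIMEOUT_SECS))  # True: 1140 s > 900 s
print(exceeds_timeout(example_gaps, RAISED_TIMEOUT_SECS))   # False: 1140 s <= 1800 s
```

The real task duration also includes the work between the gaps, so this is a lower bound rather than a sizing formula.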

Resolution

To resolve this issue, increase the timeout by setting the HA advanced parameter das.remediateHATaskTimeoutSecs to 1800 seconds or more:

1. Navigate to the cluster in vCenter, then go to Configure → vSphere Availability → Advanced Options
2. Add or modify the parameter das.remediateHATaskTimeoutSecs, setting a value of at least 1800 seconds
3. Save and re-enable vSphere HA on the cluster
4. Monitor the task and logs to confirm completion within the new timeout window

Note: The timeout value is specified in seconds. This value can be further increased if required, based on the environment.

For information on how to set HA advanced parameters, refer to vSphere HA Advanced Options.

To check whether filesystem operations are slow, execute:

time esxcli storage filesystem list
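
The same timing check can be scripted, for example to repeat it across hosts or to record the result. This is a minimal sketch assuming a Python 3 interpreter is available where it runs; the helper name `time_command` is hypothetical, and the esxcli call is shown only as a commented usage line.

```python
import subprocess
import time


def time_command(command):
    """Run command and return (elapsed_seconds, return_code)."""
    start = time.monotonic()
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return time.monotonic() - start, completed.returncode


# Example call on an ESXi host (not executed here):
# elapsed, code = time_command(["esxcli", "storage", "filesystem", "list"])
```

The command output is discarded because only the elapsed time matters for this check.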

Note: In environments with a large number of datastores, HA configuration takes longer because of the delay in querying the filesystems.
