Enabling HA on the cluster fails in environments with a large number/size of datastores

Article ID: 324583

Products

VMware vCenter Server

Issue/Introduction

In the UI, you will see an error similar to:

"Cannot complete the configuration of the vSphere HA agent on the host. Applying HA VIB's on the cluster encountered a failure"

In /var/log/vmware/vpxd/vpxd.log, you will see that the Apply HA task fails with a timeout error.

[YYYY-MM-DDTHH:MM:SS] error vpxd[48856] [Originator@6876 sub=DAS opID=lq77pfqv-1038903-auto-m9mg-h5:70042770-d6-01] Timed out while waiting to monitor the progress of task: 52441bc3-f641-9b9f-2de5-03d7cca3d05f:com.vmware.esx.settings.clusters.software.ha_internal

[YYYY-MM-DDTHH:MM:SS] error vpxd[48856] [Originator@6876 sub=DAS opID=lq77pfqv-1038903-auto-m9mg-h5:70042770-d6-01] Apply HA task failed with error N5Vmomi5Fault11SystemError9ExceptionE(Fault cause: vmodl.fault.SystemError
--> )
--> [context]zKq7AVECAQAAAG0mVQETdnB4ZAAA9tg3bGlidm1hY29yZS5zbwAAjXgsAAtsLQAT6TIBwaFvdnB4ZAABhaNvARxlxgFrd8YBTYHGgenLYAGBKs1gAYFY3GABgbsJYAGBhrNgAQCnSSMANZ8jALRkNwKHfwBsaWJwdGhyZWFkLnNvLjAAAy82D2xpYmMuc28uNgA=[/context]
[YYYY-MM-DDTHH:MM:SS] error vpxd[48856] [Originator@6876 sub=DAS opID=lq77pfqv-1038903-auto-m9mg-h5:70042770-d6-01] ApplyHA result is null

[YYYY-MM-DDTHH:MM:SS] info vpxd[48856] [Originator@6876 sub=Default opID=lq77pfqv-1038903-auto-m9mg-h5:70042770-d6-01] [VpxLRO] -- ERROR task-14896662 -- Cluster-01 -- DasConfig.ConfigureCluster: vim.fault.DasConfigFault:
--> Result:
--> (vim.fault.DasConfigFault) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    reason = "ApplyHAVibsOnClusterFailed",
-->    output = <unset>,
-->    event = <unset>
-->    msg = ""
--> }
--> Args:
-->

In /var/log/vmware/vmware-updatemgr/vum-server/vmware-vum-server.log, you will see that the ApplyHA task completes successfully but takes a long time.

[YYYY-MM-DDTHH:MM:SS] info vmware-vum-server[42393] [Originator@6876 sub=ClusterApplyHATask] [Task, 457] Task:com.vmware.vcIntegrity.lifecycle.ClusterApplyHATask ID:52441bc3-f641-9b9f-2de5-03d7cca3d05f. Task Created

[YYYY-MM-DDTHH:MM:SS] info vmware-vum-server[01166] [Originator@6876 sub=ClusterApplyHATask] [Task, 457] Task:com.vmware.vcIntegrity.lifecycle.ClusterApplyHATask ID:52441bc3-f641-9b9f-2de5-03d7cca3d05f. Task State updated to SUCCEEDED

On the ESXi host, /var/run/log/esxupdate.log or /var/run/log/lifecycle.log shows large time gaps in the logging.

Example:

[YYYY-MM-DDTHH:MM:SS] lifecycle: 35769095: runcommand:186 INFO runcommand called with: args = '['/sbin/smbiosDump']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.

Gap of 10 mins

[YYYY-MM-DDTHH:MM:SS] lifecycle: 35769095: upgrade_precheck:2160 INFO Image size: 270 MB, Maximum size: 4084 MB

Gap of 9 mins

[YYYY-MM-DDTHH:MM:SS] lifecycle: 35769095: upgrade_precheck:2222 INFO Locker currently have 190036056 bytes in package folder, and 109571997696 bytes free. Incoming image has 168691854 bytes of locker payloads, estimate to take 208946765 bytes of space.
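
Scanning a long esxupdate.log or lifecycle.log for such gaps by eye is tedious, so the following is a minimal sketch of one way to find them. It assumes each entry carries a timestamp in the YYYY-MM-DDTHH:MM:SS form shown above; the function name find_gaps and the 5-minute threshold are illustrative choices, not part of any VMware tool.

```python
import re
from datetime import datetime, timedelta

# Assumption: every log entry contains a timestamp such as 2024-01-15T10:00:00.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (previous_line, next_line, gap) for every gap above the threshold."""
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # lines without a timestamp are skipped
        current = datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%S")
        if prev_time is not None and current - prev_time > threshold:
            gaps.append((prev_line.rstrip(), line.rstrip(), current - prev_time))
        prev_time, prev_line = current, line
    return gaps
```

Passing the lines of a copied log file to find_gaps returns each pair of consecutive entries that are more than the threshold apart, which can then be matched against the syslog.log entries for the same period.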

During these time gaps, /var/run/log/syslog.log contains entries like the ones below, showing that the datastores are being queried.

[YYYY-MM-DDTHH:MM:SS] ConfigStore[35782741]: SlowRefresh: path /vmfs/volumes/59ed59f5-8bbbd46e-8f48-246e96634ea0 total blocks 16492405981184 used blocks 13870742634496forceRefresh

[YYYY-MM-DDTHH:MM:SS] ConfigStore[35782741]: SlowRefresh: path /vmfs/volumes/65374bfc-98ed2712-4ee3-3cfdfe557a00 total blocks 128580583424 used blocks 19013828608forceRefresh = 0
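
To see which datastores are queried and how often, the SlowRefresh entries can be summarized. The sketch below is built only from the two sample entries above; the regular expression and the function name summarize_slow_refresh are assumptions for illustration, and real logs may contain variations this pattern does not cover.

```python
import re

# Pattern derived from the sample syslog.log entries in this article.
SLOW_REFRESH = re.compile(
    r"SlowRefresh: path (?P<path>\S+) "
    r"total blocks (?P<total>\d+) used blocks (?P<used>\d+)"
)


def summarize_slow_refresh(lines):
    """Map each datastore path to (query_count, total, used) from its latest entry."""
    summary = {}
    for line in lines:
        match = SLOW_REFRESH.search(line)
        if match is None:
            continue  # not a SlowRefresh entry
        path = match.group("path")
        count = summary.get(path, (0, 0, 0))[0] + 1
        summary[path] = (count, int(match.group("total")), int(match.group("used")))
    return summary
```

A high query count across many paths during one of the logging gaps is consistent with the behavior described in this article.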


Environment

VMware vCenter Server 7.x

Cause

During HA configuration, the filesystem is queried multiple times. When the ESXi host is presented with a large number of datastores, these filesystem queries take longer. The time taken depends on the number of datastores, their size, and the storage latency.

By design, vpxd has a timeout of 15 minutes for the applyHA task. If the applyHA task takes longer than 15 minutes, vpxd times it out, even though the task may still complete successfully in vum-server.

Note: The counter for this timeout value starts after the task is started in vum-server and not in vpxd.
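
As a rough check of whether a given run exceeded this timeout, the "Task Created" and "Task State updated to SUCCEEDED" entries in vmware-vum-server.log can be compared. This is a minimal sketch under stated assumptions: the timestamp format is YYYY-MM-DDTHH:MM:SS, the entries look like the samples in this article, and the function names are hypothetical.

```python
import re
from datetime import datetime

DEFAULT_TIMEOUT_SECS = 15 * 60  # vpxd default stated in this article

# Assumption: every log entry contains a timestamp such as 2024-01-15T10:00:00.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def apply_ha_duration(lines, task_id):
    """Return seconds between task creation and success, or None if not found."""
    created = None
    finished = None
    for line in lines:
        if "ClusterApplyHATask" not in line or task_id not in line:
            continue
        match = TIMESTAMP.search(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M:%S")
        if "Task Created" in line:
            created = stamp
        elif "Task State updated to SUCCEEDED" in line:
            finished = stamp
    if created is None or finished is None:
        return None
    return (finished - created).total_seconds()


def exceeds_timeout(duration_secs, timeout_secs=DEFAULT_TIMEOUT_SECS):
    """True when the task ran longer than the timeout."""
    return duration_secs > timeout_secs
```

The task ID to pass in is the one that appears in the vpxd.log timeout message, which is the same ID logged by vum-server for the corresponding ClusterApplyHATask.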

Resolution

To work around this issue, increase the timeout by setting the HA advanced parameter das.remediateHATaskTimeoutSecs to 1800 (30 minutes).

Note: The timeout value is specified in seconds. This value can be increased further if the environment requires it.

For information on how to set HA advanced parameters, refer to vSphere HA Advanced Options.

To manually check whether the filesystem queries are slow, run the following command on the ESXi host:

time esxcli storage filesystem list

Note: In environments with a large number of datastores, HA configuration takes longer because of the delay in querying the filesystems.
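
Because the filesystem is queried multiple times during HA configuration, a single timing may understate the total delay. The sketch below times an arbitrary command from Python so the measurement can be repeated and recorded. It assumes a Python 3 interpreter is available where it runs, and the helper name time_command is hypothetical.

```python
import subprocess
import time


def time_command(argv):
    """Run a command and return (elapsed_seconds, return_code)."""
    start = time.monotonic()
    completed = subprocess.run(
        argv,
        stdout=subprocess.PIPE,  # output is discarded; only the duration matters
        stderr=subprocess.PIPE,
    )
    elapsed = time.monotonic() - start
    return elapsed, completed.returncode
```

For example, calling time_command(["esxcli", "storage", "filesystem", "list"]) on the ESXi host measures the same query as the command above.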