vCenter High Availablity enablement fails due to HACore profile startup failure and vmware-statsmonitor service startup timeout
search cancel

vCenter High Availablity enablement fails due to HACore profile startup failure and vmware-statsmonitor service startup timeout

book

Article ID: 318754

calendar_today

Updated On:

Products

VMware vCenter Server

Issue/Introduction

Symptoms:
  • Enabling VCHA on vCSA fails.
  • In the /var/log/vmware/vpxd/vpxd.log, you see a message similar to:
YYYY-MM-DDTHH:MM:SS.XXX+01:00 info vpxd[7FDFFF9F3700] [Originator@6876 sub=Default opID=FlowBasedWizard-apply-426753-ngc:70014009-f6] [VpxLxRO] -- ERROR task-31949 -- FailoverClusterConfigurator -- vim.vcha.FailoverClusterConfigurator.configure: vmodl.fault.SystemError:
--> Result:
--> (vmodl.fault.SystemError) {
--> faultCause = (vmodl.MethodFault) null,
--> faultMessage = <unset>,
--> reason = "Failed to start HACore profile on node XXX.XXX.XXX.10"
--> msg = ""
--> }
--> Args:
-->
--> Arg configSpec:
--> (vim.vcha.FailoverClusterConfigurator.VchaClusterConfigSpec) {
--> passiveIp = "XXX.XXX.XXX.9",
--> witnessIp = "XXX.XXX.XXX.10"
--> }
  • In vmon log entries from the failed peer or witness node /var/log/vmware/vmon/vmon-syslog.log​, you see a message similar to:
YYYY-MM-DDTHH:MM:SS.XXX+01:00 notice vmon Service statsmonitor start operation timed out.
YYYY-MM-DDTHH:MM:SS.XXX+01:00 notice vmon Constructed command: /usr/bin/python /usr/lib/vmware-vmon/vmonEventPublisher.py --eventdata statsmonitor,UNHEALTHY,UNKNOWN,0
YYYY-MM-DDTHH:MM:SS.XXX+01:00 warning vmon Service statsmonitor exited. Exit code 1
YYYY-MM-DDTHH:MM:SS.XXX+01:00 err vmon Service batch op START failed. Failed services: 'statsmonitor'​


Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.


Environment

VMware vCenter Server Appliance 6.7.x
VMware vCenter Server Appliance 6.5.x

Cause

In statsmonitor service initialization during service startup, the database /var/vmware/applmgmt/appliance_stats.sqlite file is checked for integrity. If the IO subsystem hosting the appliance root file system vmdk is busy or facing latency issues, database integrity check might take longer time than the default service startup timeout, causing vMon service watchdog to declare statsmonitor unhealthy and stopping it. This can cause the vMon watchdog's whole HACore profile startup fail, in turn causing VCHA enablement to be declared as failed.

Default statsmonitor service configuration assumes fast storage IO but it's still possible for IO storms to cause IO delays.

Resolution

This issue has been fixed as of the following releases:



Workaround:

To work around this issue: 
  1. Before enabling VCHA, on the vCSA node(VCHA active node)
  2. Login to vCSA with root credentials via SSH.
  3. Modify statsmonitor service config for vMon 
 sed -i '/StartTimeout/d' /etc/vmware/vmware-vmon/svcCfgfiles/statsmonitor.json
 sed -i '/ApiHealthFile/a "StartTimeout": 600,' /etc/vmware/vmware-vmon/svcCfgfiles/statsmonitor.json
  1. Reload vMon service config through SIGHUP
kill -HUP $(cat /var/run/vmon.pid)​
  1. Stop and start statmonitor service.
​​​/usr/lib/vmware-vmon/vmon-cli -k statsmonitor
/usr/lib/vmware-vmon/vmon-cli -i statsmonitor​
  1. Enable VCHA.