Unexpected VCHA failover could happen due to health check failure of lookupsvc

Article ID: 433815

Products

VMware vCenter Server

Issue/Introduction

A vCenter High Availability (VCHA) failover can occur when the active node restarts because the lookupsvc health check fails. On the Primary node, /var/log/vmware/vmon/vmon.log shows the health check command being force-killed when it times out, 30 seconds after it is run:

<DATE_TIME>In(05) host-2480 <lookupsvc> Running the API Health command as user lookupsvc
<DATE_TIME>In(05) host-2480 <lookupsvc-healthcmd> Constructed command: /usr/bin/python /usr/lib/vmware-lookupsvc/bin/lookupsvc-health.py /var/log/vmware/lookupsvc/lookupsvc-health.log
<DATE_TIME>Wa(03) host-2480 <lookupsvc-healthcmd> SysProcess exec timed out. Force kill. Pid 114411
<DATE_TIME>Wa(03) host-2480 <lookupsvc> Service api healthcheck command timedout.
<DATE_TIME>Er(02) host-2480 <lookupsvc> health state unknown ,considered as system failure
<DATE_TIME>Er(02) host-2480 System Failure, initiating system restart.
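
To check whether a node has hit this condition, the log can be searched for the same messages. The following is a minimal sketch, not an official tool: the function name find_lookupsvc_health_failures is hypothetical, and the pattern is derived only from the sample lines above, so other builds may word the messages differently.

```shell
# Hypothetical helper: print vmon.log lines that match the lookupsvc
# health check timeout signature shown in the sample above.
find_lookupsvc_health_failures() {
  # Default path is the one named in this article; pass another file to override.
  local log="${1:-/var/log/vmware/vmon/vmon.log}"
  # Matches "timed out", "timedout" and "health state unknown" lines
  # that are tagged <lookupsvc> or <lookupsvc-healthcmd>.
  grep -E '<lookupsvc(-healthcmd)?> .*(timed ?out|health state unknown)' "$log"
}
```

Called without arguments it reads the default vmon.log path; a different log file can be passed as the first argument. No output means the pattern was not found in that file.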

Environment

VMware vCenter Server in VCHA

Cause

The health check fails because lookupsvc becomes unresponsive during a garbage collection (GC) that runs just before the health check. The GC activity is recorded in /var/log/vmware/lookupsvc/vmware-lookupsvc-gc.log:

<DATE_TIME>: 3899856.672: [Full GC (Ergonomics)
<DATE_TIME>: 3899861.490: [SoftReference, 105 refs, 0.0000496 secs]
<DATE_TIME>: 3899861.490: [WeakReference, 56619 refs, 0.0090140 secs]
<DATE_TIME>: 3899861.499: [FinalReference, 1107 refs, 0.2260642 secs]
<DATE_TIME>: 3899861.725: [PhantomReference, 55097 refs, 0.0012427 secs]
<DATE_TIME>: 3899861.726: [JNI Weak Reference, 0.0000287 secs]
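
To compare GC activity with the time of a failed health check, the full collections in the GC log can be listed. This is a sketch under assumptions: the function name list_full_gc is hypothetical, and it relies only on the "Full GC" marker visible in the sample above.

```shell
# Hypothetical helper: list "Full GC" events, with line numbers, from a GC log.
list_full_gc() {
  # Default path is the one named in this article; pass another file to override.
  local log="${1:-/var/log/vmware/lookupsvc/vmware-lookupsvc-gc.log}"
  grep -n 'Full GC' "$log"
}
```

The timestamps on the lines it prints can then be compared with the timeout entries in vmon.log.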

 

Resolution

Edit the lookupsvc JSON config file to include some additional GC parameters and extend the health check timeout:

Connect to the vCenter Server over SSH as root, then back up the configuration file and open it for editing:

cp /etc/vmware/vmware-vmon/svcCfgfiles/lookupsvc.json /etc/vmware/vmware-vmon/svcCfgfiles/lookupsvc.json.bak

vi /etc/vmware/vmware-vmon/svcCfgfiles/lookupsvc.json

Press i to enter insert mode, then add the following four GC parameters to the "StartCommandArgs" array (<...snip...> stands for existing entries, which stay unchanged):

  "RunAsUser": "lookupsvc",
  "StartCommand": "%VMWARE_JAVA_HOME%/bin/lookupsvc",
  "StartCommandArgs": [
      <...snip...>
      "-XX:ParallelGCThreads=2",
      "-XX:+UseGCLogFileRotation",
      "-XX:NumberOfGCLogFiles=10",
      "-XX:GCLogFileSize=10M",
      <...snip...>
      "-Xloggc:%VMWARE_LOG_DIR%/vmware/lookupsvc/vmware-lookupsvc-gc.log",
      "org.apache.catalina.startup.Bootstrap",
      "fips_ready",
      "start"
  ],

Next, locate the following line:

"ApiHealthCmdTimeout": 30,

and update the value to 60 seconds:

"ApiHealthCmdTimeout": 60,

Press Esc, type :wq! and press Enter to save and close the file, then restart the lookup service:

service-control --stop lookupsvc
service-control --start lookupsvc
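
Because a hand-edited JSON file is easy to break with a stray comma or quote, it can help to validate the file once it is saved, ideally before the restart. The sketch below is not part of the official procedure. The function name check_lookupsvc_cfg is hypothetical, and the script assumes a python3 or python interpreter is available on the PATH. It searches the file text for the new values, so it makes no assumption about where the keys sit in the JSON structure.

```shell
# Hypothetical helper: sanity-check the edited lookupsvc.json.
check_lookupsvc_cfg() {
  local cfg="${1:-/etc/vmware/vmware-vmon/svcCfgfiles/lookupsvc.json}"
  local py flag
  # Assumption: a Python interpreter is on the PATH.
  py="$(command -v python3 || command -v python)" || { echo "python not found" >&2; return 2; }

  # 1. The file must still parse as JSON.
  "$py" -m json.tool "$cfg" >/dev/null 2>&1 || { echo "invalid JSON: $cfg"; return 1; }

  # 2. Each of the four GC parameters must appear as a quoted string.
  for flag in '-XX:ParallelGCThreads=2' '-XX:+UseGCLogFileRotation' \
              '-XX:NumberOfGCLogFiles=10' '-XX:GCLogFileSize=10M'; do
    grep -qF -- "\"$flag\"" "$cfg" || { echo "missing: $flag"; return 1; }
  done

  # 3. The health check timeout must be set to 60.
  grep -Eq '"ApiHealthCmdTimeout"[[:space:]]*:[[:space:]]*60([^0-9]|$)' "$cfg" \
    || { echo "ApiHealthCmdTimeout is not 60"; return 1; }

  echo "OK"
}
```

If the check reports a problem, the backup copy lookupsvc.json.bak created earlier can be copied back over the edited file. After the restart, service-control --status lookupsvc can be used to confirm that the service is running.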

 

Additional Information

Also see:
Unexpected VCHA failover could happen due to health check failure of vc-ws1a-broker.