Router service crashing on log-store VM in Metric-Store deployment

Article ID: 403977

Products

VMware Tanzu Application Service

Issue/Introduction

  • The monit-managed 'router' process restarts constantly on the log-store VMs in the Metric-Store deployment.
  • /var/vcap/sys/log/router/router.stderr.log contains errors like the ones below, followed by a stack trace:

    "warning: more than 1 hinted handoff file detected","partition":"/var/vcap/store/router/handoff/##/<LOCAL_VM_IP_ADDRESS>:9000","files":"23,24"

    Example showing folder 20 and files 23 and 24 (using fake IP 10.10.1.50 as local VM IP address):

    "warning: more than 1 hinted handoff file detected","partition":"/var/vcap/store/router/handoff/20/10.10.1.50:9000","files":"23,24"

  • This failure follows a storage outage on the underlying infrastructure datastores.
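To confirm the symptom and collect the affected paths in one pass, the warning lines can be turned into full file paths. The following is a minimal sketch, not part of the product: the function name list_handoff_files is hypothetical, and it assumes the "partition" and "files" fields appear exactly as in the sample message above.

```shell
# Print the full path of every handoff file named in the router warnings.
# Reads log text on stdin. Assumes (not confirmed beyond the sample in this
# article) that each warning carries "partition":"<dir>","files":"<n,n,...>".
list_handoff_files() {
  grep 'more than 1 hinted handoff file detected' |
    sed -n 's/.*"partition":"\([^"]*\)","files":"\([^"]*\)".*/\1 \2/p' |
    while read -r partition files; do
      # "files" is a comma-separated list such as 23,24
      for f in $(printf '%s' "$files" | tr ',' ' '); do
        printf '%s/%s\n' "$partition" "$f"
      done
    done | sort -u
}

# Usage on the log-store VM (as root):
#   list_handoff_files < /var/vcap/sys/log/router/router.stderr.log
```

Because the router restarts repeatedly, the same warning is likely to be logged many times; sort -u collapses the repeats so that each file is listed once.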

 

Environment

This problem is not version specific. It was seen on Metric-Store 1.7.0.

Cause

The underlying storage outage corrupts files on the log-store VM's filesystem. Some handoff files are corrupted, and the router service fails when it attempts to process them.

Resolution

Move the offending files to a backup folder created under /var/vcap/store, then restart the router service:

  1. Create backup folder:

    # mkdir /var/vcap/store/log_bak

  2. Move the problem files to the backup folder (identify the source folder and files from the error message in router.stderr.log):

    # mv /var/vcap/store/router/handoff/##/<LOCAL_VM_IP_ADDRESS>\:9000/23 /var/vcap/store/log_bak
    # mv /var/vcap/store/router/handoff/##/<LOCAL_VM_IP_ADDRESS>\:9000/24 /var/vcap/store/log_bak


    Example (using fake IP 10.10.1.50 as local VM IP address):

    # mv /var/vcap/store/router/handoff/20/10.10.1.50\:9000/23 /var/vcap/store/log_bak
    # mv /var/vcap/store/router/handoff/20/10.10.1.50\:9000/24 /var/vcap/store/log_bak

  3. Restart the router service:

    # monit restart router

 

Note that this restores the router service. However, the logging present in files 23 and 24 is lost, because the files are corrupted and unrecoverable.
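When several files are affected, the create and move steps can be wrapped in a small helper. This is a sketch under stated assumptions rather than a supported tool: the function name backup_handoff_files is hypothetical, and the refusal to overwrite an existing backup is an added safeguard (files in different partition folders can share the same number), not something the procedure above requires.

```shell
# Move the given handoff files into a backup directory.
# usage: backup_handoff_files BACKUP_DIR FILE...
backup_handoff_files() {
  backup_dir=$1
  shift
  mkdir -p "$backup_dir" || return 1
  for src in "$@"; do
    if [ ! -f "$src" ]; then
      echo "skipping (not found): $src" >&2
      continue
    fi
    dest=$backup_dir/$(basename "$src")
    # Stop rather than overwrite a file that was backed up earlier.
    if [ -e "$dest" ]; then
      echo "refusing to overwrite: $dest" >&2
      return 1
    fi
    mv "$src" "$dest" || return 1
  done
}

# Usage on the log-store VM (as root), with the example values from this article:
#   backup_handoff_files /var/vcap/store/log_bak \
#     /var/vcap/store/router/handoff/20/10.10.1.50:9000/23 \
#     /var/vcap/store/router/handoff/20/10.10.1.50:9000/24
#   monit restart router
```

The restart is left as a separate manual step so that the moved files can be checked before the router is brought back up.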