Healthwatch 2.2.1 TSDB VM'S in failing state

Article ID: 345708


Updated On:

Products

VMware

Issue/Introduction

Symptoms:

Healthwatch TSDB VMs are in a failing state and the Prometheus job has failed; it cannot be restarted using monit commands.

tsdb/b1e51d26-8418-4c5e-9244-7580378a1d0b      failing       CHL-LAB-PKS001 10.194.156.35 vm-b2113751-7084-43dd-ac39-85244e16e10a xlarge     true   bosh-vsphere-esxi-ubuntu-xenial-go_agent/621.236
tsdb/d1d08af1-e5dd-4a69-b582-403f50676cae      failing       CHL-LAB-PKS001 10.194.156.26 vm-ba791e2a-de55-4ddf-ab4d-3cd995a8f71b xlarge     true   bosh-vsphere-es
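Output of this kind is typically obtained with the BOSH CLI; a minimal sketch, assuming you substitute the Healthwatch deployment name used in your environment:

# List VM state for the Healthwatch deployment (deployment name is an example)
bosh -d <healthwatch-deployment-name> vms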
You can see errors similar to the following in the Prometheus stderr log and in the TSDB VM logs:
Task 3018263 | 05:14:40 | failed jobs: prometheus
Task 3018263 | 05:19:40 | Error: 'tsdb/b1e51d26-8418-4c5e-9244-7580378a1d0b (0)' is not running after update. Review logs for failed jobs:
/var/vcap/data/packages/ruby-2.6.8-r0.58.0/082ece384379512d3506533aa31d656cdbfc97de/lib/ruby/2.6.0/psych.rb:456:in `parse': (/var/vcap/store/pks-cluster-discovery/scrape_configs.yml): found unexpected end of stream while scanning a quoted scalar at line 1240 column 17 (Psych::SyntaxError)
from /var/vcap/data/packages/ruby-2.6.8-r0.58.0/082ece384379512d3506533aa31d656cdbfc97de/lib/ruby/2.6.0/psych.rb:456:in `parse_stream'
(/var/vcap/store/pks-cluster-discovery/scrape_configs.yml): found unexpected end of stream while scanning a quoted scalar at line 1240 column 17 (Psych::SyntaxError)
On verifying the /var/vcap/store/pks-cluster-discovery/scrape_configs.yml file, it was found to be corrupted.
 msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series",

Also, verify whether any additional scrape jobs are configured, and check scrape_configs.yml for syntax errors.
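One way to check the file for YAML syntax errors is to parse it with the Ruby interpreter already bundled on the VM (the same Psych parser that produced the error above). This is a minimal sketch; the package directory layout under /var/vcap/data/packages is an assumption and varies by environment, so adjust the path to match your VM:

# Locate the bundled Ruby (package version and hash differ per environment)
RUBY_BIN=$(ls -d /var/vcap/data/packages/ruby-*/*/bin/ruby | head -1)

# Parse the file with Psych; a syntax error is reported with its line and column
"$RUBY_BIN" -ryaml -e 'YAML.load_file("/var/vcap/store/pks-cluster-discovery/scrape_configs.yml")' \
  && echo "scrape_configs.yml parses cleanly"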


Environment

VMware Tanzu Kubernetes Grid Integrated Edition 1.x

Cause

The Prometheus scrape config file is corrupted due to incorrect syntax and configuration. This is a known issue in the Prometheus version shipped with Healthwatch 2.2.1 (https://github.com/prometheus/prometheus/pull/10406) and is fixed in newer versions.

Resolution

To fix the issue, make sure scrape_configs.yml is empty before applying changes:

monit stop prometheus

echo $'---\n[]\n' > /var/vcap/store/pks-cluster-discovery/scrape_configs.yml

monit restart pks-cluster-discovery   # wait for running status

monit restart prometheus
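As an optional sanity check after the restart (a sketch; the expected file content is just the empty list written above):

cat /var/vcap/store/pks-cluster-discovery/scrape_configs.yml   # should contain only '---' and '[]'
monit summary   # prometheus and pks-cluster-discovery should report 'running'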
NOTE: In this particular issue, /var/vcap/store/prometheus was consuming around 460 GB. Make sure you add enough resources to the TSDB VMs before applying changes. Verify the chunk sizes in the /var/vcap/store/prometheus/chunks_head folder; if there is a file with 0 size, perform the same steps above.
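A quick way to check both conditions (a sketch using standard shell tools):

# Overall size of the Prometheus data directory
du -sh /var/vcap/store/prometheus

# List any zero-byte chunk files in chunks_head
find /var/vcap/store/prometheus/chunks_head -type f -size 0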
  • To start Prometheus without scaling the VMs, perform the following steps on both TSDB VMs:
  • SSH to the TSDB VM
  • monit stop prometheus
  • Delete all the chunks from this directory: /var/vcap/store/prometheus/chunks_head/
  • If monit summary is not working on your TSDB VMs, check that the /var/vcap/bosh/etc/monitrc file on both TSDB VMs contains the following:
set daemon 10

set logfile /var/vcap/monit/monit.log
 
set httpd port 2822 and use address 127.0.0.1

allow cleartext /var/vcap/monit/monit.user

include /var/vcap/monit/*.monitrc

include /var/vcap/monit/job/*.monitrc
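If the file had to be corrected, monit can be asked to re-read its configuration; a sketch, run on the affected TSDB VM:

# Re-read the monit control file, then confirm job status
monit reload
monit summary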
  • Perform the below steps if monit commands are not working on both TSDB VMs (see the sketch after this list):
  • SSH to the TSDB VM
  • monit stop prometheus (if monit does not work, you can skip this step)
  • Delete all the chunks from this directory: /var/vcap/store/prometheus/chunks_head/
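A minimal sketch of the chunk cleanup described above, assuming you reach the VM with bosh ssh (the deployment and instance names are environment-specific examples; use bosh vms to find yours):

# SSH to the failing TSDB instance
bosh -d <healthwatch-deployment-name> ssh tsdb/<instance-guid>

# On the VM, stop Prometheus if monit is responsive, then remove the head chunks
sudo -i
monit stop prometheus
rm -f /var/vcap/store/prometheus/chunks_head/*
monit start prometheus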
Then open Ops Manager and perform the following steps:
  • In the Healthwatch tile, on the Prometheus pane, set the Scrape interval value to 10m or more to decrease the load on the TSDB VMs
  • Perform Apply Changes to reinstall Healthwatch from Ops Manager
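If you prefer the CLI, the same deploy can be triggered with the om tool; a sketch, where the Ops Manager credentials and the tile's product name are environment-specific placeholders:

# Apply changes for the Healthwatch tile only
om --target https://<opsman-fqdn> --username <admin-user> --password <admin-password> \
   apply-changes --product-name <healthwatch-product-name>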
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

Additional Information

Healthwatch version: 2.2.1

Ops Manager: Tanzu Ops Manager v2.10.39-build.450

TKGI tile version: 1.13.4 build 1.5


Impact/Risks: