Healthwatch 2.2.1 TSDB VM'S in failing state

Article ID: 345708


Updated On:

Products

VMware

Issue/Introduction

Symptoms:

Healthwatch TSDB VMs are in a failing state and the Prometheus job has failed; it cannot be restarted using monit commands.

tsdb/b1e51d26-8418-4c5e-9244-7580378a1d0b      failing       CHL-LAB-PKS001 10.194.156.35 vm-b2113751-7084-43dd-ac39-85244e16e10a xlarge     true   bosh-vsphere-esxi-ubuntu-xenial-go_agent/621.236
tsdb/d1d08af1-e5dd-4a69-b582-403f50676cae      failing       CHL-LAB-PKS001 10.194.156.26 vm-ba791e2a-de55-4ddf-ab4d-3cd995a8f71b xlarge     true   bosh-vsphere-es
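Output of this kind is typically obtained with the BOSH CLI; a minimal sketch, assuming you substitute the Healthwatch deployment name used in your environment:

# List VM state for the Healthwatch deployment (deployment name is an example)
bosh -d <healthwatch-deployment-name> vms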
You can see errors similar to the following in the Prometheus stderr log and in the TSDB VM logs:
Task 3018263 | 05:14:40 | failed jobs: prometheus
Task 3018263 | 05:19:40 | Error: 'tsdb/b1e51d26-8418-4c5e-9244-7580378a1d0b (0)' is not running after update. Review logs for failed jobs:
/var/vcap/data/packages/ruby-2.6.8-r0.58.0/082ece384379512d3506533aa31d656cdbfc97de/lib/ruby/2.6.0/psych.rb:456:in `parse': (/var/vcap/store/pks-cluster-discovery/scrape_configs.yml): found unexpected end of stream while scanning a quoted scalar at line 1240 column 17 (Psych::SyntaxError)
from /var/vcap/data/packages/ruby-2.6.8-r0.58.0/082ece384379512d3506533aa31d656cdbfc97de/lib/ruby/2.6.0/psych.rb:456:in `parse_stream'
(/var/vcap/store/pks-cluster-discovery/scrape_configs.yml): found unexpected end of stream while scanning a quoted scalar at line 1240 column 17 (Psych::SyntaxError)
On verifying the /var/vcap/store/pks-cluster-discovery/scrape_configs.yml file, it was found to be corrupted.
 msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series",

Also, verify whether any additional scrape jobs are configured, and check scrape_configs.yml for syntax errors.
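One way to check the file for YAML syntax errors is to parse it with the Ruby interpreter already bundled on the VM (the same Psych parser that produced the error above). This is a minimal sketch; the package directory layout under /var/vcap/data/packages is an assumption and varies by environment, so adjust the path to match your VM:

# Locate the bundled Ruby (package version and hash differ per environment)
RUBY_BIN=$(ls -d /var/vcap/data/packages/ruby-*/*/bin/ruby | head -1)

# Parse the file with Psych; a syntax error is reported with its line and column
"$RUBY_BIN" -ryaml -e 'YAML.load_file("/var/vcap/store/pks-cluster-discovery/scrape_configs.yml")' \
  && echo "scrape_configs.yml parses cleanly"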


Environment

VMware Tanzu Kubernetes Grid Integrated Edition 1.x

Cause

The Prometheus scrape config file is corrupted due to incorrect syntax and configuration. This is a known issue in the Prometheus version shipped with Healthwatch 2.2.1 (https://github.com/prometheus/prometheus/pull/10406) and is fixed in newer versions.

Resolution

To fix the issue, make sure scrape_configs.yml is empty before applying changes:

monit stop prometheus

echo $'---\n[]\n' > /var/vcap/store/pks-cluster-discovery/scrape_configs.yml

monit restart pks-cluster-discovery   # wait for running status

monit restart prometheus
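As an optional sanity check after the restart (a sketch; the expected file content is just the empty list written above):

cat /var/vcap/store/pks-cluster-discovery/scrape_configs.yml   # should contain only '---' and '[]'
monit summary   # prometheus and pks-cluster-discovery should report 'running'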
NOTE: In this particular issue, /var/vcap/store/prometheus was consuming around 460 GB. Make sure you add enough resources to the TSDB VMs before applying changes. Verify the chunk sizes in the /var/vcap/store/prometheus/chunks_head folder; if there is a file with 0 size, perform the same steps above.
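A quick way to check both conditions (a sketch using standard shell tools):

# Overall size of the Prometheus data directory
du -sh /var/vcap/store/prometheus

# List any zero-byte chunk files in chunks_head
find /var/vcap/store/prometheus/chunks_head -type f -size 0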
  • To start Prometheus without scaling the VMs, perform the following steps on both TSDB VMs:
  • SSH to the TSDB VM
  • monit stop prometheus
  • Delete all the chunks from this directory: /var/vcap/store/prometheus/chunks_head/
  • If monit summary is not working on your TSDB VMs, check that the /var/vcap/bosh/etc/monitrc file on both TSDB VMs contains the following:
set daemon 10

set logfile /var/vcap/monit/monit.log
 
set httpd port 2822 and use address 127.0.0.1

allow cleartext /var/vcap/monit/monit.user

include /var/vcap/monit/*.monitrc

include /var/vcap/monit/job/*.monitrc
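If the file had to be corrected, monit can be asked to re-read its configuration; a sketch, run on the affected TSDB VM:

# Re-read the monit control file, then confirm job status
monit reload
monit summary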
  • Perform the below steps if monit commands are not working on both TSDB VMs (see the sketch after this list):
  • SSH to the TSDB VM
  • monit stop prometheus (if monit does not work, you can skip this step)
  • Delete all the chunks from this directory: /var/vcap/store/prometheus/chunks_head/
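A minimal sketch of the chunk cleanup described above, assuming you reach the VM with bosh ssh (the deployment and instance names are environment-specific examples; use bosh vms to find yours):

# SSH to the failing TSDB instance
bosh -d <healthwatch-deployment-name> ssh tsdb/<instance-guid>

# On the VM, stop Prometheus if monit is responsive, then remove the head chunks
sudo -i
monit stop prometheus
rm -f /var/vcap/store/prometheus/chunks_head/*
monit start prometheus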
Then open Ops Manager and perform the following steps:
  • In the Healthwatch tile, on the Prometheus pane, set the Scrape interval value to 10m or more to decrease the load on the TSDB VMs
  • Perform Apply Changes to reinstall Healthwatch from Ops Manager
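If you prefer the CLI, the same deploy can be triggered with the om tool; a sketch, where the Ops Manager credentials and the tile's product name are environment-specific placeholders:

# Apply changes for the Healthwatch tile only
om --target https://<opsman-fqdn> --username <admin-user> --password <admin-password> \
   apply-changes --product-name <healthwatch-product-name>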
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

Additional Information

Healthwatch version: 2.2.1

Ops Manager: Tanzu Ops Manager v2.10.39-build.450

TKGI tile version: 1.13.4 build 1.5


Impact/Risks: