Healthwatch TSDB VMs are in a failing state and the Prometheus job has failed; the job cannot be restarted using monit commands.
tsdb/b1e51d26-8418-4c5e-9244-7580378a1d0b failing CHL-LAB-PKS001 10.194.156.35 vm-b2113751-7084-43dd-ac39-85244e16e10a xlarge true bosh-vsphere-esxi-ubuntu-xenial-go_agent/621.236
tsdb/d1d08af1-e5dd-4a69-b582-403f50676cae failing CHL-LAB-PKS001 10.194.156.26 vm-ba791e2a-de55-4ddf-ab4d-3cd995a8f71b xlarge true bosh-vsphere-es
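To pick the failing instances out of a longer listing, a small filter over the `bosh vms` output can help. This is only a sketch: it assumes the column layout shown above (instance name in column 1, process state in column 2), and the function name `failing_instances` and the deployment placeholder are illustrative, not part of Healthwatch or the BOSH CLI.

```shell
# Print the instance name of every row whose process state is "failing".
# Assumes whitespace-separated columns as in the listing above.
failing_instances() {
  awk '$2 == "failing" { print $1 }'
}

# Usage (the deployment name is a placeholder, not taken from this article):
#   bosh -d <healthwatch-deployment> vms | failing_instances
```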
You can see errors similar to the following in Prometheus-stderr.log and the TSDB VM logs:
Task 3018263 | 05:14:40 | failed jobs: prometheus
Task 3018263 | 05:19:40 | Error: 'tsdb/b1e51d26-8418-4c5e-9244-7580378a1d0b (0)' is not running after update. Review logs for failed jobs:
/var/vcap/data/packages/ruby-2.6.8-r0.58.0/082ece384379512d3506533aa31d656cdbfc97de/lib/ruby/2.6.0/psych.rb:456:in `parse': (/var/vcap/store/pks-cluster-discovery/scrape_configs.yml): found unexpected end of stream while scanning a quoted scalar at line 1240 column 17 (Psych::SyntaxError)
from /var/vcap/data/packages/ruby-2.6.8-r0.58.0/082ece384379512d3506533aa31d656cdbfc97de/lib/ruby/2.6.0/psych.rb:456:in `parse_stream'
The /var/vcap/store/pks-cluster-discovery/scrape_configs.yml file was checked and found to be corrupted.
msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series",
Also check whether any additional scrape jobs are configured, and check scrape_configs.yml for syntax errors.
The Prometheus scrape_configs.yml file is corrupted due to incorrect syntax and configuration. This is a known issue in Healthwatch 2.2.1, caused by a Prometheus issue (https://github.com/prometheus/prometheus/pull/10406) that is fixed in newer versions.
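One way to run the syntax check is to parse the file with the same YAML library that reported the error in the stack trace above (Ruby's Psych). The sketch below assumes a `ruby` interpreter is on the PATH; on the TSDB VM the interpreter under /var/vcap/data/packages may have to be called by its full path instead. The helper name `check_yaml` is hypothetical.

```shell
# Return 0 if the file parses as YAML, non-zero otherwise.
# Assumption: `ruby` is on PATH; adjust the command if it is not.
check_yaml() {
  if [ ! -r "$1" ]; then
    echo "cannot read $1" >&2
    return 1
  fi
  # YAML.load_file raises Psych::SyntaxError on malformed input,
  # which makes ruby exit with a non-zero status.
  ruby -ryaml -e 'YAML.load_file(ARGV[0])' "$1"
}

# Usage:
#   check_yaml /var/vcap/store/pks-cluster-discovery/scrape_configs.yml \
#     && echo "syntax OK"
```

A corrupted file should fail with the same "found unexpected end of stream" message, including the line and column to inspect.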
monit stop prometheus
echo $'---\n[]\n' > /var/vcap/store/pks-cluster-discovery/scrape_configs.yml
monit restart pks-cluster-discovery   # wait for running status
monit restart prometheus
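The "wait for running status" step can be scripted rather than checked by hand. The following is a minimal sketch, assuming `monit summary` prints one line per job in the form `Process 'name'   running`; the helper names `job_is_running` and `wait_for_job`, the 10-second interval, and the default of 30 attempts are all illustrative choices.

```shell
# Succeeds when the monit summary on stdin reports the named process as running.
# Assumes summary lines of the form: Process 'name'    running
job_is_running() {
  grep -Eq "^Process '$1'[[:space:]]+running[[:space:]]*$"
}

# Poll `monit summary` until the job is running or the attempts run out.
wait_for_job() {
  job="$1"
  attempts="${2:-30}"
  while [ "$attempts" -gt 0 ]; do
    monit summary | job_is_running "$job" && return 0
    attempts=$((attempts - 1))
    sleep 10
  done
  return 1
}

# Usage:
#   monit restart pks-cluster-discovery
#   wait_for_job pks-cluster-discovery && monit restart prometheus
```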
NOTE: In this particular case, /var/vcap/store/prometheus was consuming around 460 GB. Make sure the TSDB VMs have enough resources before applying changes. Check the file sizes in the /var/vcap/store/prometheus/chunks_head folder; if any file has a size of 0, perform the same steps as above.
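The zero-size check in the note can be done with `find`. This is a sketch using the path from the note as the default; the function name `empty_chunks` is hypothetical.

```shell
# List zero-byte files directly inside a chunks_head directory.
# Defaults to the path named in the note above.
empty_chunks() {
  find "${1:-/var/vcap/store/prometheus/chunks_head}" -maxdepth 1 -type f -size 0
}

# Usage:
#   empty_chunks                         # prints nothing if no file is empty
#   du -sh /var/vcap/store/prometheus    # total space used by the store
```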
set daemon 10
set logfile /var/vcap/monit/monit.log
set httpd port 2822 and use address 127.0.0.1
allow cleartext /var/vcap/monit/monit.user
include /var/vcap/monit/*.monitrc
include /var/vcap/monit/job/*.monitrc
Healthwatch version: 2.2.1
Ops Manager: Tanzu Ops Manager v2.10.39-build.450
TKGI tile version: 1.13.4 build 1.5