Healthwatch stops working throwing 503 errors


Article ID: 370722

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

A stemcell upgrade of the Healthwatch deployment can fail at the smoke test with the following error: "server_error: server error: 503".

The corresponding error in grafana.log on the Grafana VM is:
 

logger=tsdb.prometheus t=2023-07-06T10:28:57.501798562Z level=error msg="Instant query failed" query=increase(tkgi_sli_failures_total[10m]) err="execution: server_error: server error: 503" 
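To confirm that a Grafana VM is showing this symptom, the log can be searched for failed Prometheus queries that returned a 503. The following is a minimal sketch, not an official tool: the function name count_503_errors is hypothetical, and the log path in the usage comment is an assumption based on the usual BOSH log layout, so adjust it to wherever grafana.log lives on your VM.

```shell
#!/usr/bin/env bash
# Sketch: count grafana.log lines where a Prometheus query failed with a 503.
# count_503_errors is a hypothetical helper name, not part of Healthwatch.
count_503_errors() {
  local logfile="$1"
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so "|| true" keeps the function from failing in that case.
  grep -c 'logger=tsdb.prometheus.*server error: 503' "$logfile" || true
}

# Example usage (path is an assumption, not taken from this article):
#   count_503_errors /var/vcap/sys/log/grafana/grafana.log
```

A non-zero count only indicates that the 503 symptom is present in the log; it does not by itself prove that the cause described below applies.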


From the BOSH view, all VMs are up and running.

Cause

This happens because the Prometheus write-ahead log (WAL) files are corrupted.

Resolution

To return to a healthy state, follow these steps:

  1. SSH to one of the TSDB VMs as root.
  2. Run "monit stop prometheus".
  3. Delete all files (not folders) from these directories: /var/vcap/store/prometheus/chunks_head/ and /var/vcap/store/prometheus/wal.
  4. Repeat steps 1 to 3 on all other TSDB VMs.
  5. Run "monit start prometheus" on all TSDB VMs.
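The file deletion in step 3 can be sketched as a small shell function. This is an illustration under stated assumptions, not a supported script: the function name clear_prometheus_files and its optional path argument are hypothetical, and "files (not folders)" is interpreted here as regular files directly inside each directory, with any subfolders and their contents left untouched.

```shell
#!/usr/bin/env bash
# Sketch only. Run as root on a TSDB VM, after "monit stop prometheus".
# clear_prometheus_files is a hypothetical helper name. The optional first
# argument overrides the store path and exists mainly for dry runs.
clear_prometheus_files() {
  local store="${1:-/var/vcap/store/prometheus}"
  local dir
  for dir in "$store/chunks_head" "$store/wal"; do
    if [ -d "$dir" ]; then
      # -maxdepth 1 -type f: remove only regular files at the top level of
      # the directory; subfolders and anything inside them are kept.
      find "$dir" -maxdepth 1 -type f -delete
    fi
  done
}

# Sequence on each TSDB VM, following the steps above:
#   monit stop prometheus
#   clear_prometheus_files
# Then, once every TSDB VM has been cleaned:
#   monit start prometheus
```

Deleting these files discards data that Prometheus had not yet written into persisted blocks, so it is worth pointing the function at a copy of the directory first to confirm it removes only what you expect.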