Troubleshooting a failing prometheus job when applying changes to the Healthwatch tile

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Upgrading the Healthwatch tile, or applying changes to it, fails with a generic error such as the following:

Task 833007 | 03:35:16 | Preparing deployment: Preparing deployment (00:00:05)
Task 833007 | 03:35:21 | Preparing deployment: Rendering templates (00:00:14)
Task 833007 | 03:35:36 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 833007 | 03:35:38 | Updating instance tsdb: tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85 (0) (canary)
Task 833007 | 03:35:41 | L executing pre-stop: tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85 (0) (canary)
Task 833007 | 03:35:41 | L executing drain: tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85 (0) (canary)
Task 833007 | 03:35:42 | L stopping jobs: tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85 (0) (canary)
Task 833007 | 03:36:10 | L executing post-stop: tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85 (0) (canary)
Task 833007 | 03:36:19 | L installing packages: tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85 (0) (canary)
Task 833007 | 03:36:22 | L configuring jobs: tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85 (0) (canary)
Task 833007 | 03:36:22 | L executing pre-start: tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85 (0) (canary)
Task 833007 | 03:36:23 | L starting jobs: tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85 (0) (canary) (00:05:46)
L Error: 'tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85 (0)' is not running after update. Review logs for failed jobs: prometheus
Task 833007 | 03:41:25 | Error: 'tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85 (0)' is not running after update. Review logs for failed jobs: prometheus
Task 833007 Started Mon Sep 26 03:35:16 UTC 2022
Task 833007 Finished Mon Sep 26 03:41:25 UTC 2022
Task 833007 Duration 00:06:09
Task 833007 error
Updating deployment: Expected task '833007' to succeed but state is 'error'
Exit code 1
===== 2022-09-26 03:41:25 UTC Finished "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=10.225.6.11 --deployment=p-healthwatch2-b25ff6c377498bc06676 deploy --no-redact /var/tempest/workspaces/default/deployments/p-healthwatch2-b25ff6c377498bc06676.yml"; Duration: 379s; Exit Status: 1
Exited with 1.
Exited with 1.

This article explains how to troubleshoot the failure and where to find more detail about why the prometheus job is failing.

Environment

Product Version: 2.12

Resolution

The following steps show how to find out why the prometheus job is failing.

1. Get the logs from the failing VM. In the sample above, the failure is on the tsdb VM:

bosh logs -d p-healthwatch2-b25ff6c377498bc06676 tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85
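The `bosh logs` command saves the logs as a compressed archive (.tgz) in the current directory. The sketch below unpacks such an archive and lists any prometheus.stderr.log it contains. The function name `unpack_bosh_logs` is hypothetical, and the file is located by name because the directory layout inside the archive is an assumption here, not something this article confirms.

```shell
# Hypothetical helper: unpack a `bosh logs` archive and print the path of
# every prometheus.stderr.log found inside it.
# Usage: unpack_bosh_logs <archive.tgz> <destination-dir>
unpack_bosh_logs() {
  archive="$1"
  dest="$2"
  mkdir -p "$dest" || return 1
  tar -xzf "$archive" -C "$dest" || return 1
  # Search by file name rather than hard-coding the archive's directory layout.
  find "$dest" -type f -name 'prometheus.stderr.log'
}
```

For example, `unpack_bosh_logs ./downloaded-logs.tgz ./tsdb-logs`, where both paths are placeholders for your own archive name and target directory.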

2. Once you have downloaded the logs, check prometheus.stderr.log and look for log entries with level=error.

Look closely at the sample logs below. Prometheus on the tsdb VM started and began loading its configuration file, but it could not load the configured alerting rules because one rule expression contains a syntax error.
Because the configuration could not be applied, Prometheus stopped its other components.

ts=2022-09-26T07:44:51.388Z caller=main.go:996 level=info msg="TSDB started"
ts=2022-09-26T07:44:51.388Z caller=main.go:1177 level=info msg="Loading configuration file" filename=/var/vcap/jobs/prometheus/config/prometheus-interpolated.yml
ts=2022-09-26T07:44:51.499Z caller=manager.go:974 level=error component="rule manager" msg="loading groups failed" err="/var/vcap/jobs/prometheus/config/alerting.rules.yml: 250:11: group \"UAA\", rule 2, \"UAAHighThroughputRate\": could not parse expression: 1:43: parse error: unexpected character inside braces: ')'"
ts=2022-09-26T07:44:51.499Z caller=main.go:1203 level=error msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
ts=2022-09-26T07:44:51.499Z caller=main.go:831 level=info msg="Stopping scrape discovery manager..."
ts=2022-09-26T07:44:51.499Z caller=main.go:845 level=info msg="Stopping notify discovery manager..."
ts=2022-09-26T07:44:51.499Z caller=manager.go:951 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-09-26T07:44:51.499Z caller=manager.go:961 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-09-26T07:44:51.499Z caller=main.go:882 level=info msg="Stopping scrape manager..."
ts=2022-09-26T07:44:51.499Z caller=main.go:841 level=info msg="Notify discovery manager stopped"
ts=2022-09-26T07:44:51.500Z caller=main.go:874 level=info msg="Scrape manager stopped"
ts=2022-09-26T07:44:51.499Z caller=main.go:827 level=info msg="Scrape discovery manager stopped"
ts=2022-09-26T07:44:51.500Z caller=manager.go:937 level=info component="rule manager" msg="Starting rule manager..."
ts=2022-09-26T07:44:51.502Z caller=notifier.go:599 level=info component=notifier msg="Stopping notification manager..."
ts=2022-09-26T07:44:51.502Z caller=main.go:1103 level=info msg="Notifier manager stopped"
ts=2022-09-26T07:44:51.502Z caller=main.go:1112 level=error err="error loading config from \"/var/vcap/jobs/prometheus/config/prometheus-interpolated.yml\": one or more errors occurred while applying the new configuration (--config.file=\"/var/vcap/jobs/prometheus/config/prometheus-interpolated.yml\")"
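In a long log, the error entries are easy to miss among the info lines. A minimal sketch for pulling out only the level=error entries, with their line numbers in the log file, might look like this. The function name `prometheus_errors` is hypothetical.

```shell
# Hypothetical helper: print every level=error entry from a Prometheus log,
# prefixed with its line number in that file.
# Usage: prometheus_errors <path/to/prometheus.stderr.log>
prometheus_errors() {
  grep -n 'level=error' "$1"
}
```

Run against the sample above, this would leave only the three level=error lines: the rule manager failure, "Failed to apply configuration", and the final "error loading config" entry.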

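Misconfigured alerts are a common cause of upgrades or Apply Changes failing on the prometheus job. Make sure the alerting rules are valid YAML and that each rule expression parses. In the sample above, the expression for the UAAHighThroughputRate rule does not. For more information on configuring alerts, see the Healthwatch documentation.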
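The rule manager error in the sample names the rules file, what appears to be a line and column in that file (250:11), and the group and rule that failed. The sketch below extracts the file, line, and column from such entries so that the offending rule can be found quickly. The function name `rule_error_location` is hypothetical, and the pattern is an assumption based only on the format of the sample error above.

```shell
# Hypothetical helper: read Prometheus log lines on stdin and print
# "<rules file> <line> <column>" for each rule-loading error that names
# a position in a rules file. Lines in any other format are ignored.
rule_error_location() {
  sed -n 's/.*err="\([^:"]*\): \([0-9][0-9]*\):\([0-9][0-9]*\): group.*/\1 \2 \3/p'
}
```

For example, `rule_error_location < prometheus.stderr.log`. With the file and line in hand, the rule can be inspected on the VM itself (for instance after `bosh ssh`) with something like `sed -n '250p' /var/vcap/jobs/prometheus/config/alerting.rules.yml`, and the corresponding alert can then be corrected in the tile configuration.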

3. If the error in the log is related to the tile configuration, correct the configuration and apply changes again.
4. If prometheus.stderr.log contains no errors and no other indication of why prometheus failed, open a support request. Attach the logs collected in Step 1 and a support bundle, which can be downloaded from https://opsmgr-url/api/v0/support_bundle