The following steps help you find out why Prometheus is failing.
1. Get the logs from the failing VM. In the sample above, the failure is on the tsdb VM:
bosh logs -d p-healthwatch2-b25ff6c377498bc06676 tsdb/1465eb1e-58a1-49e2-beeb-3d1a58823c85
2. Once you have downloaded the logs, check prometheus.stderr.log and look for log entries with level=error.
Take a closer look at the sample log below. After the TSDB started, Prometheus loaded its configuration file but could not parse the alerting rule UAAHighThroughputRate in group UAA because of a syntax error in the rule expression.
Because the configuration could not be applied, Prometheus stopped its remaining components.
ts=2022-09-26T07:44:51.388Z caller=main.go:996 level=info msg="TSDB started"
ts=2022-09-26T07:44:51.388Z caller=main.go:1177 level=info msg="Loading configuration file" filename=/var/vcap/jobs/prometheus/config/prometheus-interpolated.yml
ts=2022-09-26T07:44:51.499Z caller=manager.go:974 level=error component="rule manager" msg="loading groups failed" err="/var/vcap/jobs/prometheus/config/alerting.rules.yml: 250:11: group \"UAA\", rule 2, \"UAAHighThroughputRate\": could not parse expression: 1:43: parse error: unexpected character inside braces: ')'"
ts=2022-09-26T07:44:51.499Z caller=main.go:1203 level=error msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
ts=2022-09-26T07:44:51.499Z caller=main.go:831 level=info msg="Stopping scrape discovery manager..."
ts=2022-09-26T07:44:51.499Z caller=main.go:845 level=info msg="Stopping notify discovery manager..."
ts=2022-09-26T07:44:51.499Z caller=manager.go:951 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-09-26T07:44:51.499Z caller=manager.go:961 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-09-26T07:44:51.499Z caller=main.go:882 level=info msg="Stopping scrape manager..."
ts=2022-09-26T07:44:51.499Z caller=main.go:841 level=info msg="Notify discovery manager stopped"
ts=2022-09-26T07:44:51.500Z caller=main.go:874 level=info msg="Scrape manager stopped"
ts=2022-09-26T07:44:51.499Z caller=main.go:827 level=info msg="Scrape discovery manager stopped"
ts=2022-09-26T07:44:51.500Z caller=manager.go:937 level=info component="rule manager" msg="Starting rule manager..."
ts=2022-09-26T07:44:51.502Z caller=notifier.go:599 level=info component=notifier msg="Stopping notification manager..."
ts=2022-09-26T07:44:51.502Z caller=main.go:1103 level=info msg="Notifier manager stopped"
ts=2022-09-26T07:44:51.502Z caller=main.go:1112 level=error err="error loading config from \"/var/vcap/jobs/prometheus/config/prometheus-interpolated.yml\": one or more errors occurred while applying the new configuration (--config.file=\"/var/vcap/jobs/prometheus/config/prometheus-interpolated.yml\")"
Misconfigured alerts are a common reason why upgrades or Apply Changes fail on the prometheus job. Make sure the alerting rules are valid YAML and that each rule expression is valid PromQL. For more information on configuring alerts, see the product documentation on configuring alerts.
3. If the logged error is related to the tile configuration, correct the configuration and apply changes again.
4. If prometheus.stderr.log contains no errors or any other indication of why Prometheus failed, open a support request. Attach the logs collected in Step 1 and a support bundle, which can be downloaded from:
https://opsmgr-url/api/v0/support_bundle
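For step 2, a long prometheus.stderr.log can be filtered with a script instead of by eye. The following is a minimal Python sketch, not part of the Healthwatch or BOSH tooling. It assumes the log uses the key=value layout shown in the sample above. The names parse_line, error_entries and print_errors are hypothetical helpers introduced only for this illustration.

```python
import re

# Matches key=value pairs where the value is either a double-quoted string
# (backslash escapes allowed) or a run of non-space characters.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_line(line):
    """Return the key/value pairs of one log line as a dict."""
    fields = {}
    for key, value in PAIR.findall(line):
        if value.startswith('"'):
            # Drop the surrounding quotes and remove the backslash
            # in front of escaped characters such as \" and \\.
            value = re.sub(r'\\(.)', r'\1', value[1:-1])
        fields[key] = value
    return fields


def error_entries(lines):
    """Yield the parsed entries whose level field is exactly 'error'."""
    for line in lines:
        fields = parse_line(line)
        if fields.get("level") == "error":
            yield fields


def print_errors(path):
    """Print timestamp, caller, message and error text of each error entry."""
    # Assumption: path points to a prometheus.stderr.log extracted
    # from the archive downloaded in step 1.
    with open(path, encoding="utf-8", errors="replace") as log:
        for entry in error_entries(log):
            print(entry.get("ts"), entry.get("caller"),
                  entry.get("msg", ""), entry.get("err", ""))
```

Calling print_errors("prometheus.stderr.log") from the directory where the logs were extracted lists only the level=error entries. For the sample above, those are the entries logged by manager.go:974, main.go:1203 and main.go:1112.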