grafana VM fails in Healthwatch deployment

Article ID: 388052

Products

VMware Tanzu Application Service

Issue/Introduction

During an upgrade to Healthwatch 2.3.1, the grafana VM failed to come back up after being updated. Logging in to the grafana VM and running "monit summary" showed that the healthwatch_route_registrar job was failing, and the BOSH deploy task reported the following error:

Task 725235 | 18:37:52 | Updating instance grafana: grafana/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (0) (canary)
Task 725235 | 18:37:52 | L executing pre-stop: grafana/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (0) (canary)
Task 725235 | 18:37:52 | L executing drain: grafana/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (0) (canary)
Task 725235 | 18:37:53 | L stopping jobs: grafana/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (0) (canary)
Task 725235 | 18:38:02 | L executing post-stop: grafana/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (0) (canary)
Task 725235 | 18:41:46 | L installing packages: grafana/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (0) (canary)
Task 725235 | 18:41:56 | L configuring jobs: grafana/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (0) (canary)
Task 725235 | 18:41:56 | L executing pre-start: grafana/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (0) (canary)
Task 725235 | 18:41:57 | L starting jobs: grafana/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (0) (canary) (00:09:06)
                       L Error: 'grafana/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (0)' is not running after update. Review logs for failed jobs: healthwatch_route_registrar
Task 725235 | 18:46:58 | Error: 'grafana/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (0)' is not running after update. Review logs for failed jobs: healthwatch_route_registrar

Updating deployment:
  Expected task '725235' to succeed but state is 'error'
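
For reference, a minimal way to run the monit check, assuming access via the BOSH CLI (the deployment name and instance GUID are placeholders):

# SSH to the grafana VM
bosh -d <healthwatch-deployment-name> ssh grafana/<guid>

# become root and check job status; healthwatch_route_registrar will show as failing
sudo -i
monit summary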

Environment

Healthwatch 2.3.1

TAS 4.0.10

Cause

A firewall was blocking communication between the Healthwatch (healthwatch2) deployment and the TAS NATS VMs on port 4224.

Resolution

Collect logs from the grafana VM:

bosh -d <healthwatch-deployment-name> logs grafana/<guid>

or inspect them directly on the VM (a sketch follows the log entries below). The logs are under /var/vcap/sys/log/healthwatch_route_registrar. In the healthwatch_route_registrar.stdout.log file, you may see entries like this:

{"timestamp":"2025-02-10T23:25:58.089681603Z","level":"fatal","source":"Route Registrar","message":"Route Registrar.Exiting with error","data":{"error":"dial tcp 10.XXX.XXX.XXX:4224: i/o timeout","trace":"goroutine 1 [running]:\ncode.cloudfoundry.org/lager/v3.(*logger).Fatal(0xc0000e71f0, {0x7bae12, 0x12}, {0x84bea0, 0xc000018c80}, {0x0, 0x0, 0x0?})\n\t/Users/xxxxxxxx/go/pkg/mod/code.cloudfoundry.org/lager/[email protected]/logger.go:166 +0x1f3\nmain.main()\n\t/Users/xxxxxxxx/workspace/healthwatch/releases/grafana-release/src/route-registrar/main.go:157 +0x122b\n"}}
{"timestamp":"2025-02-10T23:26:09.104042887Z","level":"error","source":"Route Registrar","message":"Route Registrar.nats-connection-failed","data":{"error":"dial tcp 10.XXX.XXX.XXX:4224: i/o timeout","nats-hosts":["nats.service.cf.internal:4224"]}}
{"timestamp":"2025-02-10T23:26:09.104140436Z","level":"fatal","source":"Route Registrar","message":"Route Registrar.Exiting with error","data":{"error":"dial tcp 10.XXX.XXX.XXX:4224: i/o timeout","trace":"goroutine 1 [running]:\ncode.cloudfoundry.org/lager/v3.(*logger).Fatal(0xc0000e7180, {0x7bae12, 0x12}, {0x84bea0, 0xc000018c80}, {0x0, 0x0, 0x0?})\n\t/Users/xxxxxxxx/go/pkg/mod/code.cloudfoundry.org/lager/[email protected]/logger.go:166 +0x1f3\nmain.main()\n\t/Users/xxxxxxxx/workspace/healthwatch/releases/grafana-release/src/route-registrar/main.go:157 +0x122b\n"}}
{"timestamp":"2025-02-10T23:26:20.118600912Z","level":"error","source":"Route Registrar","message":"Route Registrar.nats-connection-failed","data":{"error":"dial tcp 10.XXX.XXX.XXX:4224: i/o timeout","nats-hosts":["nats.service.cf.internal:4224"]}}
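
As an alternative to pulling a full log bundle, the same file can be read in place on the VM (a minimal sketch, using the default log path noted above):

bosh -d <healthwatch-deployment-name> ssh grafana/<guid>
sudo tail -n 50 /var/vcap/sys/log/healthwatch_route_registrar/healthwatch_route_registrar.stdout.log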

 

The IP addresses in these log entries correspond to the TAS NATS VMs. We performed a BOSH restart of those VMs (bosh -d <TAS-deployment> restart nats) and then logged in to the grafana VM.
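
A minimal sketch of those two steps, with placeholder deployment and instance names:

# restart the NATS instances in the TAS deployment
bosh -d <TAS-deployment> restart nats

# then SSH to the Healthwatch grafana VM
bosh -d <healthwatch-deployment-name> ssh grafana/<guid>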

From there, we attempted to reach the NATS VMs using 'nc' (nc -vz <nats_VM_IP_address> 4224) and 'traceroute' (traceroute <nats_VM_IP_address> 4224); both commands hung, which indicated that traffic to the NATS VMs was being blocked at the firewall.
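
For reference, the connectivity checks looked like the following; the IP address is a placeholder, and on a healthy network nc reports success almost immediately:

# TCP check against NATS; a hang or timeout here points to a network/firewall block
nc -vz <nats_VM_IP_address> 4224

# trace the path toward the NATS VM to see where packets stop
# (a trailing number after the address is treated as packet length, not a port; to probe
# the TCP port specifically, traceroute -T -p 4224 <nats_VM_IP_address> can be used if the
# installed traceroute supports TCP mode)
traceroute <nats_VM_IP_address>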

 

The customer must contact their internal firewall team and let them know that communication from the Healthwatch grafana VM to the TAS NATS VMs on TCP port 4224 is required for the healthwatch_route_registrar job to start successfully.
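
Once the firewall rule is in place, a quick way to confirm the fix (reusing the placeholders above) is to re-test the port from the grafana VM and then restart the instance or re-run the failed update:

# from the grafana VM: this should now succeed instead of hanging
nc -vz nats.service.cf.internal 4224

# from a workstation with the BOSH CLI: restart the grafana instance
bosh -d <healthwatch-deployment-name> restart grafana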