Metrics Store not reachable over gorouter because metrics stores health endpoint crashes and metrics graph not available

search cancel

Metrics Store not reachable over gorouter because metrics stores health endpoint crashes and metrics graph not available

book

Article ID: 417880

calendar_today

Updated On:

Products

VMware Tanzu Application Service

Issue/Introduction

Outage of the metrics store tile and related metrics can be observed during stemcell update of the time and VM recreation process.

Visible outcome: it is not possible to see any metrics in the apps metrics dashboard. All metrics panel have a red border indicating a technical problem.

The route registrar is running a health check and if it fails no route will be registered at the gorouter and the process running on the port the health check is trying to reach is unavailable

The problem occurs during stemcell update

  stemcells:
  - name: bosh-xxxxxx-ubuntu-jammy-go_agent
-   version: '1.894'
+   version: '1.906'

In case of rolling back the same issue will be still present for 10 min and then recover.

The snippet below is from the metric-store/metric-store.stderr.log

{"level":"info","timestamp":"2025-09-26T14:31:48.365Z","app":"metric-store.appMetrics-XXXXXXXXX,XXXXX,config-service","message":"Discoverer channel closed","level":"debug","component":"discovery
manager notify","provider":"static/0"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x2936bb5]

goroutine 578 [running]:
github.com/cloudfoundry/metric-store-release/src/internal/storage.(*ReplicatedQuerier).queryWithNodeFailover(0xc16ff885a0, {0x3c9d0b8, 0xc17001a2d0}, {0xc11089f6c0, 0x2, 0x2}, 0xd0?, 0x1?, {0xc16
f3e0c40, 0x5, ...})
        /var/vcap/data/compile/metric-store/src/internal/storage/replicated_querier.go:187 +0x115
github.com/cloudfoundry/metric-store-release/src/internal/storage.(*ReplicatedQuerier).queryWithRetries(0xc5617b?, {0x3c9d0b8, 0xc17001a2d0}, {0xc11089f6c0, 0x2, 0x2}, 0x58?, 0x33c7520?, {0xc16f3
e0c40, 0x5, ...})

Environment

Metric store 1.7

Cause

This issue will occur when queries are executed during cluster instability (like stemcell upgrades or node failures). The system tries to query a failed node and doesn't properly handle the nil response, causing a panic.

Stemcell Upgrade scenario: During the upgrade, nodes are being recreated/restarted (resulting in connection refused). Let's say node3 is getting restarted here
Query Execution: Once metric-store job is started, it executes a prometheus query /api/v1/query_range from the restarted node (node3), at the exact same time node2 is getting restarted
Timing: During the upgrade, the routing table will still list the stopped node as available, causing queries to be routed to it
Failover Logic Activates:

queryWithNodeFailover is called to query across nodes
querierFactory.Build() creates queriers for multiple nodes
For the stopped node, NewRemoteQuerier() might succeed in creating a querier object (constructor completes), but the querier is in a bad state
When remoteQuerier.Select() is called on this bad querier, it returns nil instead of a valid SeriesSet object
The code then tries to call result.Err() on a nil result → PANIC

Resolution

While the system recovers automatically, the specific query fails "remoteQuerier.Select()". When upgrade completes issues will resolve on its own. This issue is fixed in Metrics Store version v1.8.2

Feedback

thumb_up Yes

thumb_down No