HealthWatch Grafana VM health dashboard shows gaps intermittently

Article ID: 371547

Products

VMware Tanzu Application Service

Issue/Introduction

On the Healthwatch Grafana dashboard, the health state of every VM except the BOSH Director is intermittently reported with gaps, even though none of the instances, or the jobs running inside them, are unhealthy.

 

Environment

  • TAS 2.0 and above
  • Healthwatch v2.x

Cause

This issue usually occurs on large-scale platforms.

VM health is one type of gauge metric. The pas-exporter-gauge component in Healthwatch Exporter for TAS is responsible for transferring gauge data from the TAS Firehose v2 to the Healthwatch Prometheus instance. The gauge metrics come from four major sources (a query sketch follows the list):

  • system metrics (such as CPU, memory, and disk usage) from all instances
  • TAS job metrics
  • app container metrics on TAS
  • job metrics from other tiles, such as RabbitMQ
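
One way to confirm which VMs are missing data is to query the Healthwatch Prometheus instance directly for the window where the dashboard shows a gap. This is a minimal sketch, not a verified procedure: the endpoint localhost:9090 on the Healthwatch tsdb VM and the gauge name `system_healthy` are assumptions to verify in your environment.

```
# Run on the Healthwatch tsdb VM (assumed Prometheus endpoint: localhost:9090).
# Query the assumed per-VM health gauge over the window where the dashboard
# shows a gap; a series with missing samples in that window indicates the
# gauge was dropped upstream before it reached Prometheus.
curl -s 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=system_healthy' \
  --data-urlencode 'start=2024-08-08T00:00:00Z' \
  --data-urlencode 'end=2024-08-08T01:00:00Z' \
  --data-urlencode 'step=15s'
```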

Currently, a single pas-exporter-gauge instance cannot process the data stream fast enough, so some metrics get dropped by Doppler and the Reverse Log Proxy (Firehose v2). This can be confirmed with the CF CLI log-cache plugin, using the following steps (a consolidated example session follows the list):

  1. `bosh ssh` into the pas-exporter-gauge instance in the Healthwatch Exporter for TAS deployment.
  2. Run `netstat -anpt | grep 8082` to identify which TAS Traffic Controller instance (where the Reverse Log Proxy runs) the pas-exporter-gauge process connects to.
  3. Run `cf tail reverse_log_proxy -f --json | egrep "counter.*dropped" | grep <IP_ADDRESS>`, where `<IP_ADDRESS>` is the Traffic Controller IP found in step 2.
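
As a consolidated sketch of the steps above (the deployment name and instance index are placeholders; adjust them to your environment):

```
# 1. SSH into the pas-exporter-gauge VM in the Healthwatch Exporter for TAS
#    deployment (find the actual deployment name with `bosh deployments`)
bosh -d <healthwatch-deployment> ssh pas-exporter-gauge/0

# 2. On that VM, find which Traffic Controller the exporter connects to
#    (8082 is the Reverse Log Proxy's gRPC port on the Traffic Controller VM)
netstat -anpt | grep 8082

# 3. From a workstation with the log-cache CLI plugin installed, watch the
#    Reverse Log Proxy "dropped" counter for that Traffic Controller's IP
cf tail reverse_log_proxy -f --json | egrep "counter.*dropped" | grep <IP_ADDRESS>
```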

If the output shows a delta > 0 and a continuously increasing total, this confirms that the single pas-exporter-gauge instance cannot consume the gauge metrics fast enough. Example output:

{"timestamp":"1723131798369992000","source_id":"reverse_log_proxy","instance_id":"","deprecated_tags":{},"tags":{"deployment":"cf-54926cb29f45e0241f81","direction":"egress","index":"522be8e9-eb4e-44be-9c50-f7f716cc40d0","ip":"10.225.30.216","job":"loggregator_trafficcontroller","metric_version":"2.0","origin":"loggregator.rlp","product":"VMware Tanzu Application Service","system_domain":"###.###.###"},"counter":{"name":"dropped","delta":"2343","total":"34694540"}}

Resolution

As pas-exporter-gauge cannot be scaled out to multiple instances, you can try the following workarounds:

  • vertically scale pas-exporter-gauge with more CPU
  • increase TAS > System Logging > System metrics scrape interval from 15s to a larger value such as 60s
  • increase the RabbitMQ metrics polling interval (if RabbitMQ is deployed on the platform) from 30s to a larger value such as 60s

These workarounds add CPU headroom and reduce the data volume. A sketch for checking their effect follows.
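
To check whether the workarounds are sufficient, you can compare CPU load on the pas-exporter-gauge VM before and after the change (the deployment name is a placeholder):

```
# Sustained high CPU on pas-exporter-gauge suggests the instance is still
# saturated and needs further vertical scaling or larger scrape intervals
bosh -d <healthwatch-deployment> vms --vitals | grep pas-exporter-gauge
```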

In addition, a fix that addresses the root cause is under internal discussion. Please contact support if you have any questions.