On the Healthwatch Grafana dashboard, the health state of all VMs except the BOSH Director is reported with gaps, even though no instances, and no jobs inside those instances, are unhealthy.
The issue usually occurs on large-scale platforms.
VM health is one type of gauge metric. The pas-exporter-gauge component in Healthwatch Exporter for TAS is responsible for transferring gauge data from the TAS Firehose v2 to Healthwatch Prometheus. The gauge metrics come from 4 major sources:
Currently, a single pas-exporter-gauge instance cannot process the data stream fast enough, so some metrics are dropped by Doppler and reverse-log-proxy (Firehose v2). This can be confirmed with the CF CLI log-cache-plugin, using these steps:
If the output shows a delta greater than 0 and a steadily increasing total for the dropped counter, as in the sample below, this indicates that the single pas-exporter-gauge instance cannot consume gauge metrics fast enough.
{"timestamp":"1723131798369992000","source_id":"reverse_log_proxy","instance_id":"","deprecated_tags":{},"tags":{"deployment":"cf-54926cb29f45e0241f81","direction":"egress","index":"522be8e9-eb4e-44be-9c50-f7f716cc40d0","ip":"10.225.30.216","job":"loggregator_trafficcontroller","metric_version":"2.0","origin":"loggregator.rlp","product":"VMware Tanzu Application Service","system_domain":"###.###.###"},"counter":{"name":"dropped","delta":"2343","total":"34694540"}}
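The check described above (delta greater than 0 and a growing total) can be sketched in a few lines of Python. This is a minimal illustration, not part of Healthwatch or the CF CLI: the function names `summarize_dropped` and `is_dropping` are hypothetical, and the sketch assumes the envelopes were captured one JSON object per line, in chronological order, in the same shape as the sample above. Samples are grouped by the `index` tag on the assumption that each reverse-log-proxy instance keeps its own running total.

```python
import json
from collections import defaultdict


def summarize_dropped(lines):
    """Summarize 'dropped' counters from reverse_log_proxy, per instance index.

    Assumes one JSON envelope per line, oldest first (an assumption of this
    sketch, not a documented guarantee of any tool).
    """
    per_instance = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            envelope = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON envelope
        if not isinstance(envelope, dict):
            continue
        counter = envelope.get("counter") or {}
        if envelope.get("source_id") != "reverse_log_proxy":
            continue
        if counter.get("name") != "dropped":
            continue
        index = (envelope.get("tags") or {}).get("index", "unknown")
        # delta and total arrive as strings in the sample, so convert them
        delta = int(counter.get("delta", 0))
        total = int(counter.get("total", 0))
        per_instance[index].append((delta, total))

    summary = {}
    for index, samples in per_instance.items():
        totals = [total for _, total in samples]
        summary[index] = {
            "samples": len(samples),
            "max_delta": max(delta for delta, _ in samples),
            "total_growth": totals[-1] - totals[0],
        }
    return summary


def is_dropping(summary):
    """True if any instance shows delta > 0 and a total that grew over time."""
    return any(
        s["max_delta"] > 0 and s["total_growth"] > 0 for s in summary.values()
    )
```

To use it, pass the captured output as an iterable of lines, for example an open file object. A single envelope is not enough to show that the total is increasing, so collect several samples over a few minutes before drawing a conclusion.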
Because pas-exporter-gauge cannot be scaled to multiple instances, you can try to
These workarounds add more CPU resources and reduce the data volume.
In addition, there is an ongoing internal discussion on how to resolve the issue at its root. Please contact support if you have any questions.