Metric loss in all Tenants

Article ID: 224305

Updated On:

Products

DX SaaS

Issue/Introduction

Metric gaps were observed on 09/19/2021 from 08:36 AM to 08:55 AM across all the collectors in every Tenant.

Please investigate and let us know what caused the issue.

 

Cause

Root Cause

The root cause investigation identified two causes of the data gaps within APM.

First, the UPS agent migration increased the load on one of the core components, which blocked ingestion for 45 seconds. When APM then tried to catch up on the data from those 45 seconds, this resulted in a further 45 seconds of delay while the service caught up.

Second, a problematic host was unable to connect to a remote service, which caused issues in storing the metrics for the tenant. The health probe detected the connectivity issue and automatically restarted the service, which resolved it.

https://bsg-confluence.broadcom.net/pages/viewpage.action?pageId=39063040

 

 

Environment

Release : SAAS

Component : Integration with APM

Resolution

 

Actions taken to avoid these issues in the future:

1. Enhance internal monitoring to identify unhealthy nodes and move pods to healthier nodes.

Due Date: Sep 28, 2021

Status: Completed

2. Improve the code to refresh configuration asynchronously, independent of metric ingestion.

Due Date: Nov 15, 2021

Status: In Progress
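
For readers unfamiliar with the pattern named in action 2, the following is a minimal, generic sketch of an asynchronous configuration refresh. It is not the DX SaaS implementation; the names AsyncConfigCache, ingest, and retention_days are hypothetical and chosen only for illustration. The idea is that a background thread reloads the configuration on a timer, while the ingestion path only reads the most recent snapshot and therefore never waits on a reload.

```python
import threading


class AsyncConfigCache:
    """Hypothetical cache: configuration is refreshed off the ingestion path."""

    def __init__(self, loader, interval_seconds):
        self._loader = loader              # assumed callable that returns a config dict
        self._interval = interval_seconds
        self._snapshot = loader()          # load once up front so readers always have a value
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        """Begin periodic refreshes in a background thread."""
        self._thread.start()

    def stop(self):
        """Stop the background thread (call only after start())."""
        self._stop.set()
        self._thread.join()

    def get(self):
        # Returns the last loaded snapshot; never calls the loader itself.
        return self._snapshot

    def refresh_once(self):
        try:
            self._snapshot = self._loader()
        except Exception:
            # A failed or slow reload must not affect ingestion:
            # keep serving the last good snapshot.
            pass

    def _run(self):
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._interval):
            self.refresh_once()


def ingest(metric_name, cache):
    """Hypothetical ingestion step that reads config without blocking on a reload."""
    config = cache.get()
    return {"name": metric_name, "retention_days": config["retention_days"]}
```

In this sketch, a reload that fails leaves the previous snapshot in place, so ingestion continues with slightly stale configuration instead of stalling.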