Observed metric gaps on 09/19/2021 from 08:36 AM -08:55 AM , across all the collectors in every Tenant.
Please investigate and let know know what caused the issue.
Root Cause
During root cause investigation, it was identified there were two causes that resulted in the data gaps within APM. Due to the UPS agent migration, It was identified that there was an increase in load on one of the core components which resulted in a block of ingestion for 45 seconds. When APM tried to catch up with the data within that 45 seconds it resulted in a further 45 seconds as the service was catching up. During the investigation it was also identified that there was a problematic host that was unable to connect to a remote service which caused issues in storing the metrics for the tenant. To resolve this issue the health probe detected a connectivity issue and automatically restarted the service.
https://bsg-confluence.broadcom.net/pages/viewpage.action?pageId=39063040
Release : SAAS
Component : Integration with APM
Action taken to avoid these in the future:
1. Enhance internal monitoring to identify unhealthy nodes and move pods to healthier nodes.
Due Date: Sep 28, 2021
Status: Completed
2. Code improvement to use asynchronous refresh of configuration that is independent of metric ingestion
Due date; Nov 15, 2021
Status: In Progress