Many agents were in the Stopped and Reconnected states (states 5 and 6).
The agent count dropped at the same time on both Cluster Collectors/MOM while the EMs all remained healthy, so an issue with the Cluster/EMs is unlikely.
No other tenants saw their agent count drop at the same time, so an issue with the backend shared services is unlikely as well.
We suspect that the cause is on the customer side, for example the application, proxy, firewall, etc.
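The correlation check described above (agent counts dropping on every Collector at the same moment) could be sketched roughly as follows. This is only an illustration: the function name, the collector names, and the sample numbers are hypothetical and are not taken from the incident data, and it assumes the per-Collector agent counts have already been exported as equally spaced samples.

```python
from typing import Dict, List


def simultaneous_drops(counts: Dict[str, List[int]], min_drop: int = 1) -> List[int]:
    """Return the sample indexes at which every series dropped by at least min_drop."""
    series = list(counts.values())
    if not series:
        return []
    # Only compare the range that all series have in common.
    length = min(len(s) for s in series)
    return [
        i
        for i in range(1, length)
        if all(s[i - 1] - s[i] >= min_drop for s in series)
    ]


# Illustrative numbers only (hypothetical), not real measurements.
sample = {
    "collector-a": [120, 120, 95, 96, 118],
    "collector-b": [80, 81, 60, 61, 79],
}
print(simultaneous_drops(sample, min_drop=10))  # prints [2]
```

A drop that shows up at the same index on every Collector points away from a single Collector and toward something shared upstream of them, such as the network path from the agents.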
If a Cloud Proxy is involved, note that the Cloud Proxy has its own metrics that can be checked.
But in this case, something in your infrastructure appears to be impacting the agents. So the next steps would be:
- Diagramming the end-to-end flow.
- Looking at the logs for each component as needed.
- Determining whether settings such as timeouts, security, ports, etc. are impacting the agent.
- Seeing if there are any patterns such as traffic loads, latency, etc.
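
As one possible way to start on the port, timeout, and latency checks listed above, the sketch below measures TCP connect time from an agent host to the next hop (proxy, firewall VIP, or Collector). It is a minimal example and not a product tool: `probe` and `probe_series` are hypothetical helper names, and the host, port, and timeout values must be replaced with the ones from your own environment.

```python
import socket
import time
from typing import List, Optional


def probe(host: str, port: int, timeout: float = 3.0) -> Optional[float]:
    """Return the TCP connect time in seconds, or None if the connection fails."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def probe_series(
    host: str,
    port: int,
    attempts: int = 5,
    interval: float = 1.0,
    timeout: float = 3.0,
) -> List[Optional[float]]:
    """Probe repeatedly so that intermittent failures or latency spikes become visible."""
    results: List[Optional[float]] = []
    for n in range(attempts):
        results.append(probe(host, port, timeout))
        if n < attempts - 1:
            time.sleep(interval)
    return results
```

For example, calling `probe_series("<next-hop-host>", <port>, attempts=10)` from an affected agent host during a period when agents are dropping, and comparing the results hop by hop, may help show where connections fail or slow down. `None` entries indicate failed connections; the host and port here are placeholders.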