This article is about identifying when agents disconnect and then reconnect because of network/connectivity issues, that is, differentiating between two cases:
The agent stops because the app server or the IA stops. Let's call this normal reconnection.
The connection to the agent is lost for any other reason (crash, network, VPN, etc.; anything but an orderly shutdown). Let's call this connection interruption.
There is currently nothing in the EM or Agent that explicitly differentiates between these disconnect/reconnect conditions.
The agent is reconnected to another collector:
Release : 20.2
Component : APM Agents
The reasons detailed above render ConnectionStatus useless on its own for determining an agent's Connection State. Instead, “alive metrics” that are immediately impacted by an agent connection disruption are used: % Time Spent in GC, % CPU Utilization (Host), or Bytes In Use. They are used in conjunction with the ConnectionStatus metric to determine the agent’s Connection State:
Connection State is reported continuously with the state determined:
Timings are optionally reported, depending on the ReportTimings flag:
The calculator runs on the MoM.
The ConnectionStatus values reported per agent by the collectors are used:
"Alive metrics" are used to determine if the agent is actually alive, connected, and reporting:
Launch Time reported by agents. A timestamp for the latest agent start.
There are two parts to the script's processing:
Because ConnectionStatus is reported by each of a cluster's collectors, a reconnected agent may have a Disconnected ConnectionStatus with one or more collectors and a Connected ConnectionStatus with a single collector (e.g. due to load balancing). ConnectionStatus is therefore consolidated per agent: if the agent is connected to any collector, the consolidated value is Connected; otherwise it is the highest ConnectionStatus seen.
Reception of "Alive metrics" is consolidated into an Alive flag per agent: if any alive metric is received for an agent in a cycle, that agent is flagged as Alive; otherwise it is not.
For all-in-one EMs (on-prem or SaaS) consolidation is not necessary.
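The consolidation described above can be sketched roughly as follows. This is a minimal illustration in Python, not the calculator script itself. The function names and the input shapes are hypothetical, and the sketch assumes that a ConnectionStatus of 1 means "connected" (as the example in this article suggests); the other status values used below are illustrative only.

```python
from typing import Dict, Iterable, List, Set, Tuple

CONNECTED = 1  # assumption: 1 means "connected", as in the article's example


def consolidate_status(per_collector: Iterable[int]) -> int:
    """Return CONNECTED if any collector reports it, else the highest status seen."""
    statuses = list(per_collector)
    if not statuses:
        raise ValueError("no ConnectionStatus values reported for this agent")
    if CONNECTED in statuses:
        return CONNECTED
    return max(statuses)


def consolidate_cycle(
    status_reports: Iterable[Tuple[str, int]],
    alive_agents: Iterable[str],
) -> Dict[str, Tuple[int, bool]]:
    """Consolidate one cycle of reports into {agent: (status, alive)}.

    status_reports: (agent, ConnectionStatus) pairs from all collectors.
    alive_agents:   agents for which at least one alive metric arrived.
    """
    by_agent: Dict[str, List[int]] = {}
    for agent, status in status_reports:
        by_agent.setdefault(agent, []).append(status)
    alive: Set[str] = set(alive_agents)
    return {
        agent: (consolidate_status(statuses), agent in alive)
        for agent, statuses in by_agent.items()
    }
```

For example, an agent that two collectors report as 3 and one collector reports as 1 consolidates to 1, while an agent reported only as 2 and 3 consolidates to 3.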
The Connection State metric is determined when an agent becomes disrupted (i.e. no alive metric has been received for the cycle):
This Connection State is kept until the agent is again seen to be Alive (i.e. alive metrics are received). ConnectionStatus may keep changing during the disruption because of grace delays in the collectors' connection management; these changes are ignored because they are unrelated to the cause of the connection disruption.
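The "determine once, then hold" behaviour described above might look like the following sketch. It is an illustration under stated assumptions, not the script's actual code: the class name, the `classify` callback, and the `alive_state` value are all hypothetical, and the real mapping from ConnectionStatus to Connection State values is deliberately left to the caller.

```python
from typing import Any, Callable, Dict


class ConnectionStateTracker:
    """Latches the Connection State chosen at the moment an agent is disrupted."""

    def __init__(self, classify: Callable[[int], Any], alive_state: Any) -> None:
        # classify: hypothetical mapping from consolidated ConnectionStatus
        #           to a Connection State, applied when the disruption starts.
        # alive_state: hypothetical state reported while the agent is Alive.
        self._classify = classify
        self._alive_state = alive_state
        self._latched: Dict[str, Any] = {}

    def update(self, agent: str, alive: bool, status: int) -> Any:
        """Return the Connection State for this agent for the current cycle."""
        if alive:
            # Agent is reporting again: release any latched state.
            self._latched.pop(agent, None)
            return self._alive_state
        if agent not in self._latched:
            # First cycle without alive metrics: decide the state now.
            self._latched[agent] = self._classify(status)
        # Later ConnectionStatus changes are ignored until the agent is Alive.
        return self._latched[agent]
```

With a placeholder classifier such as `lambda s: f"disrupted-with-status-{s}"`, an agent that goes silent while its status is 1 keeps reporting `disrupted-with-status-1` even if the collectors later change its status to 3, and returns to the alive state once alive metrics are received again.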
The following example shows the result detected when the network connection for an agent was interrupted. Note that the ConnectionStatus remains at 1 but the ConnectionState goes to 5 until the network interruption is over.
Henrik Nissen Ravn, ESD Solution Engineering.
Copyright Broadcom 2021.
All rights reserved.
Any use requires an active DX APM license.