Link to download the connector
https://github.gwd.broadcom.net/ESD/APM-Agent-Connectivity
Readme information
Agent Connectivity
DX APM Javascript Calculator for Agent Connectivity Metrics
Problem Statement
As per Sean: This is about identifying when agents disconnect and then reconnect due to network/connectivity issues. The goal is to differentiate between:
- An agent stopping (and disconnecting) and then restarting (and re-connecting).
Let's call this normal reconnection.
- An agent disconnecting due to network interruption and then reconnecting without the agent having stopped.
Let's call this connection interruption.
There is currently nothing in the EM or Agent to explicitly differentiate between these disconnect/reconnect conditions.
Agent disconnection/reconnection Scenarios.
I: Orderly shutdown:
The agent stops due to app server stopping or IA stopping
Metrics reflection of disconnection
- ConnectionStatus changes from 1 to 3 immediately
- ConnectionStatus=3 reports for 30 minutes, then ceases to report
- Launch Time ceases to report when ConnectionStatus=3 (except for 1st cycle)
- % Time Spent in GC resp. % CPU Utilization (Host) resp. Bytes In Use stop reporting immediately on orderly shutdown
Metrics reflection of reconnection:
- ConnectionStatus=1 resumes reporting, possibly after Launch Time
- A changed Launch Time resumes reporting, possibly after ConnectionStatus=1
II: Interruption:
The connection to the agent is lost for whatever reason (crash, network, VPN, etc – anything but orderly shutdown).
Metrics reflection of disconnection
- ConnectionStatus=1 continues to report for 20 mins, then reports 3 for 30 mins, then stops
- Launch Time ceases to report when ConnectionStatus=3 (except for 1st cycle)
- % Time Spent in GC resp. % CPU Utilization (Host) resp. Bytes In Use stop reporting immediately on interruption
Metrics reflection of reconnection
- % Time Spent in GC resp. % CPU Utilization (Host) resp. Bytes In Use start reporting
- ConnectionStatus=1 resumes reporting, possibly after Launch Time
- The same Launch Time resumes reporting, possibly after ConnectionStatus=1
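The ConnectionStatus timelines of scenarios I and II can be summarised in a small sketch. The function name `expectedConnectionStatus`, the scenario labels, and the handling of the exact boundary minutes are assumptions made for illustration. Only the 20 and 30 minute durations come from the description above.

```javascript
// Expected ConnectionStatus value a given number of minutes after the agent
// disconnects. Returns null once the metric has ceased to report.
// Hypothetical helper, not part of the calculator script.
function expectedConnectionStatus(scenario, minutesSinceDisconnect) {
  const t = minutesSinceDisconnect;
  if (scenario === "orderly") {
    // Scenario I: 3 is reported immediately, for 30 minutes.
    return t < 30 ? 3 : null;
  }
  if (scenario === "interruption") {
    // Scenario II: 1 for 20 minutes, then 3 for 30 minutes, then nothing.
    if (t < 20) return 1;
    if (t < 50) return 3;
    return null;
  }
  throw new Error("unknown scenario: " + scenario);
}
```

For example, ten minutes after an interruption the collector would still be expected to report ConnectionStatus=1, which is why ConnectionStatus alone cannot signal the disconnect promptly.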
III: Load Balancing:
The agent is reconnected to another collector:
- Load change or loadbalancing.xml change
- Connector stopped
- Connector connection lost
The agent reconnects via MoM.
Notes
- A ConnectionStatus metric that is not reporting means no data was received. This differs from reporting ConnectionStatus=0 (No Data status).
- % Time Spent in GC metric is available for Java and Infrastructure agents. % CPU Utilization (Host) metric is available for .Net agents. Bytes In Use is available for the EPA and the DxC Agent. All are of metric types that cease to report immediately on agent disruption – stopped or interrupted (types LongFluctuatingCounter and IntegerPercentage respectively)
- Agent collector is available as part of the agent specifier for ConnectionStatus metric for clustered EMs.
Disconnection and Reconnection Determination
The reasons detailed above render ConnectionStatus useless on its own for determining an agent's Connection State. Instead, “alive metrics” that are immediately impacted by agent connection disruption are used: % Time Spent in GC resp. % CPU Utilization (Host) resp. Bytes In Use. They are used in conjunction with the ConnectionStatus metric to determine the agent’s Connection State:
- noData – 0
- alive – 1
- reconnected – 2
- stopped – 3
- unstable – 4
- interrupted – 5
- stale – 6
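As an illustration, the state codes above can be held in a small lookup table. This is a sketch only: the identifiers `ConnectionState` and `stateName` are hypothetical and are not taken from the calculator script.

```javascript
// Connection State codes as listed above. The identifier names are
// illustrative assumptions, not names from the calculator script.
const ConnectionState = Object.freeze({
  noData: 0,
  alive: 1,
  reconnected: 2,
  stopped: 3,
  unstable: 4,
  interrupted: 5,
  stale: 6,
});

// Reverse lookup: numeric code to state name, "unknown" for any other value.
function stateName(code) {
  const entry = Object.entries(ConnectionState).find(([, value]) => value === code);
  return entry ? entry[0] : "unknown";
}
```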
Metrics’ reflection of disconnection
- ConnectionStatus changes from 1 to 3 on relinquishing collector
- % Time Spent in GC resp. % CPU Utilization (Host) resp. Bytes In Use continues to report.
Metrics’ reflection of reconnection:
- ConnectionStatus=1 starts to report on continuing collector
- Unchanged Launch Time continues to report
- % Time Spent in GC resp. % CPU Utilization (Host) resp. Bytes In Use continue to report.
Disconnected state
- When % Time Spent in GC resp. % CPU Utilization (Host) resp. Bytes In Use stop reporting
Disconnection cause
- Disconnected and ConnectionStatus=1 means agent interruption
- Disconnected and ConnectionStatus=3 means agent stopped
- Disconnected and Changed agent collector means agent reconnection
Reconnection determination
- When % Time Spent in GC resp. % CPU Utilization (Host) resp. Bytes In Use start reporting
Disruption cause confirmation
- Changed Launch Time: Stopped
- Same Launch Time: Interrupted
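The rules above can be sketched as two small functions. The function names, argument shapes, the precedence given to a changed collector, and the "undetermined" fallback are assumptions for illustration. The README lists the rules but does not say how they are coded.

```javascript
// Hypothetical helpers; names and argument shapes are assumptions.

// Initial cause while the agent is disconnected (no alive metric received).
function disconnectionCause(connectionStatus, collectorChanged) {
  if (collectorChanged) return "reconnected";      // agent moved to another collector
  if (connectionStatus === 1) return "interrupted";
  if (connectionStatus === 3) return "stopped";
  return "undetermined";                           // case not covered by the rules above
}

// Confirmation once alive metrics resume: compare Launch Time before and after.
function confirmedCause(previousLaunchTime, currentLaunchTime) {
  return previousLaunchTime === currentLaunchTime ? "interrupted" : "stopped";
}
```

A changed Launch Time implies the agent process restarted, so the disruption is confirmed as a stop. An unchanged Launch Time implies the same process kept running through the outage.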
Metrics Reported
Connection State is reported continuously with the state determined:
- noData – 0
- alive – 1
- reconnected – 2
- stopped – 3
- unstable – 4
- interrupted – 5
- stale – 6
Timings are optionally reported, depending on the ReportTimings flag:
- Stopped Time: reported when Connection State=3, increasing by 15 secs per cycle while in the Stopped Connection State.
- Interrupted Time: reported when Connection State=5, increasing by 15 secs per cycle while in the Interrupted Connection State.
- Reconnection Time: reported when reconnection is determined. A reconnection is seen as Stopped until the agent is reconnected to the continuing collector; the Stopped Time seen up to that point is then reported as Reconnection Time for one cycle.
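A minimal sketch of how these timing metrics could be accumulated per cycle follows. The function `updateTimings`, the agent record shape, and the choice to report 15 on the first disrupted cycle are assumptions. Only the 15 second increment, the state codes, and the metric names come from the text above.

```javascript
const CYCLE_SECONDS = 15; // increment per cycle, as stated above
const RECONNECTED = 2, STOPPED = 3, INTERRUPTED = 5;

// agent: { stoppedTime, interruptedTime } (hypothetical per-agent record).
// Returns the timing metrics to report this cycle; empty when reportTimings is false.
function updateTimings(agent, state, reportTimings) {
  const metrics = {};
  if (state === STOPPED) {
    agent.stoppedTime += CYCLE_SECONDS;
    metrics["Stopped Time"] = agent.stoppedTime;
  } else if (state === INTERRUPTED) {
    agent.interruptedTime += CYCLE_SECONDS;
    metrics["Interrupted Time"] = agent.interruptedTime;
  } else {
    if (state === RECONNECTED && agent.stoppedTime > 0) {
      // The Stopped Time seen so far is reported once as Reconnection Time.
      metrics["Reconnection Time"] = agent.stoppedTime;
    }
    agent.stoppedTime = 0;
    agent.interruptedTime = 0;
  }
  return reportTimings ? metrics : {};
}
```

The counters advance even when `reportTimings` is false in this sketch, so switching the flag on mid-disruption would still yield a consistent value.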
Algorithm
The calculator runs on the MoM.
The ConnectionStatus values reported per agent by the collectors are used:
- 0 - No data
- 1 - Connected
- 2 - Intermittent issues
- 3 - Disconnected
"Alive metrics" are used to determine if the agent is actually alive, connected, and reporting:
- % Time Spent in GC
- % CPU Utilization (Host)
- Bytes In Use.
Launch Time is reported by agents and is a timestamp for the latest agent start.
There are two parts to the script's processing:
I: Receiving and saving metrics into a per agent table
As ConnectionStatus is reported by a cluster's collectors, a reconnected agent may have a Disconnected ConnectionStatus with one or more collectors and a Connected ConnectionStatus with a single collector (e.g. due to load balancing). ConnectionStatus is consolidated per agent: if the agent is connected to any collector it is Connected, otherwise the highest ConnectionStatus seen is used.
Reception of "Alive metrics" is consolidated into an Alive flag per agent: if any alive metric is received for an agent in a cycle, that agent is flagged as Alive, otherwise it is not.
For all-in-one EMs (on-prem or SaaS) consolidation is not necessary.
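The consolidation step can be sketched as follows. The function `consolidateCycle`, the sample record shape, and matching alive metrics by their bare names are assumptions for illustration. The actual calculator receives metrics through the DX APM calculator interface, which is not reproduced here.

```javascript
// Alive metric names as listed in this README; matching on the bare name
// is an assumption, real metric paths are likely longer.
const ALIVE_METRICS = ["% Time Spent in GC", "% CPU Utilization (Host)", "Bytes In Use"];

// samples: [{ agent, collector, metric, value }] for one cycle (hypothetical shape).
// Returns a Map keyed by agent with the consolidated ConnectionStatus and Alive flag.
function consolidateCycle(samples) {
  const table = new Map();
  for (const { agent, metric, value } of samples) {
    if (!table.has(agent)) {
      table.set(agent, { statuses: [], connectionStatus: null, alive: false });
    }
    const row = table.get(agent);
    if (metric === "ConnectionStatus") {
      row.statuses.push(value);          // one value per reporting collector
    } else if (ALIVE_METRICS.includes(metric)) {
      row.alive = true;                  // any alive metric flags the agent as Alive
    }
  }
  for (const row of table.values()) {
    if (row.statuses.length > 0) {
      // Connected to any collector wins; otherwise take the highest status seen.
      row.connectionStatus = row.statuses.includes(1) ? 1 : Math.max(...row.statuses);
    }
  }
  return table;
}
```

An agent with no ConnectionStatus sample in the cycle keeps `connectionStatus` as null in this sketch, which keeps "not reporting" distinct from a reported 0.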
II: Determining Connection State and Reporting Metrics
The Connection State metric is determined when an agent becomes disrupted (i.e. no alive metric has been received for the cycle):
- ConnectionStatus=1: Interrupted
- ConnectionStatus=2: Unstable
- ConnectionStatus=3: Stopped
This Connection State is kept until the agent is again seen to be Alive (i.e. Alive metrics are received). ConnectionStatus changes in the meantime are caused by grace delays in the collectors' connection management; they are ignored because they are unrelated to the cause of the connection disruption.
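The hold-until-alive behaviour can be sketched as a per-cycle transition function. The name `nextConnectionState` and the fallback to noData for other ConnectionStatus values are assumptions. The reconnected and stale states are left out because their rules are not described in this section.

```javascript
const NO_DATA = 0, ALIVE = 1, STOPPED = 3, UNSTABLE = 4, INTERRUPTED = 5;

// previousState: Connection State from the previous cycle.
// alive: consolidated Alive flag for this cycle.
// connectionStatus: consolidated ConnectionStatus for this cycle.
function nextConnectionState(previousState, alive, connectionStatus) {
  if (alive) return ALIVE;
  // Already disrupted: keep the state decided when the disruption began,
  // ignoring later ConnectionStatus changes caused by collector grace delays.
  if ([STOPPED, UNSTABLE, INTERRUPTED].includes(previousState)) {
    return previousState;
  }
  // Disruption begins this cycle: decide the state from ConnectionStatus.
  switch (connectionStatus) {
    case 1: return INTERRUPTED;
    case 2: return UNSTABLE;
    case 3: return STOPPED;
    default: return NO_DATA; // assumption: nothing usable was reported
  }
}
```

For an interrupted agent, ConnectionStatus later flips from 1 to 3 after the collector's grace period, yet the state stays at interrupted because the decision was made on the first cycle without alive metrics.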
Henrik Nissen Ravn, ESD Solution Engineering.
Copyright Broadcom 2021.
All rights reserved.
Any use requires an active DX APM license.