Solution: deploy the "AgentConnectivity" JavaScript calculator (v1.5.3)
INSTALLATION STEPS:
For APM 2x:
- Download attached AgentConnectivity.txt
- Rename it as .js, for example AgentConnectivity.js
- Log in to DX SaaS
- Go to APM > Settings > Javascript Extensions
- Click "Create New Extension"
- Follow steps as documented in Configure JavaScript Extensions
For APM 10x:
- Download attached AgentConnectivity.txt
- Rename it as .js, for example AgentConnectivity.js
- Copy the JavaScript text file into the <EM_Home>/scripts directory.
For more information, refer to Using JavaScript Calculators.
EXPLANATION OF THE SOLUTION
Disconnection and Reconnection Determination
On its own, the default ConnectionStatus metric is not sufficient to determine an agent's Connection State. The calculator therefore also uses a few “alive metrics” that cease reporting immediately when an agent disconnects: % Time Spent in GC, % CPU Utilization (Host), and Bytes In Use. These are used in conjunction with the ConnectionStatus metric to determine the agent’s Connection State:
- NoData – 0 – same as Connection Status
- Unstable – 2 – same as Connection Status
- Disconnected – 3 – same as Connection Status
- Alive – 4 – the agent is connected and alive metrics are being received
- Shutdown – 5 – the agent has been shut down in an orderly manner
- Reconnected – 6 – the agent has reconnected to another collector
- Interrupted – 7 – the agent was stopped in a disorderly manner or interrupted
- Aged Out – 8 – the agent has not reported for 24 hours
How the metrics reflect disconnection from a collector:
- ConnectionStatus changes from 1 to 3 on the relinquishing collector
- % Time Spent in GC, % CPU Utilization (Host), and Bytes In Use continue to report.
How the metrics reflect reconnection:
- ConnectionStatus=1 starts to report on the continuing collector
- An unchanged Launch Time continues to report
- % Time Spent in GC, % CPU Utilization (Host), and Bytes In Use are reported.
Disconnected state
- When % Time Spent in GC, % CPU Utilization (Host), and Bytes In Use stop reporting
Disconnection cause
- Disconnected and ConnectionStatus=1 means agent interruption
- Disconnected and ConnectionStatus=3 means agent stopped
- Disconnected and Changed agent collector means agent reconnection
Reconnection determination
- When % Time Spent in GC, % CPU Utilization (Host), and Bytes In Use start reporting
Disruption cause confirmation
- Changed Launch Time: Stopped
- Same Launch Time: Interrupted
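The Launch Time comparison above can be sketched as a small helper. This is only an illustration: the function name, the parameter names, and the null handling are assumptions and are not taken from the attached script.

```javascript
// Illustrative only: names are hypothetical, not from AgentConnectivity.js.
var STATE_STOPPED = 5;
var STATE_INTERRUPTED = 7;

// savedLaunchTime:   Launch Time recorded before the agent went silent
// currentLaunchTime: Launch Time reported once the agent reports again
function confirmDisruptionCause(savedLaunchTime, currentLaunchTime) {
    if (savedLaunchTime === null || currentLaunchTime === null) {
        return null; // cannot confirm the cause without both values
    }
    // A changed Launch Time means the agent process was restarted (Stopped);
    // an unchanged Launch Time means the same process resumed (Interrupted).
    return savedLaunchTime !== currentLaunchTime ? STATE_STOPPED : STATE_INTERRUPTED;
}
```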
Metrics Reported
Connection State is reported continuously with the state determined:
- NoData – 0
- Unstable – 2
- Disconnected – 3
- Alive – 4 – the agent is connected and alive metrics are being received
- Shutdown – 5 – the agent has been shut down in an orderly manner
- Reconnected – 6 – the agent has reconnected to another collector
- Interrupted – 7 – the agent was stopped in a disorderly manner or interrupted
- Aged Out – 8 – the agent has not reported for 24 hours
Algorithm
The calculator runs on the MoM.
The ConnectionStatus reported per agent by the collectors is used:
- 0 - No data
- 1 - Connected
- 2 - Slow
- 3 - Disconnected
A few "Alive metrics" are used to determine if the agent is actually alive, connected, and reporting:
- % Time Spent in GC
- % CPU Utilization (Host)
- Bytes In Use
- and more to cover more agents
Launch Time, as reported by agents, is also used. It is a timestamp of the latest agent start.
There are two parts to the script's processing:
I: Receiving and saving metrics into a per agent record
Because ConnectionStatus is reported by each of a cluster's collectors, a reconnected agent may have a Disconnected ConnectionStatus with its old collector and a Connected ConnectionStatus with its new collector during load balancing. ConnectionStatus is therefore consolidated per agent: if the agent is connected to any collector, it is reported as Connected; otherwise the highest ConnectionStatus seen is reported.
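A minimal sketch of this consolidation rule, assuming the statuses reported for one agent in one cycle have already been gathered into an array. The function and constant names are hypothetical and not taken from the attached script.

```javascript
// ConnectionStatus values as listed under "Algorithm".
var STATUS_NO_DATA = 0;
var STATUS_CONNECTED = 1;
var STATUS_SLOW = 2;
var STATUS_DISCONNECTED = 3;

// statuses: ConnectionStatus values reported for one agent by all collectors
// in the current cycle (hypothetical input shape).
function consolidateConnectionStatus(statuses) {
    var highest = STATUS_NO_DATA;
    for (var i = 0; i < statuses.length; i++) {
        if (statuses[i] === STATUS_CONNECTED) {
            return STATUS_CONNECTED; // connected to any collector wins
        }
        if (statuses[i] > highest) {
            highest = statuses[i]; // otherwise keep the highest value seen
        }
    }
    return highest;
}
```

For example, an agent that is being load balanced and reports 3 on its old collector and 1 on its new collector consolidates to 1 (Connected).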
Reception of "Alive metrics" is consolidated into an internal Alive flag per agent: If any alive metrics are received for an agent in a cycle, that agent is flagged as Alive, otherwise it is not.
If, at any time, alive metrics are received for an agent, that agent is from that point onwards considered in "Alive Mode".
For agents NOT in Alive Mode, agent Connection State is reported as a consolidated Connection Status (values 0 through 3) to ensure Connection State is consistently at least as useful as Connection Status. That is, if alive metrics are missing for an agent for some reason, Connection State is exactly as useful as Connection Status (and there is no reason to have alternate alerts on Connection Status). Agents in Alive Mode are readily recognized by being in Alive state (4).
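The relation between the per-cycle Alive flag and the sticky Alive Mode can be sketched as follows. The record layout and function names are assumptions made for illustration only.

```javascript
// Hypothetical per-agent record; the attached script may store more fields.
function newAgentRecord() {
    return { alive: false, aliveMode: false };
}

// Called once per agent at the start of each cycle: Alive is per cycle only.
function beginCycle(record) {
    record.alive = false;
}

// Called for every alive metric received for the agent in the current cycle.
function onAliveMetric(record) {
    record.alive = true;     // true only for the current cycle
    record.aliveMode = true; // once set, stays set for this agent
}
```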
II: Determining Connection State and Reporting Metrics
- NoData – 0
- Unstable – 2
- Disconnected – 3
- Alive – 4
- Stopped – 5
- Reconnected - 6
- Interrupted - 7
- Aged Out - 8
These rules govern how connectionState is set for an agent.
Alive metrics are additional metrics, known always to be sent by some agents, that are subscribed to in order to detect interruption of an agent's connection.
AliveMode and Alive
- AliveMode is false until an alive metric is received in some cycle, then stays true for that agent
- Alive is true iff alive metrics have been received in the current cycle for that agent
Alive (sets AliveMode = true)
- If saved and received collectors are both not null and are different => connectionState = Reconnected and saved collector = received collector
- Else connectionState = Alive
Not Alive, in Alive Mode (saved collector is retained and deadCycles++)
- If connectionStatus == Connected && saved connectionState == Alive => connectionState = Interrupted
- If connectionStatus == Disconnected && saved connectionState == Alive => connectionState = Stopped
- If connectionStatus == Unknown && saved connectionState == Alive => connectionState = Interrupted
- Saved connectionState is Stopped => connectionState = Stopped
- Saved connectionState is Interrupted => connectionState = Interrupted
- If deadCycles > limit => connectionState = AgedOut and the agent record is deleted
Not Alive, not in Alive Mode (connectionState mimics connectionStatus)
- connectionStatus == Connected => connectionState = Connected
- connectionStatus == Disconnected => connectionState = Disconnected
- connectionStatus == Unstable => connectionState = Unstable
- connectionStatus == Unknown => connectionState = NoData
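The rules above can be sketched as a single function. This is a sketch under stated assumptions, not the attached script: the record and input shapes, the names, resetting deadCycles when the agent is alive, and storing the first collector seen are all assumptions. Cases the rules do not mention (for example, a saved state of Reconnected when alive metrics stop) simply keep the saved state here.

```javascript
var STATUS = { NO_DATA: 0, CONNECTED: 1, UNSTABLE: 2, DISCONNECTED: 3 };
var STATE = {
    NO_DATA: 0, CONNECTED: 1, UNSTABLE: 2, DISCONNECTED: 3,
    ALIVE: 4, STOPPED: 5, RECONNECTED: 6, INTERRUPTED: 7, AGED_OUT: 8
};

// record: { aliveMode, deadCycles, collector, state }  (hypothetical layout)
// input:  { alive, status, collector } for the current cycle (hypothetical)
function determineState(record, input, deadCycleLimit) {
    if (input.alive) {
        record.aliveMode = true;
        record.deadCycles = 0; // assumption: silence is counted from the last alive cycle
        var moved = record.collector !== null && input.collector !== null &&
            record.collector !== input.collector;
        record.state = moved ? STATE.RECONNECTED : STATE.ALIVE;
        if (input.collector !== null) {
            record.collector = input.collector; // assumption: also stores the first collector seen
        }
        return record.state;
    }
    if (!record.aliveMode) {
        record.state = input.status; // values 0..3 mirror ConnectionStatus
        return record.state;
    }
    record.deadCycles++; // saved collector is retained
    if (record.deadCycles > deadCycleLimit) {
        record.state = STATE.AGED_OUT; // the caller is expected to delete the record
        return record.state;
    }
    if (record.state === STATE.ALIVE) {
        if (input.status === STATUS.DISCONNECTED) {
            record.state = STATE.STOPPED;
        } else if (input.status === STATUS.CONNECTED || input.status === STATUS.NO_DATA) {
            record.state = STATE.INTERRUPTED;
        }
    }
    return record.state; // Stopped and Interrupted are sticky
}
```

If the calculator is invoked every 15 seconds (an assumption about the cycle length), a 24-hour age-out corresponds to a limit of 24 * 60 * 60 / 15 = 5760 cycles.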
Supportability Metrics
Under the virtual agent "|Calculators|Agent Connectivity<version>" these metrics are reported per invocation by the calculator for overview and its management.
- Agents Monitored - number of agents monitored - identical to the size of the internal table used to hold agent records, that is a measure of retained heap across cycles
- Agents Aged Out - number of agents omitted for not reporting for 24 hours
- Metrics Consolidated - number of agents connected to multiple collectors
- Metrics Submitted - number of metrics submitted (excluding supportability)
- Processing Time (ms) - execution time
- Metrics Received - number of metrics received by the calculator.
Use these metrics to assess whether the calculator is behaving as intended or is consuming excessive resources. If there is a mismatch between expectations and reality, you may need to adjust the calculator's code (possibly with the assistance of Broadcom Solution Engineering) or consider deactivating the calculator.
Note: A calculator is always invoked once per cycle, even if no metrics pass its metrics filter. In that case, the calculator receives no metrics but is still invoked.
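As a rough sketch of how per-invocation figures such as Metrics Received, Metrics Submitted, and Processing Time (ms) could be gathered, the wrapper below times one processing pass. The wrapper, its name, and the shape of its arguments are hypothetical; the attached script may collect these values differently.

```javascript
// metrics:   array of metrics handed to the calculator in this invocation
// processFn: hypothetical function that returns the array of metrics to submit
function withSupportability(metrics, processFn) {
    var start = Date.now();
    var submitted = processFn(metrics);
    return {
        "Metrics Received": metrics.length,     // 0 when nothing passed the filter
        "Metrics Submitted": submitted.length,  // excludes supportability metrics
        "Processing Time (ms)": Date.now() - start
    };
}
```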
Under subnode "Received Metrics", the number of metrics received is reported for each of these metrics:
- Launch Time
- Connection Status
- Bytes In Use - GC Heap|bytes in Use
- Percentage of Java Heap Used - GC Monitor|percentage of Java heap used
- % CPU Utilization (Host) - Agent Stats|Resources
- % Time Spent in GC - Agent Stats|Resources
- etc.
Use these metrics to assess whether all intended metrics are being received and nothing more. If there is a mismatch between intentions and reality, you may need to adjust the calculator's code (possibly with the assistance of Broadcom Solution Engineering) or consider deactivating the calculator.
Under subnode "Agent States", the number of agents in each state is reported:
- NoData – 0
- Unstable – 2
- Disconnected – 3
- Alive – 4
- Stopped – 5
- Reconnected - 6
- Interrupted - 7
- Aged Out - 8
Use these metrics to get an overview of the number of agents in each state and, if desired, to alert on undesired states. If there is a mismatch between expectations and reality, use each agent's Connection State to identify the affected agents, then investigate further using other metrics and the agent/EM logs.
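The per-state counts can be sketched as a simple tally over the per-agent Connection State values. The input shape (a plain object mapping agent name to state value) and the function name are assumptions for illustration; values outside the listed states are ignored in this sketch.

```javascript
// State names and values as listed under "Agent States".
var STATE_NAMES = {
    0: "NoData", 2: "Unstable", 3: "Disconnected", 4: "Alive",
    5: "Stopped", 6: "Reconnected", 7: "Interrupted", 8: "Aged Out"
};

// agentStates: hypothetical map of agent name -> Connection State value.
function countAgentsByState(agentStates) {
    var counts = {};
    var code;
    for (code in STATE_NAMES) {
        if (STATE_NAMES.hasOwnProperty(code)) {
            counts[STATE_NAMES[code]] = 0; // report every state, even when empty
        }
    }
    for (var agent in agentStates) {
        if (agentStates.hasOwnProperty(agent)) {
            var name = STATE_NAMES[agentStates[agent]];
            if (name !== undefined) {
                counts[name]++;
            }
        }
    }
    return counts;
}
```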
Controlling Behavior
- includedAdditionalAliveMetrics= false or true
- This flag controls the inclusion of additional alive metrics
- reportConsolidatedConnectionStatus= false or true
- This flag controls the reporting of the consolidated Connection Status based on the OOTB ConnectionStatus.
Execution Timings
When monitoring 100 agents, the observed timings are:
- runtime: 20 ms
- runtime: 18 ms - without including additional alive metrics
- runtime: 16 ms - also without reporting Connection Status