APM OnPrem - The "AgentConnectivity" javascript calculator, a solution to the Agent "ConnectionStatus" limitations


Article ID: 224265

Products

DX Application Performance Management

Issue/Introduction

This article is about identifying, in a timely manner, whether an agent was stopped or was disconnected and then reconnected due to network/connectivity issues. That is, differentiating between:

  • An agent being orderly stopped (and then disconnecting) and then later restarted (and connecting again).

    Let's call this orderly shutdown.

  • An agent disconnecting due to network interruption and then reconnecting without the agent having stopped.

    Let's call this connection interruption.

There are currently no means in the EM or Agent to explicitly differentiate between orderly shutdown - which is most often intentional - and connection interruption - which is most often an issue.

Challenges with current ConnectionStatus metric

The current ConnectionStatus metric poses several challenges:

  1. There is a 2-minute delay in reporting a disconnect caused by an agent shutdown.
  2. There is a 20-minute delay in reporting a disconnect caused by an interruption.
  3. The ConnectionStatus metric path changes with the agent's collector, which impedes alert definitions.
  4. An agent switching between collectors temporarily has two states, Disconnected with the old collector and Connected with the new one, which impedes alert definitions.

This calculator resolves these challenges:

  1. Connection State has a stable metric path under the virtual agent | Agents
  2. It consolidates an agent's Connection Status values into a single Connection State
  3. States 1 thru 3 correspond between Connection Status and Connection State
  4. State 4, Alive, indicates that the agent is connected and alive metrics are being reported by the agent
  5. States 5 thru 8 are new, allowing identification of Shutdown (5), Reconnection (6), Interruption (7), and Age Out (8)

Agent disconnection/reconnection Scenarios.

I: Orderly Shutdown:

The agent stops due to app server stopping or IA stopping.

Metrics' reflection of disconnection

  • ConnectionStatus changes from 1 to 3 immediately.
  • ConnectionStatus=3 reports for 30 minutes, then ceases to report.
  • Launch Time ceases to report when ConnectionStatus=3 (except for 1st cycle).
  • % Time Spent in GC, % CPU Utilization (Host), or Bytes In Use (and other alive metrics), whichever the agent reports, stop reporting immediately.

Metrics' reflection of reconnection:

  • ConnectionStatus=1 resumes reporting; this may happen after Launch Time resumes.
  • A changed Launch Time resumes reporting; this may happen after ConnectionStatus=1 resumes.

II: Connection Interruption:

The connection to the agent is lost for whatever reason (crash, network, VPN, etc – anything but orderly shutdown).

Metrics' reflection of disconnection

  • ConnectionStatus=1 continues to report for 20 mins, then reports 3 for 30 mins, then ceases to report.
  • Launch Time ceases to report when ConnectionStatus=3 (except for 1st cycle).
  • % Time Spent in GC and
  • % CPU Utilization (Host) and
  • Bytes In Use all stop reporting immediately on interruption.

Metrics' reflection of reconnection

  • % Time Spent in GC and
  • % CPU Utilization (Host) and
  • Bytes In Use all start reporting.
  • ConnectionStatus=1 resumes reporting; this may happen after Launch Time resumes.
  • The same Launch Time resumes reporting; this may happen after ConnectionStatus=1 resumes.

III: Load Balancing:

The agent is reconnected to another collector on:

  1. Load change (aka rebalancing)
  2. loadbalancing.xml change
  3. Collector stopped
  4. Collector-MoM connection lost

The agent reconnects to another collector via the MoM.

Notes

  • ConnectionStatus metric not reporting means no data received, as opposed to reporting ConnectionStatus=0, No Data status.
  • % Time Spent in GC metric is available for Java and Infrastructure agents.
  • % CPU Utilization (Host) metric is available for .Net agents.
  • Bytes In Use is available for the EPA and the DxC Agent.
  • All three alive metrics above are of metric types that cease to report immediately on agent disruption, whether stopped or interrupted (metric types LongFluctuatingCounter and IntegerPercentage).
  • Agent collector is available as part of the agent specifier for ConnectionStatus metric for clustered EMs.

Environment

  • Valid for: APM on-premise versions

Resolution

Solution: deploy the "AgentConnectivity" JavaScript calculator (v1.5.3)
 

INSTALLATION STEPS:

 

For APM 2x:

  1. Download attached AgentConnectivity.txt
  2. Rename it with a .js extension, for example AgentConnectivity.js
  3. Log in to DX SaaS
  4. Go to APM > Settings > Javascript Extensions
  5. Click "Create New Extension"
  6. Follow steps as documented in Configure JavaScript Extensions 
 

For APM 10x:

  1. Download attached AgentConnectivity.txt
  2. Rename it as .js, for example AgentConnectivity.js
  3. Copy the JavaScript text file into the <EM_Home>/scripts directory.

    For more information refer to Using JavaScript Calculators
 

EXPLANATION OF THE SOLUTION

Disconnection and Reconnection Determination

On its own, the default ConnectionStatus metric is not sufficient to determine an agent's Connection State. Instead, the calculator uses a few “alive metrics” that cease reporting immediately when an agent disconnects: % Time Spent in GC, % CPU Utilization (Host), or Bytes In Use, depending on the agent type. These are used in conjunction with the ConnectionStatus metric to determine the agent’s Connection State:

  • NoData – 0 (same as Connection Status)
  • Unstable – 2 (same as Connection Status)
  • Disconnected – 3 (same as Connection Status)
  • Alive – 4 – The agent is connected and alive metrics are being received
  • Shutdown – 5 – The agent has been shut down in an orderly manner
  • Reconnected – 6 – The agent has reconnected to another collector
  • Interrupted – 7 – The agent was stopped in a disorderly manner or interrupted
  • Aged Out – 8 – The agent hasn't reported for 24 hours

Metrics’ reflection of disconnection when an agent moves between collectors

  • ConnectionStatus changes from 1 to 3 on the relinquishing collector
  • % Time Spent in GC, % CPU Utilization (Host), or Bytes In Use continues to report.

Metrics’ reflection of reconnection:

  • ConnectionStatus=1 starts to report on the continuing collector
  • Unchanged Launch Time continues to report
  • % Time Spent in GC, % CPU Utilization (Host), or Bytes In Use continues to report.

Disconnected state

  • When % Time Spent in GC, % CPU Utilization (Host), or Bytes In Use (whichever the agent reports) stops reporting

Disconnection cause

  • Disconnected and ConnectionStatus=1 means agent interruption
  • Disconnected and ConnectionStatus=3 means agent stopped
  • Disconnected and Changed agent collector means agent reconnection

Reconnection determination

  • When % Time Spent in GC, % CPU Utilization (Host), or Bytes In Use (whichever the agent reports) starts reporting

Disruption cause confirmation

  • Changed Launch Time: Stopped
  • Same Launch Time: Interrupted
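
The cause rules above can be sketched as two small functions. This is only an illustration of the stated rules, not code taken from the attached calculator: the function names, parameter names, the "Unknown" fallback, and the order in which the checks are applied are all assumptions. ConnectionStatus values use the 1 = Connected, 3 = Disconnected coding described in this article.

```javascript
// Hypothetical helpers; names are NOT taken from AgentConnectivity.js.
var STATUS_CONNECTED = 1;    // ConnectionStatus=1
var STATUS_DISCONNECTED = 3; // ConnectionStatus=3

// Evaluated in the cycle where alive metrics stop arriving for an agent.
function guessDisconnectionCause(connectionStatus, savedCollector, currentCollector) {
  if (savedCollector && currentCollector && savedCollector !== currentCollector) {
    return "Reconnected"; // the agent collector changed
  }
  if (connectionStatus === STATUS_DISCONNECTED) {
    return "Stopped";     // the collector saw the agent disconnect
  }
  if (connectionStatus === STATUS_CONNECTED) {
    return "Interrupted"; // the collector still reports the agent as connected
  }
  return "Unknown";       // assumption: the rules above do not cover this case
}

// Evaluated once alive metrics resume: compare Launch Time before and after.
function confirmDisruptionCause(savedLaunchTime, currentLaunchTime) {
  if (savedLaunchTime == null || currentLaunchTime == null) {
    return "Unknown";     // nothing to compare against
  }
  return savedLaunchTime !== currentLaunchTime ? "Stopped" : "Interrupted";
}
```

For example, an agent whose Launch Time is unchanged after alive metrics resume would be confirmed as Interrupted, while a changed Launch Time would confirm a restart.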

Metrics Reported

Connection State is reported continuously with the state determined:

  • NoData – 0
  • Unstable – 2
  • Disconnected – 3
  • Alive – 4 — the agent is connected and alive metrics are being received
  • Shutdown – 5 - The agent has been orderly shutdown
  • Reconnected - 6 - The agent has reconnected to another collector
  • Interrupted - 7 - The agent is disorderly stopped or interrupted
  • Aged Out - 8 - The agent hasn't reported for 24 hours

Algorithm

The calculator runs on the MoM.

The ConnectionStatus reported per agent by the collectors is used:

  • 0 - No data
  • 1 - Connected
  • 2 - Slow
  • 3 - Disconnected

A few "Alive metrics" are used to determine if the agent is actually alive, connected, and reporting:

  • % Time Spent in GC
  • % CPU Utilization (Host)
  • Bytes In Use
  • and more to cover more agents

Launch Time, reported by agents, is also used. It is a timestamp of the latest agent start.

There are two parts to the script's processing:

I: Receiving and saving metrics into a per agent record

Because ConnectionStatus is reported by a cluster's collectors, a reconnected agent may have a Disconnected ConnectionStatus with its old collector and a Connected ConnectionStatus with its new collector during load balancing. ConnectionStatus is therefore consolidated per agent: if the agent is connected to any collector, Connected is reported; otherwise the highest ConnectionStatus seen is reported.
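
A minimal sketch of this consolidation rule follows. It is not the attached script's code: the function name and the idea of passing one agent's per-collector values as an array are assumptions, as is returning 0 (No data) when nothing was reported.

```javascript
// Hypothetical function; `statuses` holds the ConnectionStatus values
// (0 = No data, 1 = Connected, 2 = Slow, 3 = Disconnected) reported for
// ONE agent by different collectors in the same cycle.
function consolidateConnectionStatus(statuses) {
  var highest = 0; // assumption: no reports at all is treated as No data
  for (var i = 0; i < statuses.length; i++) {
    if (statuses[i] === 1) {
      return 1; // connected to any collector means Connected
    }
    if (statuses[i] > highest) {
      highest = statuses[i]; // otherwise keep the highest value seen
    }
  }
  return highest;
}
```

With this rule, an agent reported as 3 by its old collector and 1 by its new collector consolidates to 1, so a collector switch does not look like a disconnect.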

Reception of "Alive metrics" is consolidated into an internal Alive flag per agent: If any alive metrics are received for an agent in a cycle, that agent is flagged as Alive, otherwise it is not.

If, at any time, alive metrics are received for an agent, that agent is from that point onwards considered in "Alive Mode".

For agents NOT in Alive Mode, agent Connection State is reported as a consolidated Connection Status (values 0 thru 3) to ensure Connection State is consistently at least as useful as Connection Status. That is, if alive metrics are missing for an agent for some reason, Connection State is exactly as useful as Connection Status (and there is no reason to have alternate alerts on Connection Status). Agents in Alive Mode are readily recognized by being in Alive state (4).
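
The per-cycle Alive flag and the sticky Alive Mode can be pictured with a small per-agent record. The sketch below is an assumption about how such bookkeeping could look; the table, the field names, and the function names are hypothetical and are not taken from the attached script.

```javascript
// Hypothetical table of agent records, kept across calculator cycles.
var agentRecords = {};

function getRecord(agentName) {
  if (!agentRecords[agentName]) {
    agentRecords[agentName] = { aliveMode: false, alive: false };
  }
  return agentRecords[agentName];
}

// Start of a cycle: no alive metric has been received yet for any agent.
function beginCycle() {
  for (var name in agentRecords) {
    if (Object.prototype.hasOwnProperty.call(agentRecords, name)) {
      agentRecords[name].alive = false;
    }
  }
}

// Called for every alive metric received during the cycle.
function onAliveMetric(agentName) {
  var record = getRecord(agentName);
  record.alive = true;     // alive in the current cycle
  record.aliveMode = true; // sticky: never reset once set
}
```

The distinction matters because `alive` answers "did the agent report this cycle?", while `aliveMode` answers "has this agent ever reported alive metrics?", which decides whether states 4 and above apply to it at all.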

II: Determining Connection State and Reporting Metrics

  • NoData – 0
  • Unstable – 2
  • Disconnected – 3
  • Alive – 4
  • Stopped – 5
  • Reconnected - 6
  • Interrupted - 7
  • Aged Out - 8

These rules govern the connectionState setting for an agent:

Alive metrics are additional metrics, known to always be sent by certain agents, that the calculator subscribes to in order to detect agent connection interruption.

  1. AliveMode and Alive

    1. AliveMode is false until an alive metric is received in some cycle, then stays true for an agent
    2. Alive is true iff alive metrics have been received in current cycle for an agent
  2. Alive: => AliveMode= true

    1. If saved and received collectors are both not null and are different => State= Reconnected and saved collector= received collector
    2. Else State= Alive
  3. Not Alive

    1. If Alive mode - saved collector is retained and deadCycles++
      1. If connectionStatus == connected && saved connectionState == Alive => connectionState= Interrupted
      2. If connectionStatus == disconnected && saved connectionState == Alive => connectionState= Stopped
      3. If connectionStatus == unknown && saved connectionState == Alive => connectionState= Interrupted
      4. Saved connectionState is Stopped => connectionState= Stopped
      5. Saved connectionState is Interrupted => connectionState= Interrupted
      6. If deadCycles > limit => connectionState= AgedOut and agentRecord deleted
    2. If Not Alive mode - connectionState mimics connectionStatus
      1. connectionStatus == connected => connectionState= connected
      2. connectionStatus == disconnected => connectionState= disconnected
      3. connectionStatus == unstable => connectionState= unstable
      4. connectionStatus == unknown => connectionState= NoData
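
The rules above can be read as a small state machine. The following is a sketch under stated assumptions rather than the calculator's actual code: the record fields, the `determineState` function, and the `input` shape are hypothetical; Connected = 1 is inferred from "values 0 thru 3"; and the dead-cycle limit assumes a 15-second cycle, which gives 5760 cycles in 24 hours. Cases the rules do not spell out are marked in the comments.

```javascript
var STATE = {
  NoData: 0, Connected: 1, Unstable: 2, Disconnected: 3, Alive: 4,
  Stopped: 5, Reconnected: 6, Interrupted: 7, AgedOut: 8
};

// Assumption: 15-second cycles, so 24 hours = 5760 cycles.
var DEAD_CYCLE_LIMIT = (24 * 60 * 60) / 15;

// record: { aliveMode, state, collector, deadCycles } kept across cycles.
// input:  { alive, connectionStatus, collector } for the current cycle.
function determineState(record, input) {
  if (input.alive) {
    // Rule 2: alive metrics received in this cycle.
    record.aliveMode = true;
    record.deadCycles = 0;
    var moved = record.collector != null && input.collector != null &&
                record.collector !== input.collector;
    if (input.collector != null) {
      record.collector = input.collector;
    }
    record.state = moved ? STATE.Reconnected : STATE.Alive;
    return record.state;
  }
  if (!record.aliveMode) {
    // Not in Alive mode: mimic the consolidated ConnectionStatus.
    var s = input.connectionStatus;
    record.state = (s === 1 || s === 2 || s === 3) ? s : STATE.NoData;
    return record.state;
  }
  // Alive mode but no alive metrics this cycle: saved collector is retained.
  record.deadCycles = (record.deadCycles || 0) + 1;
  if (record.deadCycles > DEAD_CYCLE_LIMIT) {
    record.state = STATE.AgedOut; // the caller would delete the record
    return record.state;
  }
  if (record.state === STATE.Alive) {
    // Disconnected means Stopped; connected or unknown means Interrupted.
    // Assumption: any other status (e.g. unstable) is also treated as Interrupted.
    record.state = input.connectionStatus === STATE.Disconnected
      ? STATE.Stopped : STATE.Interrupted;
  }
  // Saved Stopped or Interrupted is retained. Other saved states are also
  // retained here, which the rules above do not specify.
  return record.state;
}
```

Keeping Stopped and Interrupted "sticky" until alive metrics resume is what lets an alert on state 7 stay raised even after the collector's own ConnectionStatus flips to 3 and then stops reporting.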

Supportability Metrics

Under the virtual agent "|Calculators|Agent Connectivity<version>", the calculator reports these metrics on each invocation, giving an overview of its operation and supporting its management.

  • Agents Monitored - number of agents monitored - identical to the size of the internal table used to hold agent records, that is, a measure of retained heap across cycles
  • Agents Aged Out - number of agents dropped after not reporting for 24 hours
  • Metrics Consolidated - number of agents connected to multiple collectors
  • Metrics Submitted - number of metrics submitted (excluding supportability)
  • Processing Time (ms) - execution time
  • Metrics Received - number of metrics received by the calculator.

Use these metrics to assess whether the calculator is behaving as intended or is consuming excessive resources. If there is a mismatch between expectations and reality, you may need to adjust the calculator's code (possibly with the assistance of Broadcom Solution Engineering) or consider deactivating the calculator.

Note: A calculator is always invoked once per cycle, even if no metrics pass its metrics filter. In that case, the calculator receives no metrics but is still invoked.

Under the subnode "Received Metrics", the number of times each of these metrics was received is reported:

  • Launch Time
  • Connection Status
  • Bytes In Use - GC Heap|bytes in Use
  • Percentage of Java Heap Used - GC Monitor|percentage of Java heap used
  • % CPU Utilization (Host) - Agent Stats|Resources| PctTimeSpentInGc= ":% Time Spent in GC",
  • etc.

Use these metrics to assess whether all intended metrics are being received and nothing more. If there is a mismatch between intentions and reality, you may need to adjust the calculator's code (possibly with the assistance of Broadcom Solution Engineering) or consider deactivating the calculator.

Under the subnode "Agent States", the number of agents in each state is reported:

  • NoData – 0
  • Unstable – 2
  • Disconnected – 3
  • Alive – 4
  • Stopped – 5
  • Reconnected - 6
  • Interrupted - 7
  • Aged Out - 8

Use these metrics to get an overview of the number of agents by state and, if desired, to alert on undesired states. If there is a mismatch between expectations and reality, use each agent's Connection State to identify the affected agents, then investigate further using other metrics and the agent/EM logs.
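
As an illustration of how per-state counts relate to the per-agent Connection State, the sketch below tallies a table of agent records by state and lists the agents in a given state. The record shape and both function names are hypothetical; only the state codes come from the list above.

```javascript
// State codes as listed under "Agent States" in this article.
var STATE_NAMES = {
  0: "NoData", 2: "Unstable", 3: "Disconnected", 4: "Alive",
  5: "Stopped", 6: "Reconnected", 7: "Interrupted", 8: "Aged Out"
};

// records: hypothetical map of agent name -> { state: <code> }.
function countAgentsByState(records) {
  var counts = {};
  var code, agent, name;
  for (code in STATE_NAMES) {
    if (Object.prototype.hasOwnProperty.call(STATE_NAMES, code)) {
      counts[STATE_NAMES[code]] = 0; // report every state, even when empty
    }
  }
  for (agent in records) {
    if (Object.prototype.hasOwnProperty.call(records, agent)) {
      name = STATE_NAMES[records[agent].state];
      if (name !== undefined) {
        counts[name] += 1;
      }
    }
  }
  return counts;
}

// Lists the agents currently in one state, e.g. 7 (Interrupted).
function agentsInState(records, stateCode) {
  var result = [];
  for (var agent in records) {
    if (Object.prototype.hasOwnProperty.call(records, agent) &&
        records[agent].state === stateCode) {
      result.push(agent);
    }
  }
  return result.sort();
}
```

A non-zero Interrupted count would be the trigger to look at which agents are in state 7 and then at their logs.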

Controlling Behavior

  • includedAdditionalAliveMetrics= false or true
    • This flag controls the inclusion of additional alive metrics
  • reportConsolidatedConnectionStatus= false or true
    • This flag controls the reporting of the consolidated Connection Status based on the OOTB ConnectionStatus.

Execution Timings

With 100 agents monitored, the observed timings were:

  • runtime: 20 ms
  • runtime: 18 ms - without including additional alive metrics
  • runtime: 16 ms - also without reporting Connection Status

Attachments

1685969865137__AgentConnectivity.txt