UMA on APM 10.7 does not report metrics from Openshift Agent


Article ID: 209363


Updated On:

Products

DX Application Performance Management

Issue/Introduction

 
 

Strange behaviour observed with UMA v20.4.0.17:

For a service whose metrics were being reported by the OpenShift agent, the latest pod replica was not captured and was not connected to its node.

Only the running pods are visible under the Infrastructure Agent (IA)

but not under the node, unlike this previous pod replica:

 

Cause

 

The clusterinfo pod could not be contacted.

The Java autoattach module retrieves pod-related data from clusterinfo before attaching agents; since clusterinfo was not returning any data, the attach module was failing.

clusterinfo was not responding because of known issues in this older release: exceptions in clusterinfo's internal threads, such as the following:

java.lang.NullPointerException
 at com.ca.apm.broadcom.controllers.cluster.ClusterMetadataController.lambda$1(ClusterMetadataController.java:133)
 at java.lang.Iterable.forEach(Iterable.java:75)
 at com.ca.apm.broadcom.controllers.cluster.ClusterMetadataController.lambda$0(ClusterMetadataController.java:125)
 at java.util.HashMap.forEach(HashMap.java:1289)
 at com.ca.apm.broadcom.controllers.cluster.ClusterMetadataController.getPodMetricsForNode(ClusterMetadataController.java:121)
 at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)

The exceptions seen in the clusterinfo log have been fixed in the latest releases.
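To confirm this failure mode, check the clusterinfo pod's log for the exception. A minimal sketch, assuming clusterinfo runs as a deployment named `clusterinfo` (adjust the name and namespace to your environment); the `grep` step is demonstrated on a captured log excerpt:

```shell
# Against a live cluster (resource name and namespace are assumptions):
#   oc logs deployment/clusterinfo -n <uma-namespace> | grep -c 'NullPointerException'
#
# The same grep demonstrated on a captured log excerpt:
log='java.lang.NullPointerException
 at com.ca.apm.broadcom.controllers.cluster.ClusterMetadataController.getPodMetricsForNode(ClusterMetadataController.java:121)'
printf '%s\n' "$log" | grep -c 'NullPointerException'
```

A non-zero count indicates the internal-thread exceptions described above are present.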

Environment

Release : 10.7.0

Component : APM Agents

Resolution

 
 

The suggestion was to redeploy the clusterinfo pod with a liveness probe:

livenessProbe:
  httpGet:
    path: /up
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 120

A version of clusterinfo.yaml for 20.4 with this change is attached.
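For context, a sketch of where the probe sits inside the clusterinfo Deployment's container spec (the container name and image below are placeholders; the attached clusterinfo.yaml is the authoritative version):

```yaml
# Fragment of a Deployment pod spec; only the probe lines are the point here.
spec:
  containers:
    - name: clusterinfo           # placeholder container name
      image: <clusterinfo-image>  # placeholder image
      ports:
        - containerPort: 8080
      livenessProbe:
        httpGet:
          path: /up               # health endpoint polled by the kubelet
          port: 8080
        initialDelaySeconds: 60   # wait before the first check
        periodSeconds: 120        # re-check every 2 minutes
```

If the `/up` endpoint stops responding, the kubelet restarts the container automatically, which recovers clusterinfo without manual intervention.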

Additional Information

 

As the user has been asked to add the livenessProbe, if the issue happens again in their environment the pod will be restarted automatically.

Attachments

clusterinfo (1)_1616181722564.yaml