The following is a high-level list of techniques and suggestions to employ when troubleshooting UMA performance, display, and configuration issues.
DX APM Agent 20.2, SaaS
Suggestion#1: Check if EM or Agent metric clamps have been reached.
a) To check the EM clamps: open the Metric Browser and expand the branch
Custom Metric Host (virtual) | Custom Metric Process (virtual) | Custom Metric Agent (virtual)([email protected])(SuperDomain) | Enterprise manager | Connections
Look at the values for:
- "EM Historical Metric Clamped"
- "EM Live Metric Clamped"
The above metrics should all be 0.
b) To check the Agent clamp: expand the branch
Custom Metric Host (virtual) |Custom Metric Process (virtual) | Custom Metric Agent (virtual)([email protected])(SuperDomain) |Agents | Host | Process |<AgentName>
Look at the value of the "is Clamped" metric; it should be 0.
Suggestion#2: Restart UMA
There is no restart script; instead, you need to delete all existing UMA pods as shown below:
kubectl get pods -n caapm
then delete each pod using:
kubectl delete pod <podname> -n caapm
NOTE: the pods can be deleted in any order.
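Alternatively, assuming every pod in the caapm namespace belongs to UMA, you can delete them all with a single command:
kubectl delete pods --all -n caapm   # assumes the caapm namespace contains only UMA pods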
Suggestion: Check whether the clusterinfo pod is failing because of a permission issue.
You can use the below steps to verify this condition and fix the problem:
1) Check for errors in the <clusterinfo-pod-name> pod:
oc logs <clusterinfo-pod-name>
2) Check the clusterInfo log from inside the pod:
oc rsh <clusterinfo-pod-name>
cd /tmp
cat clusterInfo.log
NOTE: If you cannot log in to the pod, try to restart it using: oc delete po <clusterinfo-pod-name>. Not being able to log in is also an indication of a configuration issue.
Here is an example of the message that confirms the permission issue:
WARN io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.onFailure - Exec Failure: HTTP 401, Status: 401 - Unauthorized
Solution:
1. Download and copy the attached clusterroles_uma_caapm.yaml to your OpenShift cluster
2. Recreate UMA cluster roles:
oc delete -f clusterroles_uma_caapm.yaml -n caapm
oc create -f clusterroles_uma_caapm.yaml -n caapm
oc delete pod <clusterinfo-pod-name>
oc delete pod <container-monitor-pod-name>
3. Verify that <clusterinfo-pod-name> is no longer restarting and that the "Unauthorized" error is no longer reported in the clusterInfo.log
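To confirm the pod is stable, you can check its restart count with a generic command such as:
oc get pods -n caapm | grep clusterinfo   # the grep pattern assumes the default pod naming shown above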
To confirm that the Java agent is being injected, review the app-container-monitor pod logs.
The expected message confirming that the agent is injected into the pod is:
7/27/22 04:53:09 PM GMT [INFO] [IntroscopeAgent.AutoAttach.Java.UnixContainerAttacher] Attach successful for pid 1 in container [ namespace/dockerapp pod wlp-845c4ffcd-8tq4z container/wlp id/0ca17ae4ea2b437687f9fcc880dc6203a93054a4afbb2d1ff9c735e7348a0914 ]
In this example, you can connect to the pod and confirm that the Wily Java agent has been added under the /tmp/ca-deps/wily/ directory, as shown below:
kubectl exec -ti wlp-845c4ffcd-8tq4z -n dockerapp -- bash
cd /tmp/ca-deps/wily/
ls
Agent.jar common core examples logs
AgentNoRedefNoRetrans.jar connectors deploy extensions tools
cd logs
ls -l
total 1496
-rw-r-----. 1 root root 1370757 Jul 27 16:56 AutoProbe.log
-rw-r-----. 1 root root 159419 Jul 27 16:56 IntroscopeAgent.log
Checklist:
1) Check for possible Memory issues
[INFO] [IntroscopeAgent.AutoAttach.Java.UnixDockerAttacher] Not enough free memory available on host to attach to unbounded container [ namespace/digital-factory-uat pod/logstash-sync-db-to-elk-5cc7445d7d-zxrd6 container/logstash-sync-db-to-elk id/56f7bfa2617f2e0f26aa72c8906ee95de9db813c9571ba272b689ae0d87ea310 ]. Skipping attach
identified as Java process in container [ namespace/digital-factory-uat pod/elasticsearch-master-2 container/elasticsearch id/a3004d2bdddd9e246dc9109614f4441d7ddff98716e73e0470afb9cfc2979a49 ]
3/23/21 09:00:19 AM GMT [INFO] [IntroscopeAgent.AutoAttach.Java.UnixDockerAttacher] Container a3004d2bdddd9e246dc9109614f4441d7ddff98716e73e0470afb9cfc2979a49 has lesser memory than configured free memory threshold of 50.0%, Skipping attach
Recommendation:
Change the default memory threshold from 50% to 25% by changing the value of the environment variable shown below:
- name: apmenv_autoattach_free_memory_threshold
value: "25.00"
When using the Operator, you cannot change anything on the UMA side because the Operator will revert the change. In this case, set the annotation at the application pod or deployment level as shown below:
oc annotate pod <pod-name> ca.broadcom.com/autoattach.java.attach.overrides=autoattach.free.memory.threshold=20 -n <app-ns> --overwrite
oc annotate deployment <deployment-name> ca.broadcom.com/autoattach.java.attach.overrides=autoattach.free.memory.threshold=20 -n <app-ns> --overwrite
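To confirm the annotation was applied, you can describe the pod and search for it (a generic check, not specific to UMA):
oc describe pod <pod-name> -n <app-ns> | grep autoattach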
2) Check for a possible unsupported JVM
[INFO] [IntroscopeAgent.AutoAttach.Java.UnixDockerAttacher] Process 1 in container [ namespace/digital-factory-uat pod/payment-jobs-7dcd9fd4fc-klm9r container/payment id/636dbbc6da32ff484798b67760810b05027907e190a4d72cc72876d08756472e ] is an unsupported JVM. Skipping attach. JVMInfo: JVMInfo{ binaryPath='/usr/lib/jvm/java-1.8-openjdk/jre/bin/java', vendorName='IcedTea', vmName='OpenJDK 64-Bit Server VM', runtimeVersion='1.8.0_212-b04', specificationVersion='8' }
OR
[INFO] [IntroscopeAgent.AutoAttach.Java.UnixContainerAttacher] Could not retrieve tools.jar in container [ namespace...9d3ddc58 ], please set autoattach.java.tools.repo.url property via annotation or autoattach property and restart app container. See details in the APM documentation for use.
1/19/23 10:24:35 AM GMT [INFO] [IntroscopeAgent.AutoAttach.Java.UnixContainerAttacher] If this is WebSphere Liberty container, please use annotation ca.broadcom.com/autoattach.java.attach.overrides: autoattach.java.filter.jvms=false
Recommendation:
Add the below env to the podmonitor container (in the same section where the memory threshold env above is present). This makes UMA try to attach the Java agent to containers that use unsupported JVMs.
- name: apmenv_autoattach_java_filter_jvms
value: "false"
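For reference, a minimal sketch of where this env belongs in the UMA daemonset yaml (placement only; the real podmonitor container definition has additional fields that are omitted here):
containers:
  - name: podmonitor   # container name as referenced elsewhere in this article
    env:
      - name: apmenv_autoattach_java_filter_jvms
        value: "false"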
When using the Operator, you cannot change anything on the UMA side because the Operator will revert the change. In this case, set the annotation at the application pod or deployment level as shown below:
oc annotate pod <pod-name> ca.broadcom.com/autoattach.java.attach.overrides=autoattach.java.filter.jvms=false -n <app-ns> --overwrite
oc annotate deployment <deployment-name> ca.broadcom.com/autoattach.java.attach.overrides=autoattach.java.filter.jvms=false -n <app-ns> --overwrite
3) Check whether the Java agent cannot be injected because of a permission issue
This happens when a non-root user is not able to create a new directory in the pod to copy the Java agent into.
Recommendation:
"exec" into the container and then create a folder like /opt (or anything else) and then use the below annotation so Java agent is deployed in that folder:
kubectl annotate pod <application podname> ca.broadcom.com/autoattach.java.attach.overrides=autoattach.java.agent.deps.directory=/opt
If that works, modify the Docker application image(s) to make room for a writable directory so UMA can use it to inject the agent.
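A minimal sketch of the exec and annotate steps above, assuming /opt is the chosen directory and the application runs in namespace <app-ns>:
kubectl exec <application-podname> -n <app-ns> -- mkdir -p /opt   # create the target folder inside the container
kubectl annotate pod <application-podname> -n <app-ns> ca.broadcom.com/autoattach.java.attach.overrides=autoattach.java.agent.deps.directory=/opt --overwrite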
4) Check if the issue is related to java itself
9/07/21 06:33:09 AM GMT [INFO] [IntroscopeAgent.AutoAttach.Java.UnixDockerEnricher] Process 1 in container [ namespace/tams-test pod/tams--437- id/8cc8bc8221f6e0aa15873ea8e582158c083a4a1781e7295c3ede34ea7d2e6f7f ] could not get jvm information. Skipping attach
Recommendation:
Exec into the pod and try to run java to make sure it executes successfully. A failing java command explains the above message: the Java agent could not be added to the container.
In this specific use case, the solution was to remove JAVA_TOOL_OPTIONS. Contact your application team to fix this kind of Java issue.
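A minimal sketch of this check, using generic commands (replace the pod name and namespace with your own):
kubectl exec <application-podname> -n <app-ns> -- java -version
kubectl exec <application-podname> -n <app-ns> -- env | grep JAVA_TOOL_OPTIONS   # relevant only because JAVA_TOOL_OPTIONS caused the issue above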
Checklist:
1) The clusterinfo pod reports "OutOfMemoryError: unable to create new native thread":
20-05-2021 10:37:56 [pool-5-thread-8] ERROR c.c.a.b.s.OpenshiftClusterCrawlerService.watchDeploymentConfigs - error occurred in watchDeploymentConfigs, null
Exception in thread "OkHttp Dispatcher" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
Solution:
Insufficient memory was allocated to the clusterinfo Java process. Increase the max heap to 1024m in the line below (part of the clusterinfo deployment configuration in the UMA yaml file) and redeploy UMA. Changing the memory should resolve the issue.
command: ["/usr/local/openshift/apmia/jre/bin/java", "-Xms64m","-Xmx1024m", "-Dlogging.config=file:/usr/local/openshift/logback.xml", "-jar", "/clusterinfo-1.0.jar"]
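After redeploying, you can confirm the new heap value was picked up by inspecting the container command (a generic check, assuming clusterinfo is the first container in its pod; adjust the index if needed):
oc get pod <clusterinfo-pod-name> -n caapm -o jsonpath='{.spec.containers[0].command}'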
2) The container-monitor pod reports a GraphSender NullPointerException:
oc logs pod/container-monitor-7dcdbc5fb8-6vcvq
5/21/21 12:48:20 PM UTC [ERROR] [IntroscopeAgent.GraphSender] error occurred while sending graph to EM, null
java.lang.NullPointerException
at com.ca.apm.clusterdatareporter.K8sMetaDataGraphAttributeDecorator.getGraph(K8sMetaDataGraphAttributeDecorator.java:107)
at java.lang.Iterable.forEach(Iterable.java:75)
at com.ca.apm.clusterdatareporter.K8sMetaDataGraphAttributeDecorator.getGraph(K8sMetaDataGraphAttributeDecorator.java:93)
at com.ca.apm.clusterdatareporter.K8sMetadataGraphHelper$GraphSender.sendGraph(K8sMetadataGraphHelper.java:601)
at com.ca.apm.clusterdatareporter.K8sMetadataGraphHelper$GraphSender.sendGraphInBatches(K8sMetadataGraphHelper.java:580)
at com.ca.apm.clusterdatareporter.K8sMetadataGraphHelper$GraphSender.lambda$1(K8sMetadataGraphHelper.java:548)
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Reason:
If you are using a 10.7 EM, this error can be ignored; there is no loss of functionality.
3) The below error is reported continuously (every 2 minutes):
5/25/21 08:48:20 AM UTC [ERROR] [IntroscopeAgent.GraphSender] error occurred while sending graph to EM, null
java.lang.NullPointerException
Solution:
If you are using SaaS APM, set agentManager_version to an empty value (i.e. "") by changing the parameter to agentManager_version: "" in the yaml.
NOTE: If you are using APM EM 10.7, you need to set the version to 10.7 instead, as in the example below. This is required to allow UMA to connect to APM EM 10.7. This property is the equivalent of "introscope.agent.connection.compatibility.version" in the Java agent.
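For example (assuming the same agentManager_version parameter in the UMA yaml):
agentManager_version: "10.7"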
If you notice many app-container-monitor pods reporting the above message:
This is a known issue fixed in 21.4.
Recommendation: upgrade to UMA 21.11 or a later version.
Collect logs from the following pods:
- app-container-monitor-* (there should be 1 pod for each node)
- cluster-performance-prometheus-*
- clusterinfo-*
- container-monitor-*
Here is an example of the commands (if you are using OpenShift, you can use the "oc" command instead):
kubectl logs <app-container-pod-name> --all-containers -ncaapm
kubectl logs <cluster-performance-prometheus> --all-containers -ncaapm
kubectl logs <clusterinfo-pod-name> --all-containers -ncaapm
kubectl logs <container-monitor-pod-name> --all-containers -ncaapm
First, check the podmonitor container in the daemonset pods; there should not be any connection errors:
kubectl logs <app-container-pod-name> -c podmonitor -n caapm
NOTE: If the issue is related to the Java agent not getting injected as expected, the most important log to collect is from the <app-container-pod-name> pod. There should be one app-container-monitor pod on each node, so make sure to collect the log from the node where the issue is happening (the node where your Java application is running).
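To identify which app-container-monitor pod runs on the same node as your application, you can compare the NODE column of both (generic kubectl commands):
kubectl get pod <your-application-pod-name> -n <app-ns> -o wide
kubectl get pods -n caapm -o wide | grep app-container-monitor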