The following is a high-level list of techniques and suggestions to employ when troubleshooting UMA performance, display, and configuration issues
DX APM
The official UMA Troubleshooting section is available here
IMPORTANT:
Suggestion#1: Check if EM or Agent metric clamps have been reached.
a) To check the EM clamps: open the Metric Browser and expand the branch
Custom Metric Host (virtual) | Custom Metric Process (virtual) | Custom Metric Agent (virtual)(collector_host@port)(SuperDomain) | Enterprise manager | Connections
Look at the values of:
- "EM Historical Metric Clamped"
- "EM Live Metric Clamped"
The above metrics should all be 0.
b) To check the Agent clamp: expand the branch
Custom Metric Host (virtual) |Custom Metric Process (virtual) | Custom Metric Agent (virtual)(collector_host@port)(SuperDomain) |Agents | Host | Process |<AgentName>
Look at the value of the "Is Clamped" metric; it should be 0.
Suggestion#2: Restart UMA
There is no restart script; instead, delete all existing UMA pods as shown below:
kubectl get pods -n caapm
delete all pods using:
kubectl delete pod <podname> -n caapm
NOTE: the pods can be deleted in any order.
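If you prefer a single command, and assuming the caapm namespace contains only UMA pods, you can also delete all pods at once; Kubernetes recreates them automatically from their deployments/daemonsets:
kubectl delete pods --all -n caapm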
Suggestion#3: Check whether the clusterinfo pod is failing because of a permission or configuration issue.
You can use the below steps to verify this condition and fix the problem:
1) Check for errors in the clusterinfo pod logs:
oc logs <clusterinfo-pod-name>
2) Check the clusterInfo.log from inside the pod:
oc rsh <clusterinfo-pod-name>
cd /tmp
cat clusterInfo.log
NOTE: If you cannot log in to the pod, try restarting it using: oc delete po <clusterinfo-pod-name>. Not being able to log in is also an indication of a configuration issue.
Here is an example of the message that confirms the permission issue:
WARN io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.onFailure - Exec Failure: HTTP 401, Status: 401 - Unauthorized
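You can also check for this message without opening a shell in the pod (a convenience sketch, assuming the /tmp/clusterInfo.log path shown above and that grep is available in the clusterinfo image):
oc exec <clusterinfo-pod-name> -- grep -i "unauthorized" /tmp/clusterInfo.log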
Solution:
1. Download and copy the attached clusterroles_uma_caapm.yaml to your OpenShift cluster
2. Recreate UMA cluster roles:
oc delete -f clusterroles_uma_caapm.yaml -n caapm
oc create -f clusterroles_uma_caapm.yaml -n caapm
oc delete pod <clusterinfo-pod-name>
oc delete pod <container-monitor-pod-name>
3. Verify that <clusterinfo-pod-name> is no longer restarting and that the "Unauthorized" error is no longer reported in clusterInfo.log
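For example, to confirm that the pod has stopped restarting you can watch its restart count with standard oc/kubectl usage; the RESTARTS column should stop increasing once the cluster roles are recreated:
oc get pod <clusterinfo-pod-name> -n caapm -w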
Checklist
1) Review the app-container-monitor pod logs
2) Check that there are no errors in the podmonitor container:
kubectl logs <app-container-monitor-pod-name> -c podmonitor -n caapm
3) The expected message confirming that the agent has been injected into the pod is:
[INFO] [IntroscopeAgent.AutoAttach.Java.UnixContainerAttacher] Attach successful for pid 1 in container
4) In your application pod, a /tmp/ca-deps/wily folder should have been created. Here is an example of how to confirm that the java agent has been attached:
kubectl exec -ti <your-app-pod> -n dockerapp -- bash
cd /tmp/ca-deps/wily/logs/
ls
Agent.jar common core examples logs
AgentNoRedefNoRetrans.jar connectors deploy extensions tools
cd logs
ls -l
total 1496
-rw-r-----. 1 root root 1370757 Jul 27 16:56 AutoProbe.log
-rw-r-----. 1 root root 159419 Jul 27 16:56 IntroscopeAgent.log
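As a non-interactive alternative to the steps above (same example namespace dockerapp; adjust it to your application's namespace), you can list the agent log directory directly:
kubectl exec <your-app-pod> -n dockerapp -- ls -l /tmp/ca-deps/wily/logs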
5) Make sure the "tar" command is available in the app image
The "tar" Unix command is required to unpack the agent package inside the container.
If a required utility is missing, the app-container-monitor-<podname>.log will show an error such as the following (in this example, "xargs" is missing):
dataprovider.go:567] Err while executing command '["sh" "-c" "systick=$(getconf CLK_TCK); for c in /proc/*/cmdline; do d=$(dirname $c); name=$(grep Name: $d/status 2>/dev/null) || continue; pid=$(basename $d); uid=$(grep Uid: $d/status 2>/dev/null) || continue; uid=$(echo ${uid#Uid:} | xargs); uid=${uid%% *}; cmdline=$(cat $c|xargs -0 echo 2>/dev/null) || continue; starttime=$(($(awk '{print $22}' $d/stat 2>/dev/null || echo 0) / systick)); uptime=$(awk '{print int($1)}' /proc/uptime); elapsed=$(($uptime-$starttime)); echo $pid $uid $elapsed $cmdline; done"]' for container "#######": err: <nil>, result -> out: , err: sh: xargs: command not found
sh: xargs: command not found
sh: xargs: command not found
..
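To check whether the required utilities are present in the app image (a quick sketch, assuming a POSIX shell is available in the container), you can run:
kubectl exec <your-app-pod> -n <app-ns> -- sh -c 'command -v tar; command -v xargs'
If nothing is printed for a utility, that utility is missing from the image.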
6) Check for possible Memory issues
[INFO] [IntroscopeAgent.AutoAttach.Java.UnixDockerAttacher] Not enough free memory available on host to attach to unbounded container . Skipping attach
Container ... has lesser memory than configured free memory threshold of 50.0%, Skipping attach
Recommendation:
Lower the default memory threshold from 50% to, for example, 25% by changing the value of the environment variable shown below:
- name: apmenv_autoattach_free_memory_threshold
value: "25.00"
When using the Operator, you cannot change anything on the UMA side (the Operator will revert the change). In this case, set the annotation at the application pod or deployment level as below:
oc annotate pod <pod-name> ca.broadcom.com/autoattach.java.attach.overrides=autoattach.free.memory.threshold=20 -n <app-ns> --overwrite
oc annotate deployment <deployment-name> ca.broadcom.com/autoattach.java.attach.overrides=autoattach.free.memory.threshold=20 -n <app-ns> --overwrite
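To confirm that the annotation was applied (standard oc/kubectl usage), check the pod's annotations; the output should include the ca.broadcom.com/autoattach.java.attach.overrides key with the value you set:
oc get pod <pod-name> -n <app-ns> -o jsonpath='{.metadata.annotations}'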
7) Check for a possible unsupported JVM
[INFO] [IntroscopeAgent.AutoAttach.Java.UnixDockerAttacher] Process 1 in container .. is an unsupported JVM. Skipping attach. JVMInfo: JVMInfo{ binaryPath='/usr/lib/jvm/java-1.8-openjdk/jre/bin/java', vendorName='IcedTea', vmName='OpenJDK 64-Bit Server VM', runtimeVersion='1.8.0_212-b04', specificationVersion='8' }
OR
[INFO] [IntroscopeAgent.AutoAttach.Java.UnixContainerAttacher] Could not retrieve tools.jar in container [ namespace...9d3ddc58 ], please set autoattach.java.tools.repo.url property via annotation or autoattach property and restart app container. See details in the APM documentation for use.
[INFO] [IntroscopeAgent.AutoAttach.Java.UnixContainerAttacher] If this is WebSphere Liberty container, please use annotation ca.broadcom.com/autoattach.java.attach.overrides: autoattach.java.filter.jvms=false
Recommendation:
Add the below environment variable to the podmonitor container (in the same section where the memory threshold variable above is defined). This makes UMA attempt to attach java agents to containers that are using unsupported JVMs.
- name: apmenv_autoattach_java_filter_jvms
value: "false"
When using the Operator, you cannot change anything on the UMA side (the Operator will revert the change). In this case, set the annotation at the application pod or deployment level as below:
oc annotate pod <pod-name> ca.broadcom.com/autoattach.java.attach.overrides=autoattach.java.filter.jvms=false -n <app-ns> --overwrite
oc annotate deployment <deployment-name> ca.broadcom.com/autoattach.java.attach.overrides=autoattach.java.filter.jvms=false -n <app-ns> --overwrite
8) Check whether the Java agent cannot be injected because of a permission issue
A non-root user may not be able to create a new directory in the pod into which the java agent can be copied.
Recommendation:
"exec" into the container and then create a folder like /opt (or anything else) and then use the below annotation so Java agent is deployed in that folder:
kubectl annotate pod <application podname> ca.broadcom.com/autoattach.java.attach.overrides=autoattach.java.agent.deps.directory=/opt
If that works, modify your Docker app image(s) to provide a writable directory that UMA can use to inject the agent.
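As an illustration only (the directory name and user/group IDs are hypothetical; adjust them to the user your image actually runs as), the Docker image change could look like this:
RUN mkdir -p /opt/ca-agent && chown 1001:0 /opt/ca-agent && chmod g+rwX /opt/ca-agent
You would then point the autoattach.java.agent.deps.directory override at /opt/ca-agent using the annotation shown above.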
9) Check if the issue is related to java itself
[INFO] [IntroscopeAgent.AutoAttach.Java.UnixDockerEnricher] Process 1 in container .. could not get jvm information. Skipping attach
Recommendation:
Exec into the pod and try to run java, and make sure it runs successfully. A failing java binary explains the above message and why the java agent could not be added to the container.
In one specific case the solution was to remove the JAVA_TOOL_OPTIONS environment variable. Contact your application team to fix such java issues.
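A simple way to test this (standard kubectl usage; the namespace placeholder is yours to fill in) is to run java directly in the container and to check whether JAVA_TOOL_OPTIONS is set:
kubectl exec -ti <your-app-pod> -n <app-ns> -- java -version
kubectl exec <your-app-pod> -n <app-ns> -- sh -c 'echo $JAVA_TOOL_OPTIONS'
If java -version fails or prints errors, the agent cannot be attached until the underlying java problem is fixed.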
If you have already configured dynamic property resolution, either on the UMA side or via autoattach override annotations, it takes precedence over any environment variables that are set.
To check this, look at the "/tmp/ca-deps/ca-apm-java-agent.options" file: if a property is already added for an agent name in this file, it takes precedence over the environment variable.
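For example, to view that file from outside the pod (assuming the path shown above):
kubectl exec <your-app-pod> -n <app-ns> -- cat /tmp/ca-deps/ca-apm-java-agent.options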
For more information refer to https://techdocs.broadcom.com/us/en/ca-enterprise-software/it-operations-management/dx-apm-agents/SaaS/Universal-Monitoring-Agent/Install-the-Universal-Monitoring-Agent/Install-UMA-for-OpenShift/Install-and-configure-uma-using-openshift-operator.html
Checklist:
1)
ERROR c.c.a.b.s.OpenshiftClusterCrawlerService.watchDeploymentConfigs - error occurred in watchDeploymentConfigs, null
Exception in thread "OkHttp Dispatcher" java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
Solution
The clusterinfo java process has insufficient memory. Increase the max heap to 1024m, as shown in the line below (part of the clusterinfo deployment configuration in the UMA yaml file), and redeploy UMA. Increasing the memory should resolve the issue.
command: ["/usr/local/openshift/apmia/jre/bin/java", "-Xms64m","-Xmx1024m", "-Dlogging.config=file:/usr/local/openshift/logback.xml", "-jar", "/clusterinfo-1.0.jar"]
2)
oc logs pod/container-monitor-7dcdbc5fb8-6vcvq
[ERROR] [IntroscopeAgent.GraphSender] error occurred while sending graph to EM, null
java.lang.NullPointerException
at com.ca.apm.clusterdatareporter.K8sMetaDataGraphAttributeDecorator.getGraph(K8sMetaDataGraphAttributeDecorator.java:107)
at java.lang.Iterable.forEach(Iterable.java:75)
at com.ca.apm.clusterdatareporter.K8sMetaDataGraphAttributeDecorator.getGraph(K8sMetaDataGraphAttributeDecorator.java:93)
Reason:
If you are using a 10.7 EM, this error can be ignored; there is no loss of functionality.
3) The below error is reported continuously, every 2 minutes:
[ERROR] [IntroscopeAgent.GraphSender] error occurred while sending graph to EM, null
java.lang.NullPointerException
Solution:
If you are using SaaS APM, set agentManager_version to an empty value (i.e. "") by changing the parameter "agentManager_version: """ in the yaml.
NOTE: If you are using APM EM 10.7, set the version to 10.7 instead (see the example below). This is required to allow UMA to connect to APM EM 10.7. This property is the equivalent of "introscope.agent.connection.compatibility.version" in the Java agent.
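For example, in the UMA yaml (same key as used for the SaaS case above):
agentManager_version: "10.7"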
If you notice many app-container-monitor pods reporting the above message:
This is a known issue fixed in UMA 21.4.
Recommendation: upgrade to UMA 21.11 or later.
Changing the java agent properties attached to the application can be done through the annotation method documented in the section above. Below is another example, where the values of multiple agent properties are changed:
In openshift:
oc annotate deployment <deployment name> ca.broadcom.com/autoattach.java.agent.overrides="introscope.autoprobe.logclassdetails.enabled=true,introscope.autoprobe.enable.tracergroup.ClassLocationTracing=true,introscope.agent.log.level.root=DEBUG" -n <namespace>
Option 1 (Recommended): use https://packages.broadcom.com/artifactory/apm-agents/getUmaLogs.sh to collect the full set of logs
Option 2: Collect the logs from the below pods:
-app-container-monitor-* (there should be 1 pod for each node)
-cluster-performance-prometheus-*
-clusterinfo-*
-container-monitor-*
Here is an example of the commands (if you are using OpenShift you can use the "oc" command instead):
kubectl logs <app-container-monitor-pod-name> --all-containers -n caapm
kubectl logs <app-container-monitor-pod-name> -c podmonitor -n caapm
kubectl logs <cluster-performance-prometheus-pod-name> --all-containers -n caapm
kubectl logs <clusterinfo-pod-name> --all-containers -n caapm
kubectl logs <container-monitor-pod-name> --all-containers -n caapm
NOTE: If the issue is related to the java agent not getting injected as expected, the most important log to collect is the app-container-monitor-<pod-name> log. There should be one app-container-monitor pod on each node, so make sure to collect the log from the node where the issue is happening (the node where your java application is running).
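To identify which app-container-monitor pod is running on the node that hosts your application, you can use standard kubectl commands with the -o wide option (which adds a NODE column) and match the NODE value of your application pod against those of the app-container-monitor pods:
kubectl get pod <your-app-pod> -n <app-ns> -o wide
kubectl get pods -n caapm -o wide | grep app-container-monitor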