UMA agent is throwing connection exceptions

Article ID: 220614

Updated On:

Products

DX APM SaaS

Issue/Introduction

In the APM UI, the Universal Monitoring Agent (UMA agent) is shown as connected, but the metrics view does not show any metrics for the Kubernetes agent. The UMA agent is installed in the cluster and logs the following exceptions:

[ERROR] [IntroscopeAgent.DefaultMetricCollectionServiceImpl] error occured while getting the api-response (onFailure) from the api-endpoint, http://10.xxx.xxx.49:31314/cluster/namespaces/nodes/stat
java.io.IOException: Canceled
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:260)
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:201)
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

[ERROR] [IntroscopeAgent.KubernetesMonitor.PlatformMonitor] Failed to connect http://10.xxx.xxx.49:31314/pods?ns=<namespace>

Cause

These errors mean that the UMA "clusterinfo" service is not reachable from the agent.

There are three possible causes:

1. The UMA clusterinfo pod/service has been brought down by the cluster admin (for an upgrade or other maintenance).
2. The UMA clusterinfo service port has been changed to a different port.
3. The UMA clusterinfo pod is in a bad state and cannot serve any requests.

Environment

Release : 20.2

Component : APM Agents

Resolution

We suggest the following steps to troubleshoot this issue:

1. Check with the cluster admin on the state of the clusterinfo pod/service, and confirm that the service is listening on node port 31314.
2. If nothing has changed in the UMA agent installation, open a shell inside the clusterinfo pod (kubectl exec -it ...) and run "wget localhost:8080/up". If this URL returns without error, the pod itself is responding and the problem is a connectivity issue in the cluster.
3. If the above test fails, i.e. "wget localhost:8080/up" hangs, restart the clusterinfo pod and add a liveness probe to clusterinfo. Some versions of the UMA agent still run without a liveness probe for clusterinfo.
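
The reachability checks in steps 1 and 2 can also be scripted from any host that has Python 3.9 or later. The following is only a minimal sketch, not part of the UMA tooling: the helper name probe and the 5-second default timeout are assumptions made for illustration. It sends a single GET request and reports whether the endpoint answered, failed, or hung.

```python
import socket
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Send one GET request and report whether the endpoint answered.

    Hypothetical helper for illustration; not part of the UMA agent.
    Returns (ok, detail), where ok is True only for a successful response.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return False, f"HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # Connection refused, unreachable host, DNS failure, or connect timeout.
        return False, f"unreachable: {exc.reason}"
    except (socket.timeout, TimeoutError) as exc:
        # The connection was accepted but no response arrived in time (a hang).
        return False, f"timed out: {exc}"
    except OSError as exc:
        # Any other socket-level failure, e.g. a reset connection.
        return False, f"connection error: {exc}"
```

For example, calling probe("http://<node-ip>:31314/cluster/namespaces/nodes/stat") from a machine that can reach the cluster nodes tests the node port the agent uses, with <node-ip> replaced by the address shown in the agent log. A "timed out" or "unreachable" result there, combined with a successful "wget localhost:8080/up" inside the pod, would point to a connectivity or port problem rather than an unhealthy pod.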