Troubleshooting unknown/unhealthy Kubernetes (k8s) clusters

Products

CloudHealth

Issue/Introduction

A cluster is reported as unknown/unhealthy when our collector cannot communicate with the CloudHealth Kubernetes endpoint.

Resolution

1. The first and easiest check is to confirm that the network can reach the containers-api.edge.cloudhealthtech.com/v1/containers/ and api.cloudhealthtech.com endpoints on port 443. Both use HTTPS; neither uses WebSocket.
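One way to run this check from a host on the affected network is a small shell function around curl. This is a minimal sketch: the function name check_endpoint and the 10-second timeout are illustrative assumptions, not part of any CloudHealth tooling. It only reports whether an HTTPS connection on port 443 could be established, not whether the API accepted the request.

```shell
#!/bin/sh
# check_endpoint HOST: try to open an HTTPS connection to HOST on port 443.
# Hypothetical helper; the name and the 10-second timeout are assumptions.
check_endpoint() {
  host="$1"
  # The response body is discarded. curl exits non-zero on DNS, TCP, or TLS
  # failures, but exits zero on any HTTP status, which is what we want here.
  if curl --silent --output /dev/null --max-time 10 "https://${host}:443/"; then
    echo "OK   ${host}:443 reachable"
  else
    echo "FAIL ${host}:443 not reachable"
    return 1
  fi
}
```

For the two hosts named in this check, that would be check_endpoint containers-api.edge.cloudhealthtech.com followed by check_endpoint api.cloudhealthtech.com.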

2. The second check is to run two curl commands from the deployed Kubernetes collector pod to confirm it can connect to our endpoints. These MUST be run from inside the collector pod, not from elsewhere in the cluster.

Enter the pod with:

kubectl exec -i --tty <pod-name> -- sh
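If the collector pod's name is not known, it can be looked up first. The sketch below assumes the pod name contains "cloudhealth" (the Helm release used later in this article is named cloudhealth-collector); manually deployed collectors may be named differently, so the pattern can be overridden. The function name find_collector_pods is hypothetical.

```shell
#!/bin/sh
# find_collector_pods [PATTERN]: list pods in all namespaces whose output line
# matches PATTERN (case-insensitive). The default pattern is an assumption.
find_collector_pods() {
  pattern="${1:-cloudhealth}"
  # Output columns: NAMESPACE NAME READY STATUS RESTARTS AGE
  kubectl get pods --all-namespaces --no-headers | grep -i -- "$pattern"
}
```

The first column of each matching line is the namespace and the second is the pod name to pass to kubectl exec (add --namespace if the pod is not in the current namespace).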

Run the first curl to our health endpoint:

curl -v -X GET https://containers-api.edge.cloudhealthtech.com/api/v1/health

The expected response:

{"status":"healthy","time":"Fri, 29 Jan 2021 22:48:10 GMT"}

Run the second curl to mimic the request made by the Kubernetes collector (but without any k8s data cache payload).
Be sure to fill in the auth_token, cluster_id, k8s_agent_version, and cluster_uid in the curl:
- The auth_token can be pulled from the collector deployment page
- The cluster_id is the name of the cluster the collector is deployed to
- The k8s_agent_version can be found on the clusters page https://apps.cloudhealthtech.com/containers_clusters, or at https://github.com/CloudHealth/helm/blob/main/cloudhealth-collector-image-docs/CHANGELOG.md
- The cluster_uid can be pulled from the clusters page https://apps.cloudhealthtech.com/containers_clusters, or from the pod logs

curl --request POST \
  --url 'https://containers-api.edge.cloudhealthtech.com/v2/containers/kubernetes/state?auth_token=EnterToken&cluster_id=EnterName&sample_time=1729088736612&k8s_object_type=cronjobs&k8s_agent_version=EnterVersion&cluster_uid=EnterUID' \
  --header 'Content-Type: application/json' \
  --data '{}'

The expected response (since we sent no payload):

{"result":201}

If either of these curls does not give the expected response, go back to check 1 and confirm that the expected endpoints are whitelisted on your network.
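To avoid hand-editing the long query string in the second curl, the URL can be assembled by a small helper. This is a sketch under assumptions: the function name build_state_url is hypothetical, and sample_time is assumed to be Unix epoch time in milliseconds, which is inferred only from the 13-digit example value above. Values are inserted as-is, without URL encoding.

```shell
#!/bin/sh
# build_state_url TOKEN CLUSTER_ID AGENT_VERSION CLUSTER_UID [SAMPLE_TIME_MS]
# Prints the state URL used by the second curl. Hypothetical helper.
build_state_url() {
  token="$1"; cluster_id="$2"; agent_version="$3"; cluster_uid="$4"
  # Assumption: sample_time is epoch milliseconds; default to the current
  # second with "000" appended.
  sample_time="${5:-$(date +%s)000}"
  printf '%s?auth_token=%s&cluster_id=%s&sample_time=%s&k8s_object_type=cronjobs&k8s_agent_version=%s&cluster_uid=%s\n' \
    'https://containers-api.edge.cloudhealthtech.com/v2/containers/kubernetes/state' \
    "$token" "$cluster_id" "$sample_time" "$agent_version" "$cluster_uid"
}

# Example (run inside the collector pod, with your own values):
#   curl --request POST \
#     --url "$(build_state_url "$TOKEN" "$CLUSTER_NAME" "$AGENT_VERSION" "$CLUSTER_UID")" \
#     --header 'Content-Type: application/json' \
#     --data '{}'
```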

3. While we do not officially support proxy configuration, defining the JAVA_OPTS environment variable on the collector has been known to work:

env:
  - name: JAVA_OPTS
    value: "-Dhttp.proxyHost=<PROXY> -Dhttp.proxyPort=8989 -Dhttp.nonProxyHosts=kubernetes.default.svc -Dhttps.proxyHost=<PROXY> -Dhttps.proxyPort=8989 -Dhttps.nonProxyHosts=kubernetes.default.svc"
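Because the JAVA_OPTS value is long and easy to mistype, it can be generated from the proxy host and port. The helper below is a hypothetical sketch (build_java_opts is not part of the collector) that emits the same six properties in the same order as the example above; proxy configuration remains unsupported either way.

```shell
#!/bin/sh
# build_java_opts PROXY_HOST PROXY_PORT [NON_PROXY_HOSTS]
# Prints a JAVA_OPTS value with the six proxy properties used in this article.
build_java_opts() {
  proxy_host="$1"
  proxy_port="$2"
  # Keep in-cluster API traffic off the proxy by default.
  non_proxy="${3:-kubernetes.default.svc}"
  printf '%s\n' "-Dhttp.proxyHost=${proxy_host} -Dhttp.proxyPort=${proxy_port} -Dhttp.nonProxyHosts=${non_proxy} -Dhttps.proxyHost=${proxy_host} -Dhttps.proxyPort=${proxy_port} -Dhttps.nonProxyHosts=${non_proxy}"
}
```

The printed string can be pasted into the value field, or applied with kubectl set env deployment/<deployment-name> JAVA_OPTS="$(build_java_opts <PROXY> 8989)", where the deployment name depends on how the collector was installed.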

4. If the customer is still having trouble, request the pod logs, which can be collected with the following command (it does not need to be run from inside the pod):

kubectl logs --namespace default <pod-name>

Some important things to look for in the logs are the defined containers API endpoint, the agent version, and the cluster UID.

Containers API endpoint: If the CHT_REGION variable is changed to anything other than us-east-1, the collector will fail to connect. (Setting variables is the first step in the deployment instructions, so you can refer back to that guide.) For example, one customer changed the variable to us-west-2, and the logs showed an error where the collector could not connect to the us-west-2 endpoint. The Containers team will eventually add support for other regions.

The agent version: Be sure to check the agent version against the tags listed at https://hub.docker.com/r/cloudhealth/container-collector/tags and confirm it is the most recent one, which is what we always recommend.

The cluster UID: If you don’t see the cluster UID defined, have the customer run the upgrade command below. (This applies to Helm-deployed collectors; manually deployed collectors should refer to the manual install guide.)

helm upgrade cloudhealth-collector cloudhealth/cloudhealth-collector
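When reviewing the pod logs from check 4, a keyword filter can narrow down the lines worth reading. This is only a sketch: the exact wording of the collector's log messages is not documented here, so the keyword list (error, exception, endpoint, version, uid) is an assumption and may need adjusting, and scan_collector_logs is a hypothetical name.

```shell
#!/bin/sh
# scan_collector_logs POD [NAMESPACE]: print log lines, prefixed with their
# line numbers, that mention any of the assumed keywords (case-insensitive).
scan_collector_logs() {
  pod="$1"
  namespace="${2:-default}"
  kubectl logs --namespace "$namespace" "$pod" \
    | grep -inE 'error|exception|endpoint|version|uid'
}
```

If the filter prints nothing, read the full log instead, since the relevant messages may use different wording.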