This article provides a step-by-step approach to identifying the source of an issue with a TKGI lifecycle operation.
Refer to the Tanzu Kubernetes Grid Integrated Edition Architecture documentation and the TKGI Control Plane Overview section for more information on the TKGI lifecycle of Kubernetes clusters.
Here, we lay out simple steps for tracking down an issue, which any TKGI operator/administrator can follow to determine whether the issue lies within a running Kubernetes workload, the TKGI infrastructure, or one or more TKGI components.
- A supported Infrastructure Provider (IaaS) such as vSphere, Azure, AWS, or GCP
- A supported CNI such as Antrea or VMware NSX
- BOSH
- Kubernetes clusters
tkgi cluster CLUSTER_NAME
EXAMPLES:
UUID: <BOSH_K8S_CLUSTER_SERVICE_INSTANCE_ID>
Last Action: CREATE
Last Action State: failed
Last Action Description: Instance provisioning failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: <TKGI_CONTROL_PLANE_SERVICE_INSTANCE_ID>, broker-request-id: <BOSH_BROKER_REQUEST_ID>, task-id: <task no.>, operation: create, error-message: 0 succeeded, 1 errored, 0 canceled
OR
UUID: <BOSH_K8S_CLUSTER_SERVICE_INSTANCE_ID>
Last Action: UPGRADE
Last Action State: failed
Last Action Description: Failed for bosh task: <task no.>, error-message: 0 succeeded, 1 errored, 0 canceled
OR SOMETHING SIMILAR TO:
Last Action Description: ... 1 of 6 post-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns, telemetry-agent-image, load-antrea-images, load-images, sink-resources-images ...
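Each of the error descriptions above embeds the BOSH task number needed for the next step. As a convenience, a small shell sketch can pull it out for use with "bosh task" (the description text below is an illustrative sample, not live output):

```shell
# Illustrative only: extract the BOSH task number from a saved
# "tkgi cluster" error description, so it can be passed to "bosh task".
desc='Failed for bosh task: 78, error-message: 0 succeeded, 1 errored, 0 canceled'
task_no=$(printf '%s' "$desc" | sed -n 's/.*bosh task: \([0-9]*\).*/\1/p')
echo "bosh task $task_no"
```

On a real system you would paste the actual Last Action Description into desc, or pipe the tkgi cluster output through the same sed expression.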
Task 78 | 13:31:49 | Updating instance worker: worker/########-####-####-####-############ (1) (00:23:43)
Task 78 | 13:32:40 | Updating instance worker: worker/########-####-####-####-############ (2) (00:24:34)
                   L Error: Action Failed get_task: Task ########-####-####-####-############ result: 1 of 6 post-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns, telemetry-agent-image, load-antrea-images, load-images, sink-resources-images.
Task 78 | 13:32:40 | Error: Action Failed get_task: Task ########-####-####-####-############ result: 1 of 6 post-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns, telemetry-agent-image, load-antrea-images, load-images, sink-resources-images.
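The same task error line also names the job(s) that failed. A sketch for isolating them (the message text below is a shortened sample of the output above):

```shell
# Illustrative only: pull the failed job name(s) out of a BOSH task error line.
line='1 of 6 post-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns, load-images.'
failed_jobs=$(printf '%s' "$line" | sed -n 's/.*Failed Jobs: \([^.]*\)\..*/\1/p')
echo "Failed: $failed_jobs"
```

The failed job name (kubelet in this example) tells you which log directory to inspect on the failed node.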
To troubleshoot a failed cluster deploy:

1. Check the status of the cluster from the TKGI API: tkgi cluster CLUSTER_NAME
2. From the output, note the error description, the BOSH task ID, and any mention of which BOSH job(s) failed.
3. Examine the output of bosh task <task no.> to determine which BOSH node the task failed on and, if mentioned, the first BOSH job that failed.
4. From the above, narrow down the failed NODE and the BOSH job (JOB) associated with it, then review the logs for that NODE and JOB: "bosh ssh" to the node and inspect the logs under /var/vcap/sys/log/JOB_NAME/*, or gather "bosh logs" for the cluster and review them through the downloaded logs bundle.