This article provides a step-by-step approach to identifying the source of an issue with a TKGI lifecycle operation.
Refer to the Tanzu Kubernetes Grid Integrated Edition Architecture documentation and the TKGI Control Plane Overview section for more information on the TKGI lifecycle of Kubernetes clusters.
Here, we lay out simple steps for tracking down an issue, which any TKGI operator/administrator can follow to determine whether the issue lies within a running Kubernetes workload, the TKGI infrastructure, or one or more TKGI components.
- A supported Infrastructure Provider (IaaS) such as vSphere, Azure, AWS, or GCP
- A supported CNI such as Antrea or VMware NSX
- BOSH
- Kubernetes clusters
tkgi cluster CLUSTER_NAME
EXAMPLES:
UUID: <BOSH_K8S_CLUSTER_SERVICE_INSTANCE_ID>
Last Action: CREATE
Last Action State: failed
Last Action Description: Instance provisioning failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: <TKGI_CONTROL_PLANE_SERVICE_INSTANCE_ID>, broker-request-id: <BOSH_BROKER_REQUEST_ID>, task-id: <task no.>, operation: create, error-message: 0 succeeded, 1 errored, 0 canceled
OR
UUID: <BOSH_K8S_CLUSTER_SERVICE_INSTANCE_ID>
Last Action: UPGRADE
Last Action State: failed
Last Action Description: Failed for bosh task: <task no.>, error-message: 0 succeeded, 1 errored, 0 canceled
OR SOMETHING SIMILAR TO:
Last Action Description: ... 1 of 6 post-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns, telemetry-agent-image, load-antrea-images, load-images, sink-resources-images ...
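Each of the error descriptions above embeds the BOSH task number needed for the next step. As a convenience, a small shell sketch can pull it out for use with "bosh task" (the description text below is an illustrative sample, not live output):

```shell
# Illustrative only: extract the BOSH task number from a saved
# "tkgi cluster" error description, so it can be passed to "bosh task".
desc='Failed for bosh task: 78, error-message: 0 succeeded, 1 errored, 0 canceled'
task_no=$(printf '%s' "$desc" | sed -n 's/.*bosh task: \([0-9]*\).*/\1/p')
echo "bosh task $task_no"
```

On a real system you would paste the actual Last Action Description into desc, or pipe the tkgi cluster output through the same sed expression.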
Task 78 | 13:31:49 | Updating instance worker: worker/########-####-####-####-############ (1) (00:23:43)
Task 78 | 13:32:40 | Updating instance worker: worker/########-####-####-####-############ (2) (00:24:34)
                   L Error: Action Failed get_task: Task ########-####-####-####-############ result: 1 of 6 post-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns, telemetry-agent-image, load-antrea-images, load-images, sink-resources-images.
Task 78 | 13:32:40 | Error: Action Failed get_task: Task ########-####-####-####-############ result: 1 of 6 post-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns, telemetry-agent-image, load-antrea-images, load-images, sink-resources-images.
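The same task error line also names the job(s) that failed. A sketch for isolating them (the message text below is a shortened sample of the output above):

```shell
# Illustrative only: pull the failed job name(s) out of a BOSH task error line.
line='1 of 6 post-start scripts failed. Failed Jobs: kubelet. Successful Jobs: bosh-dns, load-images.'
failed_jobs=$(printf '%s' "$line" | sed -n 's/.*Failed Jobs: \([^.]*\)\..*/\1/p')
echo "Failed: $failed_jobs"
```

The failed job name (kubelet in this example) tells you which log directory to inspect on the failed node.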
To troubleshoot a failed cluster deploy:

1. Check the status of the cluster from the TKGI API: tkgi cluster CLUSTER_NAME
2. From the output, note the error description, the BOSH task ID, and any mention of which BOSH job(s) failed.
3. Examine the output of bosh task <task no.> to determine which BOSH node the task failed on and, if mentioned, the first BOSH job that failed.
4. From the above, narrow down the failed NODE and the BOSH job (JOB) associated with it, then review the logs for that NODE and JOB: "bosh ssh" to the node and inspect the logs under /var/vcap/sys/log/JOB_NAME/*, or gather "bosh logs" for the cluster and review them through the downloaded logs bundle.