The purpose of this article is to guide an operator through the steps that can be taken to troubleshoot TKGI upgrade-related issues. An upgrade can fail for different reasons, and to troubleshoot effectively we first need to understand the TKGI upgrade workflow.
A TKGI upgrade runs the upgrade-all-service-instances errand as part of Apply Changes. The purpose of the upgrade-all-service-instances errand is to re-deploy each service instance (TKGI cluster) deployment and to re-run the service instance errands. Separate BOSH tasks are executed against each service instance (TKGI cluster) deployment (service-instance_####).
The sequence diagram below shows the workflow for the upgrade-all-service-instances errand.
Note: SI API stands for the TKGI API
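For reference, Ops Manager invokes this errand with the BOSH CLI against the TKGI tile (pivotal-container-service) deployment, as can be seen at the end of the sample deployment log later in this article. A minimal sketch using that sample deployment name (substitute the deployment name from your own environment):
bosh -d pivotal-container-service-91ee947fa04ad6a35686 run-errand upgrade-all-service-instances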
Product Version: All
1. What are the pre-upgrade and target product versions?
2. Capture the Ops Manager deployment logs.
4. Capture the recent BOSH tasks for the pivotal-container-service (TKGI tile) deployment:
bosh tasks -r=50 | grep 'pivotal-container-service' | grep -v snapshot
5. Capture the BOSH logs for the pivotal-container-service VM (TKGI tile instance) or the service instance (TKGI cluster) that failed (see the example after this list):
bosh -d <deployment-name> logs
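For example, using the deployment names that appear in the sample output later in this article (a sketch; verify the deployment names in your environment with bosh deployments):
# Logs from the TKGI tile (pivotal-container-service) deployment
bosh -d pivotal-container-service-91ee947fa04ad6a35686 logs
# Logs from a specific service instance (TKGI cluster) deployment
bosh -d service-instance_33da4e8d-2b2f-4146-ad79-5e7538485eea logs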
The deployment logs have all the details needed to determine why the upgrade has failed.
To troubleshoot an upgrade failure you will need to identify the following:
1. Which step of the deployment was executing at the time of the failure
2. The BOSH task number that was running at the time
3. Which VM was being updated or created when the failure occurred
4. Which service failed on the VM and which script (pre-start, post-start, the process itself, drain, etc.) was running when it failed
Sample failed TKGI upgrade deployment log:
Task 11616 | 13:46:35 | Running errand: pivotal-container-service/68ece7b6-ea70-47b5-93d8-de0524f95fe0 (0) (94:57:07)
Task 11616 | 12:43:42 | Fetching logs for pivotal-container-service/68ece7b6-ea70-47b5-93d8-de0524f95fe0 (0): Finding and packing log files (00:00:01)

Task 11616 Started  Thu Jan 24 13:46:33 UTC 2019
Task 11616 Finished Mon Jan 28 12:43:43 UTC 2019
Task 11616 Duration 94:57:10
Task 11616 done

Instance   pivotal-container-service/68ece7b6-ea70-47b5-93d8-de0524f95fe0
Exit Code  1
Stdout     [upgrade-all-service-instances] 2019/01/24 13:46:35.218678 [upgrade-all] STARTING OPERATION with 1 concurrent workers
[upgrade-all-service-instances] 2019/01/24 13:46:36.055187 [upgrade-all] Service Instances: 33da4e8d-2b2f-4146-ad79-5e7538485eea dfe67eef-a302-4468-8c7b-fbc6c7dd9e92
[upgrade-all-service-instances] 2019/01/24 13:46:36.055205 [upgrade-all] Total Service Instances found: 2
[upgrade-all-service-instances] 2019/01/24 13:46:36.055210 [upgrade-all] Processing all instances. Attempt 1/5
[upgrade-all-service-instances] 2019/01/24 13:46:36.055220 [upgrade-all] [33da4e8d-2b2f-4146-ad79-5e7538485eea] Starting to process service instance 1 of 2
[upgrade-all-service-instances] 2019/01/24 13:46:38.095906 [upgrade-all] [33da4e8d-2b2f-4146-ad79-5e7538485eea] Result: operation accepted
[upgrade-all-service-instances] 2019/01/24 13:46:38.096054 [upgrade-all] [33da4e8d-2b2f-4146-ad79-5e7538485eea] Waiting for operation to complete: bosh task id 11617
[upgrade-all-service-instances] 2019/01/28 12:43:41.539829 [upgrade-all] [33da4e8d-2b2f-4146-ad79-5e7538485eea] Result: Service Instance operation failure
[upgrade-all-service-instances] 2019/01/28 12:43:41.540066 [upgrade-all] FINISHED PROCESSING Status: FAILED; Summary: Number of successful operations: 0; Number of service instance orphans detected: 0; Number of deleted instances before operation could happen: 0; Number of busy instances which could not be processed: 0; Number of service instances that failed to process: 1 [33da4e8d-2b2f-4146-ad79-5e7538485eea]
Errand 'upgrade-all-service-instances' completed with error (exit code 1)
Exit code 1
[upgrade-all-service-instances] 2019/01/28 12:43:41.540173 [33da4e8d-2b2f-4146-ad79-5e7538485eea] Operation failed: bosh task id 11617: Failed for bosh task: 11617, error-message: Action Failed get_task: Task d0c8acc5-938e-48cb-4305-c17031970c72 result: 1 of 2 drain scripts failed. Failed Jobs: kubelet....
Stderr     -

1 errand(s)

===== 2019-01-28 12:43:44 UTC Finished "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=10.193.90.11 --deployment=pivotal-container-service-91ee947fa04ad6a35686 run-errand upgrade-all-service-instances"; Duration: 341831s; Exit Status: 1
Exited with 1.
After reviewing the sample deploy log we can determine the following:
1. The upgrade-all-service-instances errand was running.
2. Service instance 33da4e8d-2b2f-4146-ad79-5e7538485eea (the cluster uuid) was being re-deployed when the failure occurred.
3. The operation was being tracked by BOSH task 11617.
The BOSH task will supply further information about what was happening when the failure was reported:
bosh task <task-no> --debug
bosh task <task-no> --cpi
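Because the debug output can be very long, it can help to redirect it to a file for review. A simple sketch using the task number from this example:
bosh task 11617 --debug > task-11617-debug.log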
Using the same example we can get more details for the failed BOSH task 11617:
ubuntu@Ops-man-2-3-7:~$ bosh task 11617
Using environment '10.193.90.11' as client 'ops_manager'

Task 11617

Task 11617 | 13:46:39 | Preparing deployment: Preparing deployment
Task 11617 | 13:46:41 | Warning: DNS address not available for the link provider instance: pivotal-container-service/68ece7b6-ea70-47b5-93d8-de0524f95fe0
Task 11617 | 13:46:41 | Warning: DNS address not available for the link provider instance: pivotal-container-service/68ece7b6-ea70-47b5-93d8-de0524f95fe0
Task 11617 | 13:46:41 | Warning: DNS address not available for the link provider instance: pivotal-container-service/68ece7b6-ea70-47b5-93d8-de0524f95fe0
Task 11617 | 13:46:52 | Preparing deployment: Preparing deployment (00:00:13)
Task 11617 | 13:47:53 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 11617 | 13:47:53 | Updating instance master: master/d0bac94e-a731-45d0-a53c-afa681660a11 (0) (canary) (00:05:15)
Task 11617 | 13:53:08 | Updating instance master: master/41b1e8ec-a4bc-4f8d-ba39-92af84531931 (2) (00:05:15)
Task 11617 | 13:58:23 | Updating instance master: master/79bcdddf-b0a5-43c8-9a3a-992d877c5b96 (1) (00:05:37)
Task 11617 | 14:04:00 | Updating instance worker: worker/5c2c7d1f-7c5d-414c-8c04-e79b6661dc87 (0) (canary) (00:05:16)
Task 11617 | 14:09:16 | Updating instance worker: worker/2e12e3be-97a5-4eef-bff1-f519b36adb81 (1) (94:33:51)
           L Error: Action Failed get_task: Task d0c8acc5-938e-48cb-4305-c17031970c72 result: 1 of 2 drain scripts failed. Failed Jobs: kubelet. Successful Jobs: syslog_forwarder.
Task 11617 | 12:43:07 | Error: Action Failed get_task: Task d0c8acc5-938e-48cb-4305-c17031970c72 result: 1 of 2 drain scripts failed. Failed Jobs: kubelet. Successful Jobs: syslog_forwarder.

Task 11617 Started  Thu Jan 24 13:46:39 UTC 2019
Task 11617 Finished Mon Jan 28 12:43:07 UTC 2019
Task 11617 Duration 94:56:28
Task 11617 error

Capturing task '11617' output:
  Expected task '11617' to succeed but state is 'error'

Exit code 1
The BOSH task output helps to determine the following:
1. The error occurred while updating instance worker/2e12e3be-97a5-4eef-bff1-f519b36adb81.
2. The drain script for the kubelet job failed, while the syslog_forwarder job drained successfully.
Note: The result stated that 1 of 2 drain scripts failed. The successful job was syslog_forwarder and the failed job was kubelet.
We have now determined that the kubelet job's drain script has failed. The next step is to bosh ssh to the VM that failed and check the failed service's logs. The BOSH deployment is service-instance_<cluster uuid> and the VM is worker/2e12e3be-97a5-4eef-bff1-f519b36adb81:
bosh -d service-instance_33da4e8d-2b2f-4146-ad79-5e7538485eea ssh worker/2e12e3be-97a5-4eef-bff1-f519b36adb81
Switch to root user:
sudo -i
Go to the log directory of the failing job (/var/vcap/sys/log/<job name>/):
cd /var/vcap/sys/log/kubelet/
List the files by last updated:
ls -lart
In this example we are focused on drain.stdout.log:
cat drain.stdout.log
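The drain stderr log in the same directory can also be useful (assuming the standard BOSH job log layout shown above):
cat drain.stderr.log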
The above steps can be used to identify why a particular job has failed during the upgrade.
Using the information captured earlier with:
bosh tasks -r=50 | grep 'pivotal-container-service' | grep -v snapshot
We can check whether one of the service instance errands has failed. The upgrade-all-service-instances errand also runs additional errands for each service instance (TKGI cluster), for example the apply-addons and telemetry-agent errands:
1. The apply-addons
errand will deploy system pods to the kube-system
namespace.
2. The telemetry-agent
errand will deploy TKGI system pods to the pks-system
namespace.
The kubectl CLI can be used to check the health of the pods running in these namespaces.
Sample service instance errands output:
ubuntu@Ops-man-2-3-7:~$ bosh tasks -r=30 | grep 'pivotal-container-service' | grep -v snapshot | head -3
12139  done  Wed Feb 6 14:08:40 UTC 2019  Wed Feb 6 14:11:26 UTC 2019  pivotal-container-service-91ee947fa04ad6a35686  service-instance_8015d55e-d3de-470a-ae77-048214148653  run errand telemetry-agent from deployment service-instance_8015d55e-d3de-470a-ae77-048214148653  0 succeeded, 1 errored, 0 canceled
12137  done  Wed Feb 6 14:05:30 UTC 2019  Wed Feb 6 14:08:33 UTC 2019  pivotal-container-service-91ee947fa04ad6a35686  service-instance_8015d55e-d3de-470a-ae77-048214148653  run errand apply-addons from deployment service-instance_8015d55e-d3de-470a-ae77-048214148653  1 succeeded, 0 errored, 0 canceled
12129  done  Wed Feb 6 13:50:23 UTC 2019  Wed Feb 6 14:05:22 UTC 2019  pivotal-container-service-91ee947fa04ad6a35686  service-instance_8015d55e-d3de-470a-ae77-048214148653  create deployment
Note: The errand task may show as done; however, the telemetry-agent errand returned "1 errored", which means it failed.
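Once the underlying issue is resolved, a failed service instance errand can be re-run on its own with the BOSH CLI. A sketch using the deployment and errand name from the sample output above (verify both in your environment):
bosh -d service-instance_8015d55e-d3de-470a-ae77-048214148653 run-errand telemetry-agent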
The following steps will help you to troubleshoot a system pod issue:
1. If the customer is unable to connect using the kubectl CLI, do the following:
2. bosh ssh to a master node (any master for that cluster) and create an alias for kubectl so you can run the commands directly from the master:
alias kubectl="/var/vcap/packages/kubernetes/bin/kubectl --kubeconfig=/var/vcap/jobs/kubernetes-roles/config/kubeconfig"
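A quick sanity check that the alias works from the master node:
kubectl get nodes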
The following kubectl commands can be used to check the health of the pods running in both namespaces:
kubectl get pods -n=kube-system
kubectl get pods -n=pks-system
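The recent events in a namespace can also help to narrow down pod issues (a sketch; the sort flag assumes a reasonably current kubectl version):
kubectl get events -n=<namespace> --sort-by=.lastTimestamp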
If you identify an unhealthy pod then you can run the following to investigate further:
kubectl describe pod <pod name> -n=<namespace>
Note: Refer to the kubectl logs documentation for how to view pod logs, and to the Kubernetes documentation on determining the reason for a pod failure.
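For example, pod logs (including the logs of a previously crashed container) can be viewed as follows (a minimal sketch; pod name and namespace are placeholders):
kubectl logs <pod name> -n=<namespace>
kubectl logs <pod name> -n=<namespace> --previous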
Find the worker node that the failing pod is running on and review the kubelet logs to try to determine why it is failing.
Find the node the pod is running on:
kubectl get pod <pod name> -n=<namespace> -o wide
Determine which worker VM (ip) the node is mapped to:
kubectl get nodes -o wide
Use the BOSH CLI to get the worker VM for that IP:
bosh -d <deployment-name> vms | grep <ip>
bosh ssh to the worker VM and review the kubelet logs in the directory we referred to earlier:
cd /var/vcap/sys/log/kubelet/
This time, review the kubelet.stdout.log:
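For example, the most recent entries and any obvious errors can be pulled with standard shell tools (a sketch; assumes the standard BOSH job log layout, adjust the line counts as needed):
cd /var/vcap/sys/log/kubelet/
tail -n 200 kubelet.stdout.log
grep -i error kubelet.stderr.log | tail -n 50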
Once you determine the reason for the instance/process/pod issue and resolve it, re-run Apply Changes from the Ops Manager UI to continue with the upgrade.
If further assistance is required, reach out to the Tanzu Support Team.