The purpose of this article is to guide an operator through the steps that can be taken to troubleshoot TKGI upgrade-related issues. An upgrade can fail for different reasons, and to troubleshoot effectively we first need to understand the TKGI upgrade workflow.
A TKGI upgrade runs the upgrade-all-service-instances errand as part of Apply Changes. The purpose of the upgrade-all-service-instances errand is to re-deploy each service instance (TKGI cluster) deployment and to re-run the service instance errands. Separate BOSH tasks are executed against each service instance (TKGI cluster) deployment (service-instance_####).
The sequence diagram below shows the workflow for the upgrade-all-service-instances errand.
Note: SI API stands for the TKGI API
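For reference, Ops Manager invokes this errand with the BOSH CLI against the TKGI tile (pivotal-container-service) deployment, as can be seen at the end of the sample deployment log later in this article. A minimal sketch using that sample deployment name (substitute the deployment name from your own environment):
bosh -d pivotal-container-service-91ee947fa04ad6a35686 run-errand upgrade-all-service-instances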
Product Version: All
1. What are the pre-upgrade and target product versions?
2. Capture the Ops Manager deployment logs.
4. Capture the recent BOSH tasks for the pivotal-container-service (TKGI tile) deployment:
bosh tasks -r=50 | grep 'pivotal-container-service' | grep -v snapshot
5. Capture the BOSH logs for the pivotal-container-service VM (TKGI tile instance) or the service instance (TKGI cluster) that failed (see the example after this list):
bosh -d <deployment-name> logs
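For example, using the deployment names that appear in the sample output later in this article (a sketch; verify the deployment names in your environment with bosh deployments):
# Logs from the TKGI tile (pivotal-container-service) deployment
bosh -d pivotal-container-service-91ee947fa04ad6a35686 logs
# Logs from a specific service instance (TKGI cluster) deployment
bosh -d service-instance_33da4e8d-2b2f-4146-ad79-5e7538485eea logs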
The deployment logs have all the details needed to determine why the upgrade has failed.
To troubleshoot an upgrade failure you will need to identify the following:
1. Which step of the deployment was executing at the time of the failure
2. The BOSH task number that was running at the time
3. Which VM was being updated or created when the failure occurred
4. Which service failed on the VM and which script (pre-start, post-start, the process itself, drain, etc.) was running when it failed
Sample failed TKGI upgrade deployment log:
Task 11616 | 13:46:35 | Running errand: pivotal-container-service/68ece7b6-ea70-47b5-93d8-de0524f95fe0 (0) (94:57:07)
Task 11616 | 12:43:42 | Fetching logs for pivotal-container-service/68ece7b6-ea70-47b5-93d8-de0524f95fe0 (0): Finding and packing log files (00:00:01)

Task 11616 Started  Thu Jan 24 13:46:33 UTC 2019
Task 11616 Finished Mon Jan 28 12:43:43 UTC 2019
Task 11616 Duration 94:57:10
Task 11616 done

Instance   pivotal-container-service/68ece7b6-ea70-47b5-93d8-de0524f95fe0
Exit Code  1
Stdout     [upgrade-all-service-instances] 2019/01/24 13:46:35.218678 [upgrade-all] STARTING OPERATION with 1 concurrent workers
[upgrade-all-service-instances] 2019/01/24 13:46:36.055187 [upgrade-all] Service Instances: 33da4e8d-2b2f-4146-ad79-5e7538485eea dfe67eef-a302-4468-8c7b-fbc6c7dd9e92
[upgrade-all-service-instances] 2019/01/24 13:46:36.055205 [upgrade-all] Total Service Instances found: 2
[upgrade-all-service-instances] 2019/01/24 13:46:36.055210 [upgrade-all] Processing all instances. Attempt 1/5
[upgrade-all-service-instances] 2019/01/24 13:46:36.055220 [upgrade-all] [33da4e8d-2b2f-4146-ad79-5e7538485eea] Starting to process service instance 1 of 2
[upgrade-all-service-instances] 2019/01/24 13:46:38.095906 [upgrade-all] [33da4e8d-2b2f-4146-ad79-5e7538485eea] Result: operation accepted
[upgrade-all-service-instances] 2019/01/24 13:46:38.096054 [upgrade-all] [33da4e8d-2b2f-4146-ad79-5e7538485eea] Waiting for operation to complete: bosh task id 11617
[upgrade-all-service-instances] 2019/01/28 12:43:41.539829 [upgrade-all] [33da4e8d-2b2f-4146-ad79-5e7538485eea] Result: Service Instance operation failure
[upgrade-all-service-instances] 2019/01/28 12:43:41.540066 [upgrade-all] FINISHED PROCESSING Status: FAILED; Summary: Number of successful operations: 0; Number of service instance orphans detected: 0; Number of deleted instances before operation could happen: 0; Number of busy instances which could not be processed: 0; Number of service instances that failed to process: 1 [33da4e8d-2b2f-4146-ad79-5e7538485eea]
Errand 'upgrade-all-service-instances' completed with error (exit code 1)
Exit code 1
[upgrade-all-service-instances] 2019/01/28 12:43:41.540173 [33da4e8d-2b2f-4146-ad79-5e7538485eea] Operation failed: bosh task id 11617: Failed for bosh task: 11617, error-message: Action Failed get_task: Task d0c8acc5-938e-48cb-4305-c17031970c72 result: 1 of 2 drain scripts failed. Failed Jobs: kubelet....
Stderr     -

1 errand(s)

===== 2019-01-28 12:43:44 UTC Finished "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=10.193.90.11 --deployment=pivotal-container-service-91ee947fa04ad6a35686 run-errand upgrade-all-service-instances"; Duration: 341831s; Exit Status: 1
Exited with 1.
After reviewing the sample deploy log we can determine the following:
1. The upgrade-all-service-instances errand was running.
2. Service instance 33da4e8d-2b2f-4146-ad79-5e7538485eea (the cluster uuid) was being re-deployed when the failure occurred.
3. The operation was being tracked by BOSH task 11617.
The BOSH task will supply further information about what was happening when the failure was reported:
bosh task <task-no> --debug
bosh task <task-no> --cpi
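Because the debug output can be very long, it can help to redirect it to a file for review. A simple sketch using the task number from this example:
bosh task 11617 --debug > task-11617-debug.log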
Using the same example we can get more details for the failed BOSH task 11617:
ubuntu@Ops-man-2-3-7:~$ bosh task 11617
Using environment '10.193.90.11' as client 'ops_manager'

Task 11617

Task 11617 | 13:46:39 | Preparing deployment: Preparing deployment
Task 11617 | 13:46:41 | Warning: DNS address not available for the link provider instance: pivotal-container-service/68ece7b6-ea70-47b5-93d8-de0524f95fe0
Task 11617 | 13:46:41 | Warning: DNS address not available for the link provider instance: pivotal-container-service/68ece7b6-ea70-47b5-93d8-de0524f95fe0
Task 11617 | 13:46:41 | Warning: DNS address not available for the link provider instance: pivotal-container-service/68ece7b6-ea70-47b5-93d8-de0524f95fe0
Task 11617 | 13:46:52 | Preparing deployment: Preparing deployment (00:00:13)
Task 11617 | 13:47:53 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 11617 | 13:47:53 | Updating instance master: master/d0bac94e-a731-45d0-a53c-afa681660a11 (0) (canary) (00:05:15)
Task 11617 | 13:53:08 | Updating instance master: master/41b1e8ec-a4bc-4f8d-ba39-92af84531931 (2) (00:05:15)
Task 11617 | 13:58:23 | Updating instance master: master/79bcdddf-b0a5-43c8-9a3a-992d877c5b96 (1) (00:05:37)
Task 11617 | 14:04:00 | Updating instance worker: worker/5c2c7d1f-7c5d-414c-8c04-e79b6661dc87 (0) (canary) (00:05:16)
Task 11617 | 14:09:16 | Updating instance worker: worker/2e12e3be-97a5-4eef-bff1-f519b36adb81 (1) (94:33:51)
           L Error: Action Failed get_task: Task d0c8acc5-938e-48cb-4305-c17031970c72 result: 1 of 2 drain scripts failed. Failed Jobs: kubelet. Successful Jobs: syslog_forwarder.
Task 11617 | 12:43:07 | Error: Action Failed get_task: Task d0c8acc5-938e-48cb-4305-c17031970c72 result: 1 of 2 drain scripts failed. Failed Jobs: kubelet. Successful Jobs: syslog_forwarder.

Task 11617 Started  Thu Jan 24 13:46:39 UTC 2019
Task 11617 Finished Mon Jan 28 12:43:07 UTC 2019
Task 11617 Duration 94:56:28
Task 11617 error

Capturing task '11617' output:
  Expected task '11617' to succeed but state is 'error'

Exit code 1
The BOSH task output helps to determine the following:
1. The error occurred while updating instance worker/2e12e3be-97a5-4eef-bff1-f519b36adb81.
2. The drain script for the kubelet job failed, while the syslog_forwarder job drained successfully.
Note: The result stated that 1 of 2 drain scripts failed. The successful job was syslog_forwarder and the failed job was kubelet.
We have now determined that the kubelet job's drain script has failed. The next step is to bosh ssh to the VM that failed and check the failed service's logs. The BOSH deployment is service-instance_<cluster uuid> and the VM is worker/2e12e3be-97a5-4eef-bff1-f519b36adb81:
bosh -d service-instance_33da4e8d-2b2f-4146-ad79-5e7538485eea ssh worker/2e12e3be-97a5-4eef-bff1-f519b36adb81
Switch to root user:
sudo -i
Go to the log directory of the failing job (/var/vcap/sys/log/<job name>/):
cd /var/vcap/sys/log/kubelet/
List the files by last updated:
ls -lart
In this example we are focused on drain.stdout.log:
cat drain.stdout.log
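The drain stderr log in the same directory can also be useful (assuming the standard BOSH job log layout shown above):
cat drain.stderr.log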
The above steps can be used to identify why a particular job has failed during the upgrade.
Using the information captured earlier with:
bosh tasks -r=50 | grep 'pivotal-container-service' | grep -v snapshot
We can check whether one of the service instance errands has failed. The upgrade-all-service-instances errand also runs additional errands for each service instance (TKGI cluster), for example the apply-addons and telemetry-agent errands:
1. The apply-addons
errand will deploy system pods to the kube-system
namespace.
2. The telemetry-agent
errand will deploy TKGI system pods to the pks-system
namespace.
The kubectl CLI can be used to check the health of the pods running in these namespaces.
Sample service instance errands output:
ubuntu@Ops-man-2-3-7:~$ bosh tasks -r=30 | grep 'pivotal-container-service' | grep -v snapshot | head -3
12139  done  Wed Feb 6 14:08:40 UTC 2019  Wed Feb 6 14:11:26 UTC 2019  pivotal-container-service-91ee947fa04ad6a35686  service-instance_8015d55e-d3de-470a-ae77-048214148653  run errand telemetry-agent from deployment service-instance_8015d55e-d3de-470a-ae77-048214148653  0 succeeded, 1 errored, 0 canceled
12137  done  Wed Feb 6 14:05:30 UTC 2019  Wed Feb 6 14:08:33 UTC 2019  pivotal-container-service-91ee947fa04ad6a35686  service-instance_8015d55e-d3de-470a-ae77-048214148653  run errand apply-addons from deployment service-instance_8015d55e-d3de-470a-ae77-048214148653  1 succeeded, 0 errored, 0 canceled
12129  done  Wed Feb 6 13:50:23 UTC 2019  Wed Feb 6 14:05:22 UTC 2019  pivotal-container-service-91ee947fa04ad6a35686  service-instance_8015d55e-d3de-470a-ae77-048214148653  create deployment
Note: The errand task may show as done; however, the telemetry-agent errand returned "1 errored", which means it failed.
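Once the underlying issue is resolved, a failed service instance errand can be re-run on its own with the BOSH CLI. A sketch using the deployment and errand name from the sample output above (verify both in your environment):
bosh -d service-instance_8015d55e-d3de-470a-ae77-048214148653 run-errand telemetry-agent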
The following steps will help you to troubleshoot a system pod issue:
1. If the customer is unable to connect using the kubectl CLI, do the following:
2. bosh ssh to a master node (any master for that cluster) and create an alias for kubectl so you can run the commands directly from the master:
alias kubectl="/var/vcap/packages/kubernetes/bin/kubectl --kubeconfig=/var/vcap/jobs/kubernetes-roles/config/kubeconfig"
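A quick sanity check that the alias works from the master node:
kubectl get nodes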
The following kubectl commands can be used to check the health of the pods running in both namespaces:
kubectl get pods -n=kube-system
kubectl get pods -n=pks-system
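The recent events in a namespace can also help to narrow down pod issues (a sketch; the sort flag assumes a reasonably current kubectl version):
kubectl get events -n=<namespace> --sort-by=.lastTimestamp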
If you identify an unhealthy pod then you can run the following to investigate further:
kubectl describe pod <pod name> -n=<namespace>
Note: Refer to the kubectl logs documentation for how to view pod logs, and to the Kubernetes documentation on determining the reason for a pod failure.
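For example, pod logs (including the logs of a previously crashed container) can be viewed as follows (a minimal sketch; pod name and namespace are placeholders):
kubectl logs <pod name> -n=<namespace>
kubectl logs <pod name> -n=<namespace> --previous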
Find the worker node that the failing pod is running on and review the kubelet logs to try to determine why it is failing.
Find the node the pod is running on:
kubectl get pod <pod name> -n=<namespace> -o wide
Determine which worker VM (ip) the node is mapped to:
kubectl get nodes -o wide
Use the BOSH CLI to get the worker VM for that IP:
bosh -d <deployment-name> vms | grep <ip>
bosh ssh to the worker VM and review the kubelet logs in the directory we referred to earlier:
cd /var/vcap/sys/log/kubelet/
This time, review the kubelet.stdout.log:
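For example, the most recent entries and any obvious errors can be pulled with standard shell tools (a sketch; assumes the standard BOSH job log layout, adjust the line counts as needed):
cd /var/vcap/sys/log/kubelet/
tail -n 200 kubelet.stdout.log
grep -i error kubelet.stderr.log | tail -n 50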
Once you determine the reason for the instance/process/pod issue and resolve it, re-run Apply Changes from the Ops Manager UI to continue with the upgrade.
If further assistance is required, reach out to the Tanzu Support Team.