Upgrading a Tanzu Kubernetes Grid Integrated Edition cluster hangs on worker or control plane node and task will not cancel

Article ID: 317026

Products

Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:
  • During an upgrade, and possibly during other tasks that require the BOSH Director to delete a Tanzu Kubernetes Grid Integrated Edition (TKGI) virtual machine, the task may hang.
  • BOSH sends a delete_vm request to the CPI, and nothing further is logged after that.
  • Messages similar to the following appear in the bosh task --debug output:
D, [2021-08-04T13:40:58.100149 #24551] [instance_update(master/12345678-abcd-ef12-4321-12345678abcd (2))] DEBUG -- DirectorJobRunner: [external-cpi] [cpi-123456] request: {"method":"delete_vm","arguments":["vm-f343da11-337c-78ab-f123-4567ad2ab12346"],"context":{"director_uuid":"87654321-1234-dcba-4321-1234567890ab","request_id":"cpi-123456","vm":{"stemcell":{"api_version":3}},"datacenters":"<redacted>","default_disk_type":"<redacted>","host":"<redacted>","nsxt":"<redacted>","password":"<redacted>","user":"<redacted>"},"api_version":1} with command: /var/vcap/jobs/vsphere_cpi/bin/cpi
  • The log shows the BOSH Director running the /var/vcap/jobs/vsphere_cpi/bin/cpi command. This is the CPI binary on the BOSH Director VM. It is expected to carry out the request it receives, in this case deleting the VM on the cloud provider (vCenter in this example). When this issue occurs, the CPI process appears to hang and the task makes no further progress.
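
To confirm that a task stalled at this point, the debug output can be filtered for delete_vm requests. The following is a minimal sketch, not an official tool: the function name find_delete_vm_requests is hypothetical, and the pattern assumes the log layout shown in the sample message above.

```shell
# Hypothetical helper: read BOSH debug output on stdin and print the
# CPI request ID and the VM CID for every delete_vm request it finds.
# Assumes the "[cpi-...] request: {...}" layout from the sample message.
find_delete_vm_requests() {
  sed -n 's/.*\[\(cpi-[^]]*\)] request: {"method":"delete_vm","arguments":\["\([^"]*\)"].*/\1 \2/p'
}

# Usage (replace <task_id> with the ID of the hung task):
#   bosh task <task_id> --debug | find_delete_vm_requests
```

If the last request printed has no later log activity for the same request ID, the task is likely waiting on the CPI process described above.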


Environment

VMware Tanzu Kubernetes Grid Integrated Edition 1.x

Resolution

This is a known issue affecting TKGI. There is currently no resolution.

Workaround:
From the Ops Manager VM, SSH to the BOSH Director VM and kill the CPI process described in the Symptoms section of this article.

Notes:
  • The credentials for the BOSH Director VM are on the Credentials tab of the BOSH Director tile.
  • To find the process ID of the CPI process, run ps aux | grep cpi, then terminate it with kill -9 <process_id>.
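
The process lookup can be sketched as a small shell helper. This is an illustrative example only: the function name find_cpi_pids is hypothetical, and it assumes, as the ps aux | grep cpi command in this article does, that the command line of the hung process contains the string cpi.

```shell
# Hypothetical helper: read `ps aux` output on stdin and print the PID
# (second column) of each process whose line contains "cpi".
# Lines for grep or awk are skipped so the filter does not list itself.
find_cpi_pids() {
  awk '$0 ~ /cpi/ && $0 !~ /(grep|awk)/ { print $2 }'
}

# Usage on the BOSH Director VM:
#   ps aux | find_cpi_pids     # review the listed PIDs first
#   kill -9 <process_id>       # then kill the hung CPI process
```

Review the listed processes before killing anything, because a match on cpi alone can include processes other than the hung one.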