Unable to recreate VMs with 'bosh cck' or 'bosh recreate' for a TKGI cluster

Article ID: 393476

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

The TKGI cluster Master VMs are in a failing and/or unresponsive state.
Commands such as 'bosh cck' or 'bosh recreate' cannot repair the failed and unresponsive VMs.
The 'bosh tasks' command shows a failed “get_state” task on the failed master instance.
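
These symptoms can be confirmed from the BOSH CLI. The following is a minimal sketch, assuming the BOSH CLI is already logged in to the TKGI BOSH Director. The deployment name service-instance_EXAMPLE is a placeholder, and the run helper is a hypothetical wrapper that only prints each command unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
DEPLOYMENT="${DEPLOYMENT:-service-instance_EXAMPLE}"  # placeholder deployment name
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

check_cluster_state() {
  # List deployments to find the one that backs the TKGI cluster.
  run bosh deployments
  # Show instance and process state; failing masters appear here.
  run bosh -d "$DEPLOYMENT" instances --ps
  # Show recent tasks for the deployment, including failed ones.
  run bosh -d "$DEPLOYMENT" tasks --recent=10
}

check_cluster_state
```

With the default dry-run setting the script prints the three commands so they can be reviewed before being run against the Director.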

Cause

A command or task, such as 'tkgi rotate-certificates --only-nsx', was performed while a Master VM was unavailable. The VM could have been unavailable due to a network blip or some other reason. Subsequent bosh commands such as cck, recreate, or deploy could then have updated one of the master nodes with a different certificate, leaving it out of sync with the other master nodes.

This leaves the cluster in an unhealthy state because the etcd jobs on the Master VMs fail to connect with each other.

Resolution

1.) Run a bosh deploy on the cluster. This is expected to fail. The Master VM on which the deploy fails is the 'first Master' VM in step 2 below.

2.) Run 'bosh ignore' on the 'first Master' VM and redeploy with the command below. The deploy starts by updating the second Master VM. This step is expected to fail because the second etcd instance may be unable to connect to the third etcd instance.

bosh -d <DEPLOYMENT-NAME> deploy <INSTANCE-NAME.yml> --fix

3.) The bosh deploy task fails again because the second master’s etcd instance fails to connect to the third master’s etcd instance.

4.) Run 'bosh ignore' on the second Master VM, then run the bosh deploy --fix command again. This should bring all Master VMs and workers back to a healthy state.

5.) Run 'bosh unignore' on the two masters that were ignored.

6.) Update the TKGI API to reflect the succeeded state by running 'tkgi upgrade-cluster'. Otherwise the cluster still shows as "failed" when checking cluster status with 'tkgi clusters'.
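
The steps above can be sketched as a single command sequence. This is an illustration under stated assumptions, not a supported script. The deployment name, manifest file name, master instance IDs, and cluster name are all placeholders, and the run helper is hypothetical. It is also an assumption that step 1 uses a plain deploy without --fix, because the article does not say. By default the sketch only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
DEPLOYMENT="${DEPLOYMENT:-service-instance_EXAMPLE}"  # placeholder deployment name
MANIFEST="${MANIFEST:-cluster-manifest.yml}"          # placeholder manifest file
FIRST_MASTER="${FIRST_MASTER:-master/FIRST-ID}"       # placeholder: master that fails in step 1
SECOND_MASTER="${SECOND_MASTER:-master/SECOND-ID}"    # placeholder: master that fails in step 2
CLUSTER="${CLUSTER:-example-cluster}"                 # placeholder TKGI cluster name
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo in dry-run mode; otherwise run the command and
# keep going on failure, because the first deploys are expected to fail.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@" || echo "command failed: $*" >&2
  fi
}

recover_masters() {
  # Step 1: initial deploy, expected to fail (assumption: no --fix here).
  run bosh -d "$DEPLOYMENT" deploy "$MANIFEST"
  # Step 2: ignore the first master and redeploy; expected to fail again.
  run bosh -d "$DEPLOYMENT" ignore "$FIRST_MASTER"
  run bosh -d "$DEPLOYMENT" deploy "$MANIFEST" --fix
  # Step 4: ignore the second master and redeploy; this one should succeed.
  run bosh -d "$DEPLOYMENT" ignore "$SECOND_MASTER"
  run bosh -d "$DEPLOYMENT" deploy "$MANIFEST" --fix
  # Step 5: unignore both masters.
  run bosh -d "$DEPLOYMENT" unignore "$FIRST_MASTER"
  run bosh -d "$DEPLOYMENT" unignore "$SECOND_MASTER"
  # Step 6: update the TKGI API state, then check the cluster status.
  run tkgi upgrade-cluster "$CLUSTER"
  run tkgi clusters
}

recover_masters
```

In practice the master instance IDs are only known after each failed deploy, so the steps are better run one at a time than as an unattended script.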
