Unable to recreate master instances with 'bosh cck or recreate' for TKGI cluster

Article ID: 393476

Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

  1. The TKGI cluster master instances are in a failing or unresponsive state.
  2. Commands such as `bosh cck` or `bosh recreate` cannot repair the failed or unresponsive instances.
  3. `bosh tasks` shows that the `get_state` task failed on the affected master instances.
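
The symptoms above can be checked from the BOSH CLI. The following is a minimal sketch rather than an official procedure: the deployment name is a placeholder, and the `run` and `inspect_cluster` helpers are hypothetical names introduced only for this example. By default it prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch only: prints the inspection commands; set DRY_RUN=0 to execute them.
# DEPLOYMENT is a placeholder, not a value taken from this article.
DEPLOYMENT="${DEPLOYMENT:-service-instance_EXAMPLE}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

inspect_cluster() {
  # Instance and process state for every VM in the cluster deployment.
  run bosh -d "$DEPLOYMENT" instances --ps
  # Recently finished tasks, including failed ones.
  run bosh tasks --recent
}

inspect_cluster
```

Instances reported as `failing` or `unresponsive agent`, together with failed tasks against the master instances, match the situation described here.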

Cause

A command or task, such as `tkgi rotate-certificates --only-nsx`, was run while a master VM was unavailable, for example because of a network blip. Subsequent BOSH commands such as `cck`, `recreate`, or `deploy` could then have updated one of the master nodes with a different certificate, leaving it out of sync with the other master nodes.

This leaves the cluster in an unhealthy state because the etcd jobs on the master VMs cannot connect to each other.

 

Resolution

  1. Obtain the deployment manifest of the cluster with `bosh -d DEPLOYMENT manifest > FILE`.
  2. Run `bosh deploy` for the cluster with that manifest. This is expected to fail at the first master instance, because etcd on that instance cannot connect to the other etcd members to establish quorum.
  3. Run `bosh ignore` on the first master instance, then run `bosh deploy` again so that the second master instance can be recreated. This step should succeed because etcd on the second master instance can connect to etcd on the first master instance to establish quorum. The deployment then continues until the whole cluster deployment finishes.
  4. Run `bosh unignore` on the first master instance that was ignored.
  5. Run `tkgi upgrade-cluster` so that the TKGI API reflects the successful state. Otherwise the cluster still shows as "failed" when checking cluster status with `tkgi clusters`.
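
The steps above can be sketched as a single command sequence. This is an illustration under stated assumptions, not a supported script: the deployment name, master instance ID, manifest file name, and cluster name are placeholders, and `run` and `recover_masters` are hypothetical helper names. By default the script only prints the commands (`DRY_RUN=1`) so that the sequence can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch only: prints the recovery sequence; set DRY_RUN=0 to execute it.
# All values below are placeholders, not names taken from this article.
DEPLOYMENT="${DEPLOYMENT:-service-instance_EXAMPLE}"
FIRST_MASTER="${FIRST_MASTER:-master/EXAMPLE-INSTANCE-ID}"
MANIFEST="${MANIFEST:-cluster-manifest.yml}"
CLUSTER="${CLUSTER:-example-cluster}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover_masters() {
  # Step 1: save the current deployment manifest.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ bosh -d $DEPLOYMENT manifest > $MANIFEST"
  else
    bosh -d "$DEPLOYMENT" manifest > "$MANIFEST" || return 1
  fi
  # Step 2: this deploy is expected to fail at the first master instance.
  run bosh -d "$DEPLOYMENT" deploy "$MANIFEST" || echo "first deploy failed (expected)"
  # Step 3: ignore the first master, then deploy again.
  run bosh -d "$DEPLOYMENT" ignore "$FIRST_MASTER" || return 1
  run bosh -d "$DEPLOYMENT" deploy "$MANIFEST" || return 1
  # Step 4: stop ignoring the first master.
  run bosh -d "$DEPLOYMENT" unignore "$FIRST_MASTER" || return 1
  # Step 5: refresh the cluster state in the TKGI API, then verify it.
  run tkgi upgrade-cluster "$CLUSTER" || return 1
  run tkgi clusters
}

recover_masters
```

The instance ID for `FIRST_MASTER` can be read from the output of `bosh -d DEPLOYMENT instances`. When run with `DRY_RUN=0`, `bosh deploy` and `tkgi upgrade-cluster` may prompt for confirmation before continuing.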