A command or task, such as 'tkgi rotate-certificates --only-nsx', was run while a Master VM was unavailable. The VM could have been unavailable because of a network blip or some other reason. Subsequent bosh commands such as cck, recreate, or deploy could then have updated one of the master nodes with a different certificate, leaving it out of sync with the other master nodes.
This leaves the cluster in an unhealthy state because the etcd jobs on the Master VMs fail to connect to each other.
1.) Run a bosh deploy on the cluster. This is expected to fail. Use the Master VM on which the deploy fails as the 'first Master' VM in step 2 below.
2.) Ignore the 'first Master' VM with 'bosh ignore' and re-deploy with the command below, which starts by updating the second Master VM. This step is also expected to fail, because the second etcd instance may have trouble connecting to the third etcd instance.
bosh -d <DEPLOYMENT-NAME> deploy <INSTANCE-NAME.yml> --fix
3.) The bosh deploy task fails again, because the second Master VM's etcd instance cannot connect to the third Master VM's etcd instance.
4.) 'bosh ignore' the second Master VM and run the bosh deploy --fix command again. This should bring all Master VMs and workers back to a healthy state.
5.) 'bosh unignore' the two Master VMs that were ignored.
6.) Update the TKGi API so that it reports success by running 'tkgi upgrade-cluster'. Otherwise the cluster will still show as "failed" when its status is checked with 'tkgi clusters'.
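
The steps above can be sketched as a single shell script. This is only an illustration under stated assumptions, not a supported tool: the deployment name, manifest file, instance names ("master/FIRST-ID", "master/SECOND-ID"), and cluster name are hypothetical placeholders that must be replaced with values from your environment, and the run and recover helpers are invented for this sketch. By default it only prints the commands (DRY_RUN=1) so the sequence can be reviewed before anything is executed.

```shell
#!/usr/bin/env bash
# Sketch of steps 1-6. Every value below is a placeholder (assumption).
set -u

DEPLOYMENT="${DEPLOYMENT:-service-instance_EXAMPLE}"  # hypothetical BOSH deployment name
MANIFEST="${MANIFEST:-cluster-manifest.yml}"          # hypothetical manifest file used for the deploy
FIRST_MASTER="${FIRST_MASTER:-master/FIRST-ID}"       # Master VM that failed in step 1
SECOND_MASTER="${SECOND_MASTER:-master/SECOND-ID}"    # Master VM that failed in step 2
CLUSTER="${CLUSTER:-my-cluster}"                      # hypothetical TKGI cluster name
DRY_RUN="${DRY_RUN:-1}"                               # 1 = print only, 0 = execute

run() {
  # Print the command; execute it only when DRY_RUN is 0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

recover() {
  # Step 1: initial deploy, expected to fail.
  run bosh -d "$DEPLOYMENT" deploy "$MANIFEST" || true
  # Step 2: ignore the first Master VM and deploy again (expected to fail, see step 3).
  run bosh -d "$DEPLOYMENT" ignore "$FIRST_MASTER"
  run bosh -d "$DEPLOYMENT" deploy "$MANIFEST" --fix || true
  # Step 4: ignore the second Master VM; this deploy should succeed.
  run bosh -d "$DEPLOYMENT" ignore "$SECOND_MASTER"
  run bosh -d "$DEPLOYMENT" deploy "$MANIFEST" --fix || {
    echo "deploy failed; leaving instances ignored for investigation" >&2
    return 1
  }
  # Step 5: stop ignoring both Master VMs.
  run bosh -d "$DEPLOYMENT" unignore "$FIRST_MASTER"
  run bosh -d "$DEPLOYMENT" unignore "$SECOND_MASTER"
  # Step 6: refresh the status reported by the TKGi API, then check it.
  run tkgi upgrade-cluster "$CLUSTER"
  run tkgi clusters
}

recover
```

Running the script as is prints the nine commands in order. Setting DRY_RUN=0 executes them, in which case bosh and tkgi will still prompt for confirmation, which is a reasonable safeguard for a recovery procedure like this one.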