Recover TKGI worker after VM is manually deleted from vCenter along with its disks

Article ID: 417305

Products

VMware Tanzu Kubernetes Grid Integrated Edition
VMware Tanzu Kubernetes Grid Integrated Edition (Core)
VMware Tanzu Kubernetes Grid Integrated Edition Starter Pack (Core)

Issue/Introduction

If a worker VM is deleted directly from vCenter together with its disks, the TKGI worker goes into the "unresponsive agent" state and BOSH is unable to recover it.

BOSH reports an error similar to the following:

{"time":1753437838,"error":{"code":450002,"message":"Timed out sending ''get_state'' to instance: ''worker/XXXXXXX'', agent-id: ''36785af9-407a-45f4-bf3d-39557fc8fe88'' after 45 seconds"}}
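To find which instance is affected, the "unresponsive agent" state can be filtered out of the `bosh vms` listing. The helper below is a minimal sketch, not part of the BOSH CLI: the function name is hypothetical, and it assumes the instance name is the first column of each row.

```shell
# Hypothetical helper: list instances whose agent BOSH reports as unresponsive.
# Reads `bosh vms`-style rows on stdin and prints the first column
# (assumed to be the instance name, e.g. worker/<id>).
list_unresponsive_instances() {
  awk '/unresponsive agent/ { print $1 }'
}

# Example (the deployment name is a placeholder):
# bosh -d service-instance_XXXXXXXX vms | list_unresponsive_instances
```

An empty result would suggest that no agent is currently reported as unresponsive in that deployment.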

Environment

TKGI v1.x

Resolution

Recovery steps:

Before starting, turn off BOSH resurrection:

bosh update-resurrection off

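1) Delete the VM and disk references using cck. When prompted, choose "Delete VM reference" for the missing VM and "Delete disk reference" for the missing disk: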

bosh cck -d service-instance_XXXXXXXX
Using environment 'X.X.X.X' as client 'ops_manager'

Using deployment 'service-instance_XXXXXXXXX'

Task 167

Task 167 | 21:53:27 | Scanning 8 VMs: Checking VM states (00:00:21)
Task 167 | 21:53:48 | Scanning 8 VMs: 7 OK, 0 unresponsive, 1 missing, 0 unbound (00:00:00)
Task 167 | 21:53:48 | Scanning 8 persistent disks: Looking for inactive disks (00:00:42)
Task 167 | 21:54:30 | Scanning 8 persistent disks: 7 OK, 1 missing, 0 inactive, 0 mount-info mismatch (00:00:00)

Task 167 Started  Sat Sep 10 21:53:27 UTC 2022
Task 167 Finished Sat Sep 10 21:54:30 UTC 2022
Task 167 Duration 00:01:03
Task 167 done

#  Type          Description
1  missing_vm    VM for 'worker/XXXXXXXXXXXXXXX (1)' with cloud ID 'vm-XXXXXXXXXXXXX' missing.
2  missing_disk  Disk 'disk-XXXXXXXXXX' (worker/XXXXXXXXXXXXXX, 20480M) is missing

2 problems

1: Skip for now
2: Recreate VM without waiting for processes to start
3: Recreate VM and wait for processes to start
4: Delete VM reference
VM for 'worker/XXXXXXXXXXXXXXX (1)' with cloud ID 'vm-XXXXXXXXXXXX' missing. (1): 4

1: Skip for now
2: Delete disk reference (DANGEROUS!)
Disk 'disk-XXXXXXXXXXX' (worker/XXXXXXXXXXXXXXX 20480M) is missing (1): 2

Continue? [yN]: y

 

2) Allow the BOSH task to finish deleting the VM and disk references.

3) Fetch the TKGI service-instance manifest:

bosh manifest -d service-instance_XXXXXXXXXXXXXX > service-instance.yaml

4) To recover the missing worker, deploy the manifest again:

bosh -d service-instance_XXXXXXXX deploy service-instance.yaml

5) Verify that the service instance has the correct number of workers in the "running" state.

6) If all workers are up and running, turn BOSH resurrection back on:

bosh update-resurrection on
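The check in step 5 can be done from the command line. The following is a minimal sketch, not an official tool: the function name is hypothetical, and it assumes that `bosh vms` prints one row per instance with the instance name in the first column and the process state in the second.

```shell
# Hypothetical helper: count worker instances reported as "running".
# Reads `bosh vms`-style rows on stdin; assumes column 1 is the instance
# name (worker/<id>) and column 2 is the process state.
count_running_workers() {
  awk '$1 ~ /^worker\// && $2 == "running" { n++ } END { print n + 0 }'
}

# Example (the deployment name is a placeholder):
# bosh -d service-instance_XXXXXXXX vms | count_running_workers
```

Compare the printed number with the worker count expected for the cluster before turning resurrection back on.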

 

 

Note: These recovery steps should only be used for worker nodes. The BOSH persistent disk on a worker node holds only pod ephemeral data and container images. Applications can regenerate ephemeral data when pods are scheduled on a worker, and container images can be fetched again in the same way.
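If the procedure is scripted, a guard can refuse to run it against anything other than a worker. This is a minimal sketch, assuming instance names follow the `worker/<id>` form shown in the cck output; the function name is hypothetical.

```shell
# Hypothetical guard: succeed only when the instance name belongs to the
# worker instance group (assumed form: worker/<id>).
is_worker_instance() {
  case "$1" in
    worker/*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Example:
# is_worker_instance "worker/XXXXXXX" || { echo "not a worker, aborting" >&2; exit 1; }
```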