If a worker VM is deleted directly from vCenter along with its disk, the TKGI worker VM goes into the "unresponsive agent" state, and BOSH will not be able to recover it.
BOSH then reports an error similar to the following:
{"time":1753437838,"error":{"code":450002,"message":"Timed out sending 'get_state' to instance: 'worker/XXXXXXX', agent-id: '36785af9-407a-45f4-bf3d-39557fc8fe88' after 45 seconds"}}
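To confirm which worker is affected, one option is to list the deployment's VMs and filter on process state. The helper below is a minimal sketch, not part of the original procedure. The function name list_unhealthy_workers is hypothetical, and the sketch assumes that the first column of the `bosh vms` output is the instance name and the second is the process state.

```shell
# Print worker instances whose process state is not "running".
# Reads the table printed by `bosh -d <deployment> vms` on stdin.
# Assumption: column 1 = instance name, column 2 = process state.
# A state such as "unresponsive agent" is split on whitespace,
# so only its first word ("unresponsive") is printed.
list_unhealthy_workers() {
  awk '$1 ~ /^worker\// && $2 != "running" { print $1, $2 }'
}

# Usage (the deployment name is a placeholder):
#   bosh -d service-instance_XXXXXXXX vms | list_unhealthy_workers
```

An empty result means every worker row reported "running"; any printed row is a candidate for the recovery steps that follow.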
TKGI v1.x
Recovery Steps:
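1) Turn off BOSH resurrection:
bosh update-resurrection off
Then delete the missing VM and disk references using cck. At the prompts, choose option 4 (Delete VM reference) for the missing VM and option 2 (Delete disk reference) for the missing disk: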
bosh cck -d service-instance_XXXXXXXX
Using environment 'X.X.X.X' as client 'ops_manager'
Using deployment 'service-instance_XXXXXXXXX'
Task 167
Task 167 | 21:53:27 | Scanning 8 VMs: Checking VM states (00:00:21)
Task 167 | 21:53:48 | Scanning 8 VMs: 7 OK, 0 unresponsive, 1 missing, 0 unbound (00:00:00)
Task 167 | 21:53:48 | Scanning 8 persistent disks: Looking for inactive disks (00:00:42)
Task 167 | 21:54:30 | Scanning 8 persistent disks: 7 OK, 1 missing, 0 inactive, 0 mount-info mismatch (00:00:00)
Task 167 Started Sat Sep 10 21:53:27 UTC 2022
Task 167 Finished Sat Sep 10 21:54:30 UTC 2022
Task 167 Duration 00:01:03
Task 167 done
#  Type          Description
1  missing_vm    VM for 'worker/XXXXXXXXXXXXXXX (1)' with cloud ID 'vm-XXXXXXXXXXXXX' missing.
2  missing_disk  Disk 'disk-XXXXXXXXXX' (worker/XXXXXXXXXXXXXX, 20480M) is missing

2 problems
1: Skip for now
2: Recreate VM without waiting for processes to start
3: Recreate VM and wait for processes to start
4: Delete VM reference
VM for 'worker/XXXXXXXXXXXXXXX (1)' with cloud ID 'vm-XXXXXXXXXXXX' missing. (1): 4
1: Skip for now
2: Delete disk reference (DANGEROUS!)
Disk 'disk-XXXXXXXXXXX' (worker/XXXXXXXXXXXXXXX 20480M) is missing (1): 2
Continue? [yN]: y
2) Allow the BOSH task to finish deleting the VM and disk references.
3) Fetch the TKGI service-instance manifest.
bosh manifest -d service-instance_XXXXXXXXXXXXXX > service-instance.yaml
4) To recover the missing worker, deploy the manifest again.
bosh -d service-instance_XXXXXXXX deploy service-instance.yaml
5) Check that the service instance has the correct number of workers in the "Running" state.
6) Once all workers are up and running, turn BOSH resurrection back on.
bosh update-resurrection on
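The check in step 5 can be sketched as a small shell helper that counts running workers before resurrection is turned back on. This is an illustration under assumptions rather than an official procedure: the function names count_running_workers and check_worker_count are hypothetical, the expected worker count must be supplied by the operator, and the sketch assumes that the first column of the `bosh vms` output is the instance name and the second is the process state.

```shell
# Count worker instances whose process state is "running".
# Reads the table printed by `bosh -d <deployment> vms` on stdin.
# Assumption: column 1 = instance name, column 2 = process state.
count_running_workers() {
  awk '$1 ~ /^worker\// && $2 == "running" { n++ } END { print n + 0 }'
}

# Compare the running-worker count with the expected count ($1).
# Returns 0 when they match and 1 otherwise.
check_worker_count() {
  expected_workers="$1"
  actual_workers="$(count_running_workers)"
  if [ "$actual_workers" -eq "$expected_workers" ]; then
    echo "OK: $actual_workers/$expected_workers workers running"
  else
    echo "NOT READY: $actual_workers/$expected_workers workers running" >&2
    return 1
  fi
}

# Usage (deployment name and worker count are placeholders):
#   bosh -d service-instance_XXXXXXXX vms | check_worker_count 3 \
#     && bosh update-resurrection on
```

Chaining the check with `&&` keeps resurrection off until the worker count matches, which mirrors the order of steps 5 and 6.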
Note: Use these recovery steps only for worker nodes. The BOSH persistent disk on a worker node holds only pod ephemeral data and container images. Applications can regenerate ephemeral data when pods are scheduled on the worker, and container images can be fetched again at the same time.