TKGi Recreate a new Worker node with fresh Persistent Disk



Article ID: 379130


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Scenario

When the IaaS layer critically damages a TKGi worker node's persistent disk, error messages like the following appear when the affected worker node is powered on.

bosh task XXXX --cpi

E, [2024-10-07T04:27:17.392073 #4002] ERROR -- [req_id cpi-882029]: Worker thread raised exception: Successfully found disk '#<VSphereCloud::Resources::PersistentDisk:0x00007ff6b80c6960>' (this is not an error) - /var/vcap/data/packages/vsphere_cpi/1bac8787de02268e969811f0df90c0ad2b6b963f/lib/cloud/vsphere/resources/datacenter.rb:200:in `block (2 levels) in find_disk_cid_in_datastores'
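If the failing task ID is not yet known, it can usually be found in the recent task list before inspecting the CPI log. The sketch below is only an assumed workflow; XXXX is a placeholder for the task ID found in the list.

# List recent BOSH tasks and note the ID of the failed one
bosh tasks --recent=30

# Inspect the CPI log of that task (XXXX is the task ID found above)
bosh task XXXX --cpi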

vCenter UI Task Message 1

Failed to lock the file Failed to start the virtual machine. Module Disk power on failed. Cannot open the disk '/vmfs/volumes/vsan:52b67c27d0a854e2-1f2b2043cb8e4252/cb7ed15f-367f-1011-406a-e4434b2fb688/disk-e63e72cc-2e4d-448a-8133-11366acf89d1.vmdk' or one of the snapshot disks it depends on.

vCenter UI Task Message 2

Error caused by file /vmfs/volumes/vsan:52a92eec7ea79648-2d33ecdf70afd95d/d92f3a65-0e40-1565-b3a0-78ac44a96294/disk-e375047f-1aa2-4e24-949c-e0447063dcc2.vmdk

vCenter UI Task Message 3

Failed to add disk scsi0:3. Failed to lock the file Cannot open the disk '/vmfs/volumes/vsan:52b67c27d0a854e2-1f2b2043cb8e4252/cb7ed15f-367f-1011-406a-e4434b2fb688/disk-4a6dbd67-8ac1-433d-80b2-604d573a45f7.vmdk' or one of the snapshot disks it depends on. Failed to power on scsi0:3.
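The vmdk file names in the vCenter messages (for example disk-e63e72cc-...) correspond to BOSH persistent disk CIDs, which helps identify the affected worker instance. A minimal sketch, assuming the deployment name (set in step 1 below) is already known:

# Hypothetical example: match the disk CID from the vCenter error against the deployment's instances
bosh -d ${SERVICE_INSTANCE} instances --details | grep disk-e63e72cc-2e4d-448a-8133-11366acf89d1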

 

Notification

Use this KB only as a last resort after all other recovery attempts have failed.

This procedure targets worker nodes only. Do not apply it to a master node.



Environment

All versions of VMware Tanzu Kubernetes Grid Integrated Edition.

Cause

Persistent disk corruption can occur due to problems in the underlying IaaS.

Resolution

Recreate the worker node with a new persistent disk.

 

1. Preparation

# Resurrection: OFF
bosh update-resurrection off 
bosh curl /resurrection

# Set the parameters
bosh vms
SERVICE_INSTANCE=service-instance_e1849014-e334-42b2-81c9-xxxxxxxxxxxx

# Target a worker node only. Do not set these variables to a master node's CIDs
bosh -d ${SERVICE_INSTANCE} is --details --column=Instance --column=Index --column='Process State' --column='Disk CIDs' --column='VM CID'

VM_CID=vm-b4dea926-ec08-408d-a219-xxxxxxx
DISK_CID=disk-f96d5e9e-3572-47ef-8c40-xxxxxx


2. Delete the target worker node

# Delete the Worker node
bosh -d ${SERVICE_INSTANCE} delete-vm ${VM_CID}

# Orphan (detach) the persistent disk
bosh -d ${SERVICE_INSTANCE} orphan-disk ${DISK_CID}

# Check
bosh -d ${SERVICE_INSTANCE} is --details --column=Instance --column=Index --column='Process State' --column='Disk CIDs' --column='VM CID'
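After orphaning, the disk should appear in the director's orphaned disk list rather than on the instance. A quick check:

# The detached disk should now be listed as orphaned
bosh disks --orphaned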

 

3. Recreate a new worker node with a new persistent disk

bosh -d ${SERVICE_INSTANCE} manifest > ${SERVICE_INSTANCE}.yaml
bosh -d ${SERVICE_INSTANCE} deploy ${SERVICE_INSTANCE}.yaml --fix --skip-drain
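The --fix flag recreates the missing VM, and because its old persistent disk was orphaned, BOSH attaches a freshly created disk. Confirm that the worker now shows a new Disk CID, for example:

# The worker should come back with a Disk CID different from ${DISK_CID}
bosh -d ${SERVICE_INSTANCE} is --details --column=Instance --column='Process State' --column='Disk CIDs' --column='VM CID'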



4. Closing steps

# Check
bosh -d ${SERVICE_INSTANCE} vms

# Resurrection: ON
bosh update-resurrection on
bosh curl /resurrection
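As a final check, the recreated worker should rejoin the Kubernetes cluster and report Ready. A sketch, assuming the TKGi CLI and kubectl are available; CLUSTER_NAME is a placeholder for the affected cluster.

# Fetch the cluster credentials and confirm the recreated worker node is Ready
tkgi get-credentials CLUSTER_NAME
kubectl get nodes -o wide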

Additional Information