Scenario
When the IaaS layer critically damages a TKGi worker node's persistent disk, you may see the following error messages when the affected worker node is powered on.
bosh task XXXX --cpi
E, [2024-10-07T04:27:17.392073 #4002] ERROR -- [req_id cpi-882029]: Worker thread raised exception: Successfully found disk '#<VSphereCloud::Resources::PersistentDisk:0x00007ff6b80c6960>' (this is not an error) - /var/vcap/data/packages/vsphere_cpi/1bac8787de02268e969811f0df90c0ad2b6b963f/lib/cloud/vsphere/resources/datacenter.rb:200:in `block (2 levels) in find_disk_cid_in_datastores'
vCenter UI Task Message 1
Failed to lock the file Failed to start the virtual machine. Module Disk power on failed. Cannot open the disk '/vmfs/volumes/vsan:52b67c27d0a854e2-1f2b2043cb8e4252/cb7ed15f-367f-1011-406a-e4434b2fb688/disk-e63e72cc-2e4d-448a-8133-11366acf89d1.vmdk' or one of the snapshot disks it depends on.
vCenter UI Task Message 2
Error caused by file /vmfs/volumes/vsan:52a92eec7ea79648-2d33ecdf70afd95d/d92f3a65-0e40-1565-b3a0-78ac44a96294/disk-e375047f-1aa2-4e24-949c-e0447063dcc2.vmdk
vCenter UI Task Message 3
Failed to add disk scsi0:3. Failed to lock the file Cannot open the disk '/vmfs/volumes/vsan:52b67c27d0a854e2-1f2b2043cb8e4252/cb7ed15f-367f-1011-406a-e4434b2fb688/disk-4a6dbd67-8ac1-433d-80b2-604d573a45f7.vmdk' or one of the snapshot disks it depends on. Failed to power on scsi0:3.
Notification
This procedure targets worker nodes only. Do not apply it to a master node.
All Versions of VMware Tanzu Kubernetes Grid Integrated Edition.
Persistent disk corruption can occur due to underlying IaaS issues.
The resolution is to recreate the worker node with a new persistent disk.
1. Preparation
# Resurrection: OFF
bosh update-resurrection off
bosh curl /resurrection
# Set the parameters
bosh vms
SERVICE_INSTANCE=service-instance_e1849014-e334-42b2-81c9-xxxxxxxxxxxx
# Target is the worker node only. Do not select a master node.
bosh -d ${SERVICE_INSTANCE} is --details --column=Instance --column=Index --column='Process State' --column='Disk CIDs' --column='VM CID'
VM_CID=vm-b4dea926-ec08-408d-a219-xxxxxxx
DISK_CID=disk-f96d5e9e-3572-47ef-8c40-xxxxxx
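If you prefer not to copy the CIDs by hand, the `Disk CIDs` and `VM CID` columns can be pulled out of the table programmatically. This is a minimal sketch, not part of the official procedure: it assumes the default five-column table layout shown above (one-word `Process State` values), and the instance name used here is hypothetical sample data, not real output.

```shell
# Sketch: extract the 'Disk CIDs' and 'VM CID' columns for one instance
# from a saved `bosh is --details` table. Assumes whitespace-separated
# columns in the order Instance / Index / Process State / Disk CIDs / VM CID,
# with a one-word process state (e.g. "failing").
extract_cids() {
  # $1 = instance name prefix; table is read from stdin
  awk -v inst="$1" 'index($1, inst) == 1 { print $4, $5 }'
}

# Hypothetical sample table for illustration only:
cat <<'EOF' | extract_cids "worker/0c6e"
Instance          Index  Process State  Disk CIDs                Fix VM CID
worker/0c6e-aaaa  0      failing        disk-f96d5e9e-sample     vm-b4dea926-sample
EOF
```

You can then assign the printed values to `DISK_CID` and `VM_CID` instead of typing them manually.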
2. Delete the target worker node
# Delete the Worker node
bosh -d ${SERVICE_INSTANCE} delete-vm ${VM_CID}
# Orphan (detach) the persistent disk
bosh -d ${SERVICE_INSTANCE} orphan-disk ${DISK_CID}
# Check
bosh -d ${SERVICE_INSTANCE} is --details --column=Instance --column=Index --column='Process State' --column='Disk CIDs' --column='VM CID'
3. Recreate the worker node with a new persistent disk
bosh -d ${SERVICE_INSTANCE} manifest > ${SERVICE_INSTANCE}.yaml
bosh -d ${SERVICE_INSTANCE} deploy ${SERVICE_INSTANCE}.yaml --fix --skip-drain
4. Closing steps
# Check
bosh -d ${SERVICE_INSTANCE} vms
# Resurrection: ON
bosh update-resurrection on
bosh curl /resurrection
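For repeated use, steps 1 through 4 above can be collected into one helper function. This is an illustrative sketch, not an official TKGi tool: it assumes an authenticated `bosh` CLI on the PATH, and the `DRY_RUN` switch and `recreate_worker` name are inventions of this sketch. Review the echoed commands before running for real.

```shell
# Hypothetical wrapper around steps 1-4 of the procedure above.
# DRY_RUN=1 prints each bosh command instead of executing it.
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

recreate_worker() {
  si=$1 vm=$2 disk=$3
  run bosh update-resurrection off                         # 1. preparation
  run bosh -d "$si" delete-vm "$vm"                        # 2. delete the worker VM
  run bosh -d "$si" orphan-disk "$disk"                    #    orphan the persistent disk
  run bosh -d "$si" manifest > "$si.yaml"                  # 3. save the manifest
  run bosh -d "$si" deploy "$si.yaml" --fix --skip-drain   #    recreate the node
  run bosh update-resurrection on                          # 4. closing steps
}

# Usage (dry run first):
#   DRY_RUN=1 recreate_worker "$SERVICE_INSTANCE" "$VM_CID" "$DISK_CID"
```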