During normal operation of a single cluster, one of the worker nodes crashes without any visible reason. After a restart of the worker node the problem is fixed for a short period of time, usually a day, and then occurs again.
bosh -d <SI-UUID> restart <WORKER/Index> only stops and starts the services; it does not reboot the OS, so the uptime will not change. The problem can be related to the persistent volume, with possible causes:
Disk full or filesystem corrupted
Because the persistent volume is preserved during a VM reboot or recreation, the above procedure will not solve the issue.
bosh -d <SI-UUID> recreate <WORKER/Index> --fix will recreate the VM but will not touch the persistent volume either.
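Since a full persistent disk is one of the listed causes, it is worth checking usage before deciding between a plain recreate and a full rotation. A minimal sketch, assuming the worker's persistent volume is mounted at the usual BOSH location /var/vcap/store and an assumed 90% alert threshold; fetch the line with, e.g., bosh -d <SI-UUID> ssh <WORKER/Index> -c 'df -h /var/vcap/store' and pipe it into the helper:

```shell
#!/bin/sh
# Hypothetical helper: reads `df` output on stdin and reports whether the
# persistent store (/var/vcap/store) is at or above 90% usage.
check_store_usage() {
  awk '$NF == "/var/vcap/store" {
    gsub(/%/, "", $(NF-1));                     # strip the % sign from Use%
    print ($(NF-1) + 0 >= 90) ? "FULL" : "OK"   # +0 forces numeric compare
  }'
}
```

If this prints FULL, freeing space on the volume may resolve the crashes without rotating the disk at all.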
TKGi 1.18.x
The problem can be related to two possible issues:
1. OS-related issue, which can be fixed with the bosh recreate command above
2. Persistent volume related issue, where additional steps are required for the complete recreation of the VM
To completely rotate all components of a worker (VM and persistent volume):
1. Take a backup of the current manifest:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d manifest > service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d.yaml
2. Hard stop the VM. This command will drain the worker node and delete the selected VM:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d stop --hard worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557
3. Examine the instances, confirm the VM is deleted, and note the Disk CID of the stopped worker from the Disk CIDs column of the output:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d is --details
Deployment 'service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d'
Instance Process State AZ IPs Deployment State VM CID VM Type Disk CIDs Agent ID Index Bootstrap Ignore
apply-addons/2b6010e8-0a67-4ab3-a833-ed0ec29ecd48 - az1 - service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d started - micro - - 0 true false
master/091a3f01-c189-4da5-bdc0-e920319b743e running az1 XX.XX.XX.XX service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d started vm-6a1324ea-4b43-465c-9850-72a46ae617c6 xlarge disk-6d1a9626-7c4f-459b-835b-687f45a38966 055b7f25-e67c-4875-8cd8-bd0535067e28 1 false false
worker/4f939862-4bce-486c-a395-37314e704fe2 running az1 XX.XX.XX.XX service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d started vm-48457ed5-7390-480f-95af-a871d0c83a50 xlarge.disk disk-15d3c78d-bc92-4fa4-ad83-8c8d63a8b70e a13f644b-6ca6-42ee-b71d-dac03ca53010 3 false false
worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557 - az3 - service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d detached - xlarge.disk disk-c776e188-d598-4af5-8a20-5f410d0b42bc - 2 false false
9 instances
Succeeded
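Instead of copying the Disk CID by hand from the output above, it can be extracted in a script. A minimal sketch with a hypothetical helper; it scans the detached instance's row for the field starting with "disk-", so it does not depend on exact column positions:

```shell
#!/bin/sh
# Hypothetical helper: prints the Disk CID of a detached instance.
# $1: instance name (e.g. worker/<GUID>); stdin: `bosh is --details` output.
disk_cid_of() {
  awk -v inst="$1" '$1 == inst && /detached/ {
    for (i = 1; i <= NF; i++) if ($i ~ /^disk-/) print $i
  }'
}
```

Usage: bosh -d <SI-UUID> is --details | disk_cid_of worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557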
4. Orphan the disk associated with the VM:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d orphan-disk disk-c776e188-d598-4af5-8a20-5f410d0b42bc
5. Confirm the disk is orphaned:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d disks --orphaned
Disk CID Size Deployment Instance AZ Orphaned At
disk-c776e188-d598-4af5-8a20-5f410d0b42bc 75 GiB service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557 az3 Tue May 21 08:46:47 UTC 2024
1 disks
Succeeded
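For automation, step 5 can be reduced to a pass/fail check. A minimal sketch with a hypothetical helper; pipe the output of bosh -d <SI-UUID> disks --orphaned into it with the Disk CID from step 3:

```shell
#!/bin/sh
# Hypothetical helper: reports whether a given Disk CID appears in the
# orphaned-disks listing read on stdin. $1: the Disk CID to look for.
is_orphaned() {
  if grep -q -- "$1"; then
    echo "orphaned: $1"
  else
    echo "NOT orphaned: $1"
  fi
}
```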
6. Run bosh start to recreate the VM and the disk:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d start worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557
7. After the above task completes, verify that a new Disk CID and VM CID have been created:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d is --details
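The final verification can be made explicit by comparing the Disk CID saved in step 3 against the one shown after bosh start. A minimal sketch with a hypothetical helper; both arguments are CIDs taken from `bosh is --details` output:

```shell
#!/bin/sh
# Hypothetical helper: confirms the worker received a fresh persistent disk.
# $1: old Disk CID (from before orphaning), $2: current Disk CID.
verify_rotation() {
  if [ -n "$2" ] && [ "$1" != "$2" ]; then
    echo "rotated: $1 -> $2"
  else
    echo "NOT rotated: disk CID unchanged or missing"
  fi
}
```

If the CIDs match, the disk was not rotated and the orphaning steps above should be re-checked.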