During normal operation of a single cluster, one of the worker nodes crashes without any visible reason. After a restart of the worker node the problem is fixed for a short period of time, usually a day, and then occurs again.
bosh -d <SI-UUID> restart <WORKER/Index> only stops and starts the services; it does not reboot the OS, so the uptime will not change. The problem can be related to the persistent volume, with possible causes:
Disk full or filesystem corrupted
Because the persistent volume is preserved during a VM reboot or recreation, the above procedure will not solve the issue.
bosh -d <SI-UUID> recreate <WORKER/Index> --fix will recreate the VM but will not touch the persistent volume either.
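Since a full persistent disk is one of the listed causes, it is worth checking usage before deciding between a plain recreate and a full rotation. A minimal sketch, assuming the worker's persistent volume is mounted at the usual BOSH location /var/vcap/store and an assumed 90% alert threshold; fetch the line with, e.g., bosh -d <SI-UUID> ssh <WORKER/Index> -c 'df -h /var/vcap/store' and pipe it into the helper:

```shell
#!/bin/sh
# Hypothetical helper: reads `df` output on stdin and reports whether the
# persistent store (/var/vcap/store) is at or above 90% usage.
check_store_usage() {
  awk '$NF == "/var/vcap/store" {
    gsub(/%/, "", $(NF-1));                     # strip the % sign from Use%
    print ($(NF-1) + 0 >= 90) ? "FULL" : "OK"   # +0 forces numeric compare
  }'
}
```

If this prints FULL, freeing space on the volume may resolve the crashes without rotating the disk at all.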
TKGi 1.18.x
The problem can be related to two possible issues:
1. OS-related issue, which can be fixed with the bosh recreate command above
2. Persistent volume related issue, where additional steps are required for the complete recreation of the VM
To completely rotate all components of a worker (VM and persistent volume):
1. Take a backup of the current manifest:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d manifest > service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d.yaml
2. Hard stop the VM. This command will drain the worker node and delete the selected VM:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d stop --hard worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557
3. Examine the instances, confirm the VM is deleted, and note the Disk CID of the stopped worker from the Disk CIDs column of the output:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d is --details
Deployment 'service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d'
Instance Process State AZ IPs Deployment State VM CID VM Type Disk CIDs Agent ID Index Bootstrap Ignore
apply-addons/2b6010e8-0a67-4ab3-a833-ed0ec29ecd48 - az1 - service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d started - micro - - 0 true false
master/091a3f01-c189-4da5-bdc0-e920319b743e running az1 XX.XX.XX.XX service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d started vm-6a1324ea-4b43-465c-9850-72a46ae617c6 xlarge disk-6d1a9626-7c4f-459b-835b-687f45a38966 055b7f25-e67c-4875-8cd8-bd0535067e28 1 false false
worker/4f939862-4bce-486c-a395-37314e704fe2 running az1 XX.XX.XX.XX service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d started vm-48457ed5-7390-480f-95af-a871d0c83a50 xlarge.disk disk-15d3c78d-bc92-4fa4-ad83-8c8d63a8b70e a13f644b-6ca6-42ee-b71d-dac03ca53010 3 false false
worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557 - az3 - service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d detached - xlarge.disk disk-c776e188-d598-4af5-8a20-5f410d0b42bc - 2 false false
9 instances
Succeeded
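Instead of copying the Disk CID by hand from the output above, it can be extracted in a script. A minimal sketch with a hypothetical helper; it scans the detached instance's row for the field starting with "disk-", so it does not depend on exact column positions:

```shell
#!/bin/sh
# Hypothetical helper: prints the Disk CID of a detached instance.
# $1: instance name (e.g. worker/<GUID>); stdin: `bosh is --details` output.
disk_cid_of() {
  awk -v inst="$1" '$1 == inst && /detached/ {
    for (i = 1; i <= NF; i++) if ($i ~ /^disk-/) print $i
  }'
}
```

Usage: bosh -d <SI-UUID> is --details | disk_cid_of worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557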
4. Orphan the disk associated with the VM:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d orphan-disk disk-c776e188-d598-4af5-8a20-5f410d0b42bc
5. Confirm the disk is orphaned:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d disks --orphaned
Disk CID Size Deployment Instance AZ Orphaned At
disk-c776e188-d598-4af5-8a20-5f410d0b42bc 75 GiB service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557 az3 Tue May 21 08:46:47 UTC 2024
1 disks
Succeeded
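For automation, step 5 can be reduced to a pass/fail check. A minimal sketch with a hypothetical helper; pipe the output of bosh -d <SI-UUID> disks --orphaned into it with the Disk CID from step 3:

```shell
#!/bin/sh
# Hypothetical helper: reports whether a given Disk CID appears in the
# orphaned-disks listing read on stdin. $1: the Disk CID to look for.
is_orphaned() {
  if grep -q -- "$1"; then
    echo "orphaned: $1"
  else
    echo "NOT orphaned: $1"
  fi
}
```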
6. Run bosh start to recreate the VM and the disk:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d start worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557
7. After the above task completes, verify that a new Disk CID and VM CID have been created:
bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d is --details
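The final verification can be made explicit by comparing the Disk CID saved in step 3 against the one shown after bosh start. A minimal sketch with a hypothetical helper; both arguments are CIDs taken from `bosh is --details` output:

```shell
#!/bin/sh
# Hypothetical helper: confirms the worker received a fresh persistent disk.
# $1: old Disk CID (from before orphaning), $2: current Disk CID.
verify_rotation() {
  if [ -n "$2" ] && [ "$1" != "$2" ]; then
    echo "rotated: $1 -> $2"
  else
    echo "NOT rotated: disk CID unchanged or missing"
  fi
}
```

If the CIDs match, the disk was not rotated and the orphaning steps above should be re-checked.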