TKGI worker node fails to run workloads

Article ID: 368130

Products

VMware Tanzu Kubernetes Grid Integrated Edition
VMware Tanzu Kubernetes Grid Integrated (TKGi)
VMware Tanzu Kubernetes Grid Integrated Edition (Core)
VMware Tanzu Kubernetes Grid Integrated Edition 1.x
VMware Tanzu Kubernetes Grid Integrated Edition Starter Pack (Core)

Issue/Introduction

During normal operations on a single cluster, one of the worker nodes crashes for no visible reason.

Restarting the worker node fixes the problem only for a short time, usually about a day, after which it occurs again.

bosh -d <SI-UUID> restart <WORKER/Index> only stops and starts the services on the node. It does not reboot the OS, so the VM uptime does not change.

The problem may be related to the persistent volume. Possible causes:

Disk full or filesystem corrupted

Because the persistent volume is preserved across VM reboot and recreation, neither restarting nor recreating the VM resolves a persistent volume issue:

bosh -d <SI-UUID> recreate <WORKER/Index> --fix        <--- recreates the VM but does not touch the persistent volume either.
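
To tell whether the persistent volume is full, disk usage can be inspected on the affected worker. The following is a minimal sketch, not part of the original procedure. It assumes the persistent disk is mounted at /var/vcap/store (the usual BOSH location) and that `df -P` output is available, for example after logging in with `bosh -d <SI-UUID> ssh <WORKER/Index>`. The helper name `store_usage_pct` is hypothetical.

```shell
#!/bin/sh
# Print the usage percentage (without the % sign) of one mount point,
# reading `df -P` output on stdin.
# store_usage_pct is a hypothetical helper name used for illustration.
store_usage_pct() {
  mount_point="$1"
  # In `df -P` output the mount point is the last field and the
  # capacity (for example 100%) is the field just before it.
  awk -v mp="$mount_point" '
    $NF == mp {
      pct = $(NF - 1)
      sub(/%/, "", pct)
      print pct
    }'
}

# Example, run on the worker node (assumed mount point):
#   df -P /var/vcap/store | store_usage_pct /var/vcap/store
```

A value close to 100 points to the "disk full" cause. A corrupted filesystem will not necessarily show up in this check.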

Environment

TKGi 1.18.x 

Cause

The problem can have one of two causes:

1. An OS-related issue, which can be fixed with the bosh recreate command above.

2. A persistent volume issue, which requires the additional steps below to completely recreate the VM together with its persistent volume.

Resolution

To completely rotate all components of a worker (VM and persistent volume):

1. Take a backup of the current manifest:

bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d manifest > service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d.yaml

2. Hard stop the VM. This command drains the worker node and deletes the selected VM:

bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d stop --hard worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557

3. List the instances, confirm that the VM is deleted, and note the disk CID of the stopped worker from the Disk CIDs column (disk-c776e188-d598-4af5-8a20-5f410d0b42bc in this example):

bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d is --details

Deployment 'service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d'

Instance                                           Process State  AZ   IPs            Deployment                                             State     VM CID                                   VM Type      Disk CIDs                                  Agent ID                              Index  Bootstrap  Ignore  
apply-addons/2b6010e8-0a67-4ab3-a833-ed0ec29ecd48  -              az1  -              service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d  started   -                                        micro        -                                          -                                     0      true       false  
master/091a3f01-c189-4da5-bdc0-e920319b743e        running        az1  XX.XX.XX.XX  service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d  started   vm-6a1324ea-4b43-465c-9850-72a46ae617c6  xlarge       disk-6d1a9626-7c4f-459b-835b-687f45a38966  055b7f25-e67c-4875-8cd8-bd0535067e28  1      false      false  
worker/4f939862-4bce-486c-a395-37314e704fe2        running        az1  XX.XX.XX.XX  service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d  started   vm-48457ed5-7390-480f-95af-a871d0c83a50  xlarge.disk  disk-15d3c78d-bc92-4fa4-ad83-8c8d63a8b70e  a13f644b-6ca6-42ee-b71d-dac03ca53010  3      false      false  
worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557        -              az3  -              service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d  detached  -                                        xlarge.disk  disk-c776e188-d598-4af5-8a20-5f410d0b42bc  -                                     2      false      false  

9 instances

Succeeded
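
Picking the disk CID out of the wide table by eye is error-prone, so it can also be extracted with a small filter. This is a sketch under assumptions: it reads the plain-text table printed by `bosh is --details` on stdin, relies on disk CIDs starting with `disk-` as in the output above, and expects one persistent disk per instance. The helper name `disk_cid_for` is hypothetical and is not a bosh subcommand.

```shell
#!/bin/sh
# Print the persistent disk CID of one instance, reading the table
# printed by `bosh -d <deployment> is --details` on stdin.
# disk_cid_for is a hypothetical helper name used for illustration.
disk_cid_for() {
  instance="$1"
  awk -v inst="$instance" '
    # Match the row whose first column is the instance name, then
    # print every field that looks like a disk CID.
    $1 == inst {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^disk-/) print $i
    }'
}

# Example:
#   bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d is --details \
#     | disk_cid_for worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557
```

Matching on the `disk-` prefix rather than a fixed column number keeps the filter working if a column value contains spaces.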

4. Orphan the disk associated with the VM:

bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d orphan-disk disk-c776e188-d598-4af5-8a20-5f410d0b42bc

5. Confirm the disk is orphaned:

bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d disks --orphaned

Disk CID                                   Size    Deployment                                             Instance                                     AZ   Orphaned At  
disk-c776e188-d598-4af5-8a20-5f410d0b42bc  75 GiB  service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d  worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557  az3  Tue May 21 08:46:47 UTC 2024  

1 disks

Succeeded
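
The confirmation in step 5 can be turned into a pass/fail check before moving on to step 6. The sketch below assumes the plain-text table printed by `bosh disks --orphaned` is piped in on stdin with the disk CID in the first column, as in the output above. The helper name `is_orphaned` is hypothetical.

```shell
#!/bin/sh
# Exit 0 if the given disk CID appears in the first column of the
# `bosh disks --orphaned` table read from stdin, otherwise exit 1.
# is_orphaned is a hypothetical helper name used for illustration.
is_orphaned() {
  disk_cid="$1"
  awk -v cid="$disk_cid" '
    $1 == cid { found = 1 }
    END { exit found ? 0 : 1 }'
}

# Example:
#   bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d disks --orphaned \
#     | is_orphaned disk-c776e188-d598-4af5-8a20-5f410d0b42bc \
#     && echo "disk is orphaned, safe to continue"
```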

6. Run bosh start to recreate the VM and the disk:

bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d start worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557

7. After the task completes, verify that a new VM CID and a new disk CID have been created:

bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d is --details
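
The final verification can be expressed as a check that the worker row now has a VM CID and a disk CID different from the one orphaned in step 4. This is a sketch under assumptions: it reads the plain-text `bosh is --details` table on stdin and relies on the `vm-` and `disk-` CID prefixes seen in the sample output earlier in this article. The helper name `worker_rotated` is hypothetical.

```shell
#!/bin/sh
# Exit 0 if the instance row shows a VM CID and a disk CID that differs
# from the old (orphaned) disk CID, otherwise exit 1.
# Reads the `bosh -d <deployment> is --details` table on stdin.
# worker_rotated is a hypothetical helper name used for illustration.
worker_rotated() {
  instance="$1"
  old_disk="$2"
  awk -v inst="$instance" -v old="$old_disk" '
    $1 == inst {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^vm-/)   vm = $i
        if ($i ~ /^disk-/) disk = $i
      }
    }
    END { exit (vm != "" && disk != "" && disk != old) ? 0 : 1 }'
}

# Example:
#   bosh -d service-instance_79609ff9-0502-4abb-bc58-8e34e6c3c12d is --details \
#     | worker_rotated worker/ac2b31f8-59e1-4abb-bd93-036b1cf5a557 \
#         disk-c776e188-d598-4af5-8a20-5f410d0b42bc
```

The check fails while the instance is still detached, because the row then has no VM CID and still lists the old disk.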