TKGI: Bosh resurrector is not re-creating the K8S nodes even though node health status shows "unresponsive agent"


Article ID: 345549


Updated On:

Products

  • VMware Cloud PKS
  • VMware Tanzu Kubernetes Grid Integrated (TKGi)
  • VMware Tanzu Kubernetes Grid Integrated Edition
  • VMware Tanzu Kubernetes Grid Integrated Edition (Core)
  • VMware Tanzu Kubernetes Grid Integrated Edition 1.x
  • VMware Tanzu Kubernetes Grid Integrated Edition Starter Pack (Core)

Issue/Introduction

Symptoms:

  • Running bosh vms shows one or more TKGI Control Plane or Kubernetes cluster nodes with a Process State of unresponsive agent.

NOTE: These VMs could be TKGI Control Plane VMs or Kubernetes cluster VMs.

Example: The following bosh CLI output shows the VM states for a Kubernetes cluster with cluster UUID VVVVVVVV-WWWW-XXXX-YYYY-ZZZZZZZZZZZZ:

$ bosh -d service-instance_VVVVVVVV-WWWW-XXXX-YYYY-ZZZZZZZZZZZZ vms
Instance                                                Process State       AZ   IPs            VM CID                                   VM Type  Active  Stemcell
master/16637df3-aa5c-49a3-9824-c10c83173908             running             AZ1  <IP-REDACTED>  vm-0e98b6ae-d00e-4ab0-a3df-487ea8115606  xlarge   true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
master/60f9453b-5e59-412e-aa62-143c8d2cac57             running             AZ2  <IP-REDACTED>  vm-faee689c-2f75-48b7-ba37-d37e014f6745  xlarge   true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
master/657f6783-a003-4c90-aa19-78b161299658             running             AZ3  <IP-REDACTED>  vm-764247f9-0a27-468d-918a-d9c2d31f8888  xlarge   true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/094ff4c0-9872-42f1-9e58-dce164323ca3  unresponsive agent  AZ2  <IP-REDACTED>  vm-3f189327-1f19-4dcf-9361-511b78fa4ce4  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/0ec4ee91-9831-44d4-a09d-52d364bc57ca  unresponsive agent  AZ3  <IP-REDACTED>  vm-bfcc1cc9-b1ef-43f8-9fe9-708dd4c618a1  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/1c2310c3-8f5f-46c5-b7bd-25505fb7d12f  unresponsive agent  AZ2  <IP-REDACTED>  vm-c1ceee48-59db-4b1e-9e5d-62f7e05d4a08  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/2632be2c-a047-4ae0-b11e-652be1665dad  unresponsive agent  AZ3  <IP-REDACTED>  vm-9293a0bd-8ad5-4083-8714-db0be982ef15  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/2a79da92-bf5a-46e6-8cfd-633417613581  unresponsive agent  AZ2  <IP-REDACTED>  vm-a63350c8-e7a0-4aac-a7b3-4b206fd9b563  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/2b7fde58-fd3b-4dc8-b057-2e1f94798911  unresponsive agent  AZ3  <IP-REDACTED>  vm-850b8574-98dc-473d-92e7-4f4c0e3ff31f  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/2c76746a-be6b-4a06-8488-43df869e76b0  unresponsive agent  AZ3  <IP-REDACTED>  vm-65d93f74-aac0-4540-af5b-b5cddef937d4  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/325adb08-bfcc-44f0-b4bd-3810c56d5b4a  unresponsive agent  AZ2  <IP-REDACTED>  vm-fe9c47ee-e2bf-435a-9e92-682650249515  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/3ccad40f-1ba1-466e-af7f-a19d71326628  unresponsive agent  AZ2  <IP-REDACTED>  vm-4371b645-7d7a-40c0-931f-65f47a19f47d  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/3cfe5cf6-23ee-4f48-80fd-6a0f0cfd3df5  unresponsive agent  AZ3  <IP-REDACTED>  vm-2e7f124c-485a-4d88-8223-da005dc62fd5  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/50443100-93c6-406f-ab1d-54a1bc739bf2  unresponsive agent  AZ3  <IP-REDACTED>  vm-a5fd7b92-09c4-4968-9815-96682a791238  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/51267837-bc58-4df4-9e20-eaaa7c864eef  unresponsive agent  AZ1  <IP-REDACTED>  vm-a3948027-a162-497a-932f-9cfd04663649  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/5b3651d3-3ae3-4d0b-ab84-85e5a9327785  unresponsive agent  AZ1  <IP-REDACTED>  vm-6a4b343c-c3e9-45c5-8b4a-3fae5d5ed10b  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
worker-wf-4xlarge/619b576f-c372-4cf2-954e-84a6a0280219  unresponsive agent  AZ1  <IP-REDACTED>  vm-4ea78bf2-fe9c-4600-a7a5-49edeac1fdad  -        true    bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.360
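
On a cluster with many workers, it can help to reduce this output to just the affected instances. The following is a minimal sketch, not part of the product tooling: list_unresponsive is a hypothetical helper name, and it assumes the plain-text table layout shown above, where the instance name is the first column.

```shell
# Hypothetical helper: read `bosh vms` table output on stdin and print the
# instance name (first column) of every row reporting "unresponsive agent".
list_unresponsive() {
  awk '/unresponsive agent/ { print $1 }'
}

# Usage, with the placeholder deployment name from the example above:
# bosh -d service-instance_VVVVVVVV-WWWW-XXXX-YYYY-ZZZZZZZZZZZZ vms | list_unresponsive
```

Because the filter only matches text, it reports nothing if the Process State wording differs from "unresponsive agent".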

 

  • Check whether the BOSH Resurrector is turned off globally by using the bosh CLI.

NOTE: This command reports only the global status of the setting. Individual BOSH deployments can have their own resurrector settings.

Also, the resurrector status returned by the command line is not the same as the status shown in the Ops Manager UI.

bosh curl /resurrection

Environment

TKGI

Cause

This issue occurs when either the BOSH Resurrector Plugin (VM Resurrector) is not enabled in the BOSH Director tile or vSphere DRS is not set to automatic.

When the vCenter DRS setting has been changed to disabled, turned off, or manual, the BOSH Resurrector cannot work correctly. The Resurrector is unable to delete and terminate the VMs and then have them moved or rebuilt correctly.

Resolution

To resolve this issue, you can take several actions, including:

  • Enable the BOSH Resurrector in the Ops Manager UI and set DRS to Automatic.

  • Check from the bosh CLI that the global resurrector setting is also enabled. If it is not, turn it on:

bosh curl /resurrection

bosh update-resurrection on


NOTE: This should allow BOSH and the Resurrector to repair the nodes, after which the bosh agent should respond again.
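
To confirm that the agents recover after the resurrector is enabled, the bosh vms check from the Symptoms section can be repeated until no instance reports unresponsive agent. The following is a rough sketch under stated assumptions: wait_for_agents is a hypothetical helper name, and the default of 20 attempts at 30-second intervals is an arbitrary choice, not a documented recovery time.

```shell
# Hypothetical helper: poll `bosh vms` for a deployment until no instance
# reports "unresponsive agent", or until the attempt limit is reached.
# Arguments: deployment name, max attempts (assumed default 20),
# seconds between attempts (assumed default 30).
wait_for_agents() {
  deployment="$1"
  attempts="${2:-20}"
  interval="${3:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Stop early if the bosh command itself fails (for example, not logged in),
    # so that a failed call is not mistaken for a healthy deployment.
    output=$(bosh -d "$deployment" vms) || { echo "bosh vms failed" >&2; return 2; }
    if ! printf '%s\n' "$output" | grep -q 'unresponsive agent'; then
      echo "all agents responding"
      return 0
    fi
    echo "attempt $i/$attempts: unresponsive agents remain"
    sleep "$interval"
    i=$((i + 1))
  done
  return 1
}

# Usage, with the placeholder deployment name from the Symptoms example:
# wait_for_agents service-instance_VVVVVVVV-WWWW-XXXX-YYYY-ZZZZZZZZZZZZ
```

A return status of 1 means the agents were still unresponsive when the attempts ran out, in which case the other resolution options apply.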

OR