vSphere with Tanzu guest cluster worker nodes automatically rebuilt without human initiation

Article ID: 399156

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • You notice that vSphere with Tanzu guest cluster worker nodes were rebuilt without any user interaction.
  • The capi-kubeadm-control-plane-controller-manager pod displays log entries similar to:
    [timestamp] stderr F [timestamp] machinehealthcheck_controller.go:434] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.#-###.io" controllerKind="MachineHealthCheck" MachineHealthCheck="###-###/###-###-###-######-##-#####" namespace="###-###" name="###-###-###-######-##-#####" reconcileID=########-####-####-####-############ Cluster="###-###/###-###-###" target="###-###/###-###-###-######-##-#####/###-###-###-######-##-#####-###############-#####/###-###-###-######-##-#####-###############-#####" reason="UnhealthyNode" message="Condition Ready on node is reporting status Unknown for more than 5m0s"
  • You can view the capi-kubeadm-control-plane-controller-manager pod logs by running the following while logged into the Supervisor cluster:
    kubectl logs -n <TKG_NAMESPACE> -l name=capi-kubeadm-control-plane-controller-manager -c manager
  • The guest cluster is now operating normally.
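When the controller log is long, the remediation entries can be condensed to the fields that matter. The following is a minimal sketch, assuming log lines in the format shown above; the function name summarize_remediations is hypothetical and not part of any VMware tooling.

```shell
# Hypothetical helper: reads controller log lines on stdin and prints one
# summary line (target, reason, message) per remediation event.
summarize_remediations() {
  # Keep only the MachineHealthCheck remediation entries, then extract
  # the quoted values of the target, reason and message fields.
  grep 'Target has failed health check, marking for remediation' |
    sed -E 's/.* target="([^"]*)".* reason="([^"]*)".* message="([^"]*)".*/target=\1 reason=\2 message=\3/'
}

# Usage (same log command as above, piped through the helper):
# kubectl logs -n <TKG_NAMESPACE> -l name=capi-kubeadm-control-plane-controller-manager -c manager | summarize_remediations
```

Each output line then shows which machine was remediated and which node condition triggered it.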

Environment

vSphere with Tanzu 8

Cause

vSphere with Tanzu uses machine health checks (MachineHealthCheck) to automatically remediate Kubernetes nodes that are considered unhealthy. The node conditions checked include MemoryPressure, DiskPressure, PIDPressure and NetworkUnavailable; as the log entry above shows, the Ready condition is also evaluated. If a worker node remains in one of these unhealthy conditions for 5 minutes, it is automatically rebuilt (remediated). See Configure MachineHealthCheck for v1beta1 Clusters for more information.
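To see which conditions and timeouts apply in a given environment, the MachineHealthCheck objects can be inspected from the Supervisor cluster. The following is a minimal sketch, assuming the objects are in the namespace that holds the guest cluster resources; the function name show_mhc and its namespace argument are hypothetical.

```shell
# Hypothetical helper: dumps the MachineHealthCheck objects of one namespace.
# Assumption: $1 is the namespace that contains the guest cluster objects.
show_mhc() {
  ns="${1:?usage: show_mhc <namespace>}"
  # In the YAML output, spec.unhealthyConditions lists each condition
  # type, the status that counts as unhealthy, and its timeout.
  kubectl get machinehealthcheck -n "$ns" -o yaml
}

# Usage while logged in to the Supervisor cluster:
# show_mhc my-namespace
```

Comparing the timeout values in the output with the duration reported in the log message helps confirm which check triggered the rebuild.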

Resolution

Although unhealthy guest cluster nodes are remediated automatically, proper monitoring and maintenance of both the vSphere and Kubernetes environments is essential to avoid unnecessary remediations. Each node remediation consumes CPU, network and disk resources within the vSphere environment.

VMware Cloud Foundation Operations can be used to monitor both the vSphere environment and Kubernetes clusters.
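For a quick manual check between monitoring intervals, node readiness can also be reviewed from the command line. The following is a minimal sketch, assuming the standard column layout of kubectl get nodes (name first, status second); the function name flag_unready_nodes is hypothetical.

```shell
# Hypothetical helper: reads "kubectl get nodes --no-headers" output on
# stdin and prints the name and status of every node whose status is not
# exactly "Ready". Cordoned nodes ("Ready,SchedulingDisabled") are
# therefore listed as well.
flag_unready_nodes() {
  awk '$2 != "Ready" { print $1 " " $2 }'
}

# Usage while logged in to the guest cluster:
# kubectl get nodes --no-headers | flag_unready_nodes
```

An empty result means every node currently reports Ready; any listed node is a candidate for investigation before the 5-minute health check timeout is reached.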