TKG clusters have been observed flipping between "Ready: True" and "Ready: False" due to the fluent-bit package being stuck in the Reconciling state.

Article ID: 392433

Products

VMware vSphere Kubernetes Service

Issue/Introduction

The READY state of the TKG cluster switches to False every few minutes.

kubectl get tkc

NAME          CONTROL PLANE   WORKER   TKR NAME                           AGE    READY   TKR COMPATIBLE   UPDATES AVAILABLE
tanzu-xxxxx   3              5        v1.26.12---vmware.2-fips.1-tkg.2   441d   False   True
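To confirm that READY is actually flipping rather than staying False, the state can be sampled over time and the changes counted. The following is a minimal sketch, not an official tool: the helper name count_ready_flips is hypothetical, and the commented usage assumes READY is the sixth column of the kubectl get tkc output, as in the sample above.

```shell
# Count how many times consecutive READY samples differ.
# Reads one sample (True/False) per line on stdin and prints the flip count.
count_ready_flips() {
  awk 'NR > 1 && $1 != prev { flips++ } { prev = $1 } END { print flips + 0 }'
}

# Hypothetical usage: sample the cluster every 30 seconds, 20 times.
# The cluster name is the placeholder from the output above.
#   for i in $(seq 1 20); do
#     kubectl get tkc tanzu-xxxxx --no-headers | awk '{ print $6 }'
#     sleep 30
#   done | count_ready_flips
```

A result of 0 means the state never changed during the sampling window; any higher number is consistent with the flipping behaviour described here.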

tanzu package installed list -A

NAMESPACE      NAME          PACKAGE-NAME                   PACKAGE-VERSION        STATUS
tanzu-package  cert-manager  cert-manager.tanzu.vmware.com  1.12.2+vmware.1-tkg.1  Reconcile succeeded
tanzu-package  contour       contour.tanzu.vmware.com       1.25.2+vmware.1-tkg.1  Reconcile succeeded
tanzu-package  fluent-bit    fluent-bit.tanzu.vmware.com    1.9.5+vmware.1-tkg.2   Reconcile failed
tanzu-package  grafana       grafana.tanzu.vmware.com       10.0.1+vmware.1-tkg.1  Reconcile succeeded
tanzu-package  prometheus    prometheus.tanzu.vmware.com    2.37.0+vmware.3-tkg.1  Reconcile succeeded
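On clusters with many packages, the failing one can be picked out of the listing with a small filter. This is a sketch under the assumption that the table layout matches the output above (NAMESPACE and NAME in the first two columns, STATUS last); the function name failed_packages is hypothetical.

```shell
# Print NAMESPACE/NAME for every package whose STATUS is not "Reconcile succeeded".
# Reads the table printed by `tanzu package installed list -A` on stdin.
failed_packages() {
  awk 'NF >= 2 && $1 != "NAMESPACE" &&
       !($(NF-1) == "Reconcile" && $NF == "succeeded") { print $1 "/" $2 }'
}

# Usage (requires the tanzu CLI and access to the cluster):
#   tanzu package installed list -A | failed_packages
```

Fed the table above, the filter prints tanzu-package/fluent-bit.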

 

Fluentd Log Snippet Indicating an Issue with Elasticsearch:

2025-02-21T08:26:24.019008509Z stdout F 2025-02-21 08:26:24 +0000 [warn]: #0 [in_tail_container_logs] Pattern not matched:  
"2025-02-21T08:11:10.006873071Z stderr F {\"log.level\":\"error\", \"message\":\"Failed to connect to backoff(elasticsearch(http://10.190.1.9:9200)): Connection marked as failed because the onConnect callback failed: 429 Too Many Requests\"}"
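When reviewing a longer log, it can help to list which Elasticsearch endpoints appear in connection-failure messages of this form. The sketch below assumes the log lines contain the same "elasticsearch(http...)" pattern as the snippet above; the function name es_failed_endpoints and the placeholders in the usage comment are hypothetical.

```shell
# Print each distinct Elasticsearch endpoint that appears in a log (stdin)
# in the form "elasticsearch(http://host:port)".
es_failed_endpoints() {
  grep -o 'elasticsearch(http[^)]*)' |
    sed -e 's/^elasticsearch(//' -e 's/)$//' |
    sort -u
}

# Usage (namespace and pod name are placeholders to fill in):
#   kubectl logs -n <namespace> <pod-name> | es_failed_endpoints
```

For the snippet above this yields http://10.190.1.9:9200, the endpoint that returned 429 Too Many Requests.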

Environment

VMware vSphere with Tanzu

Cause

The issue is caused by Fluent Bit being unable to connect to Elasticsearch, which results in a Reconcile failed status.

Fluent Bit needs to establish a connection with the Elasticsearch server to forward logs. If the network is unavailable, Fluent Bit detects the error and the fluent-bit package enters the Reconciling state.

Resolution

This issue is caused by a known Antrea bug.

According to the release notes, TKG clusters using Antrea package v1.11.1 may randomly enter the ClusterBootstrapReconciling state, which causes networking issues. In this case, the TKC on v1.26.12 is using Antrea version 1.11.1.

vSphere Supervisor 7.0 Release Notes
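To check whether a given Antrea version is at or above v1.11.2, the first version this article names as resolving the issue, a simple version comparison can be used. This is a sketch only: the function name antrea_has_fix is hypothetical, it relies on sort -V (available in GNU coreutils) being present, and how the running Antrea version is looked up on a given cluster is left to the operator.

```shell
# Return success (0) if the given Antrea version is v1.11.2 or later.
# Accepts forms such as "v1.11.1", "1.11.2" or "1.11.2+vmware.1-tkg.1".
antrea_has_fix() {
  v=${1#v}        # strip a leading "v"
  v=${v%%+*}      # drop a build suffix such as "+vmware.1-tkg.1"
  lowest=$(printf '%s\n%s\n' "1.11.2" "$v" | sort -V | head -n 1)
  [ "$lowest" = "1.11.2" ]
}

# Usage:
#   if antrea_has_fix v1.11.1; then echo "fixed"; else echo "affected or older"; fi
```

Note that a failing check only means the version is below v1.11.2. The release notes call out v1.11.1 specifically, so older versions are not necessarily affected.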

 

Workarounds:

1. Upgrade TKG Cluster

Upgrade to TKR v1.26.13 with Antrea version v1.11.2 or later to resolve the issue.

2. Remove Fluent Bit Package (if not used as a log forwarder)

If Fluent Bit is not used for log forwarding, it can be removed to stabilize the TKC cluster.

Steps to remove Fluent Bit:
  1. List the installed Fluent Bit package:

    kubectl get pkgi -A | grep fluent
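  2. Delete the installed Fluent Bit package:

    kubectl delete pkgi -n <fluent-bit-package-namespace> <fluent-bit-package-name>
  3. Verify Fluent Bit has been removed:

    kubectl get pkgi -A | grep fluent

    The command should return no Fluent Bit entry. kubectl prints the following only when no packages remain at all:

    No resources found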
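The removal steps above can be sketched as a dry-run helper that prints the delete command for each matching package instead of running it, so the result can be reviewed first. It assumes NAMESPACE and NAME are the first two columns of the kubectl get pkgi -A output; the function name fluent_delete_commands is hypothetical.

```shell
# Read `kubectl get pkgi -A` output on stdin and print, without executing,
# a delete command for every row that mentions "fluent".
fluent_delete_commands() {
  awk '$1 != "NAMESPACE" && $0 ~ /fluent/ {
         printf "kubectl delete pkgi -n %s %s\n", $1, $2
       }'
}

# Usage: review the printed commands, then run them by hand.
#   kubectl get pkgi -A | fluent_delete_commands
```

Printing rather than executing is a deliberate choice here: deleting a PackageInstall removes the package from the cluster, so it is worth confirming the namespace and name before proceeding.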