TKG clusters have been observed flipping between "Ready: True" and "Ready: False" due to the fluent-bit package being stuck in the Reconciling state.
search cancel

TKG clusters have been observed flipping between "Ready: True" and "Ready: False" due to the fluent-bit package being stuck in the Reconciling state.

book

Article ID: 392433

calendar_today

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • TKG ready state flips to False every few minutes.

    kubectl get tkc

    NAME          CONTROL PLANE   WORKER   TKR NAME                           AGE    READY   TKR COMPATIBLE   UPDATES AVAILABLE
    tanzu-xxxxx   3              5        v1.26.12---vmware.2-fips.1-tkg.2   441d   False   True

  • Verify the installed packages. It is noted that fluent-bit is in 'Reconcile failed' state.

    tanzu package installed list -A

    NAMESPACE      NAME          PACKAGE-NAME                   PACKAGE-VERSION        STATUS
    tanzu-package  cert-manager  cert-manager.tanzu.vmware.com  1.12.2+vmware.1-tkg.1  Reconcile succeeded
    tanzu-package  contour       contour.tanzu.vmware.com       1.25.2+vmware.1-tkg.1  Reconcile succeeded
    tanzu-package  fluent-bit    fluent-bit.tanzu.vmware.com    1.9.5+vmware.1-tkg.2   Reconcile failed
    tanzu-package  grafana       grafana.tanzu.vmware.com       10.0.1+vmware.1-tkg.1  Reconcile succeeded
    tanzu-package  prometheus    prometheus.tanzu.vmware.com    2.37.0+vmware.3-tkg.1  Reconcile succeeded

  • Verifying the fluentd logs, an Issue with Elasticsearch is seen:

    YYYY-MM-DDTHH:MM:SSZ stdout [warn]: #0 [in_tail_container_logs] Pattern not matched:  
    "YYYY-MM-DDTHH:MM:SSZ stderr F {\"log.level\":\"error\", \"message\":\"Failed to connect to backoff(elasticsearch(http://<IP>:9200)): Connection marked as failed because the onConnect callback failed: 429 Too Many Requests\"}"

Environment

vSphere Kubernetes Service

Cause

  • Fluent Bit is unable to connect to Elasticsearch, resulting in a Reconcile failed status.
  • Fluent Bit needs to establish a connection with Elasticsearch server for log forwarding. If network is unavailable, Fluent Bit detects error and enters Reconciling state.
  • This issue is caused by a known Antrea bug.
  • According to release notes, TKG clusters using Antrea package v1.11.1 may randomly enter ClusterBootstrapReconciling state, causing networking issues. 

Resolution

Workarounds:

1. Upgrade TKG Cluster

  • Upgrade to TKR with Antrea version v1.11.2 or later to resolve the issue.

2. Remove Fluent Bit Package (if not used as a log forwarder)

  1. List the installed Fluent Bit package:

    kubectl get pkgi -A | grep fluent
  2. Delete the installed Fluent Bit package:

    kubectl delete pkgi -n <fluent-bit-package-namespace> <fluent-bit-package-name>
  3. Verify Fluent Bit has been removed:

    kubectl get pkgi -A | grep fluent
    No resources found