TKG clusters have been observed flipping between "Ready: True" and "Ready: False" due to the fluent-bit package being stuck in the Reconciling state.

search cancel

book

calendar_today

VMware vSphere Kubernetes Service

TKG ready state flips to False every few minutes.

kubectl get tkc

NAME CONTROL PLANE WORKER TKR NAME AGE READY TKR COMPATIBLE UPDATES AVAILABLE
tanzu-xxxxx 3 5 v1.26.12---vmware.2-fips.1-tkg.2 441d False True
Verify the installed packages. It is noted that fluent-bit is in 'Reconcile failed' state.

tanzu package installed list -A

NAMESPACE NAME PACKAGE-NAME PACKAGE-VERSION STATUS
tanzu-package cert-manager cert-manager.tanzu.vmware.com 1.12.2+vmware.1-tkg.1 Reconcile succeeded
tanzu-package contour contour.tanzu.vmware.com 1.25.2+vmware.1-tkg.1 Reconcile succeeded
tanzu-package fluent-bit fluent-bit.tanzu.vmware.com 1.9.5+vmware.1-tkg.2 Reconcile failed
tanzu-package grafana grafana.tanzu.vmware.com 10.0.1+vmware.1-tkg.1 Reconcile succeeded
tanzu-package prometheus prometheus.tanzu.vmware.com 2.37.0+vmware.3-tkg.1 Reconcile succeeded
Verifying the fluentd logs, an Issue with Elasticsearch is seen:

YYYY-MM-DDTHH:MM:SSZ stdout [warn]: #0 [in_tail_container_logs] Pattern not matched:
"YYYY-MM-DDTHH:MM:SSZ stderr F {\"log.level\":\"error\", \"message\":\"Failed to connect to backoff(elasticsearch(http://<IP>:9200)): Connection marked as failed because the onConnect callback failed: 429 Too Many Requests\"}"

vSphere Kubernetes Service

Fluent Bit is unable to connect to Elasticsearch, resulting in a Reconcile failed status.
Fluent Bit needs to establish a connection with Elasticsearch server for log forwarding. If network is unavailable, Fluent Bit detects error and enters Reconciling state.
This issue is caused by a known Antrea bug.
According to release notes, TKG clusters using Antrea package v1.11.1 may randomly enter ClusterBootstrapReconciling state, causing networking issues.

List the installed Fluent Bit package:
kubectl get pkgi -A | grep fluent
Delete the installed Fluent Bit package:
kubectl delete pkgi -n <fluent-bit-package-namespace> <fluent-bit-package-name>
Verify Fluent Bit has been removed:
kubectl get pkgi -A | grep fluent
No resources found

thumb_up Yes

thumb_down No