fluentbit pods "error registering chunk" log errors

Article ID: 375792


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

In TKGi environments where Log Sink resources are enabled in the TKGi tile > In-Cluster Monitoring, fluentbit pods in the pks-system namespace may show numerous log entries as follows:

fluent-bit [2024/08/06 11:45:56] [error] [input:emitter:emitter_for_multiline.0] error registering chunk with tag: kube.var.log.containers.fluent-bit-<>_pks-system_fluent-bit-<>.log

You may also notice intermittent CPU usage spikes and/or OOMKilled conditions in the pods.
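
To confirm the symptom, you can grep the fluent-bit pod logs for the error and check whether the pods are being restarted (an OOMKilled reason appears under "Last State" in the pod description). The commands below are a minimal sketch, assuming the default pks-system deployment where fluentbit runs as the fluent-bit DaemonSet referenced later in this article:

  # Count occurrences of the error in each fluent-bit pod's logs
  for pod in $(kubectl get pods -n pks-system -o name | grep fluent-bit); do
    echo "$pod: $(kubectl logs -n pks-system "$pod" 2>/dev/null | grep -c 'error registering chunk')"
  done

  # Check restart counts and look for OOMKilled terminations
  kubectl get pods -n pks-system | grep fluent-bit
  kubectl describe pods -n pks-system | grep -B5 -i 'OOMKilled'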

Cause

This can be caused by a spike in logging from the applications monitored by fluentbit through LogSink/ClusterLogSink resources.
The most likely root cause is saturation of the memory buffers at two different levels: the fluentbit container level and the in_emitter plugin level.

There's an upstream fluentbit known issue covering this case: https://github.com/fluent/fluent-bit/issues/8198
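
To see the two limits currently in effect, you can inspect the container resources of the fluent-bit DaemonSet and the multiline filter configuration. This is a sketch only; it assumes the fluent-bit DaemonSet and ConfigMap names referenced in the Resolution section below, and that fluent-bit is the first container in the pod spec:

  # Memory limit of the fluentbit container (first buffer level)
  kubectl get ds fluent-bit -n pks-system \
    -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'

  # Multiline filter configuration (second buffer level); if emitter_mem_buf_limit
  # is not set, the in_emitter plugin default applies
  kubectl get configmap fluent-bit -n pks-system -o yaml | grep -A 8 'filter-multiline.conf'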

Resolution

The upstream fluentbit fix is included in v3.0.2: https://github.com/fluent/fluent-bit/pull/8473
TKGi will include an updated fluentbit version containing the fix in a future release (no ETA yet).

As a workaround, it is suggested to increase the following Memory Limits:

  • The fluentbit container Memory Limit in the TKGi tile > In-Cluster Monitoring (see Increase fluentbit container Memory Limit on TKGi tile).
    Then Apply Changes from OpsMan and upgrade the TKGi clusters.

  • The emitter_mem_buf_limit parameter under filter-multiline.conf in the fluent-bit ConfigMap in the pks-system namespace (emitter_mem_buf_limit Docs)

    E.g.

      filter-multiline.conf: |
        [FILTER]
            Name                   multiline
            Match                  *
            multiline.key_content  log
            multiline.parser       go, java, python
            emitter_mem_buf_limit  50MB

    After editing the ConfigMap, restart the fluentbit pods with the command "kubectl rollout restart ds fluent-bit -n pks-system" (an example of the full sequence is shown below).

    Please note that the changes in the ConfigMap are not persistent across TKGi upgrades.
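
    For reference, the full workaround sequence for this ConfigMap change could look like the following (assuming the ConfigMap is named fluent-bit, matching the DaemonSet):

      # Edit filter-multiline.conf and set/raise emitter_mem_buf_limit
      kubectl edit configmap fluent-bit -n pks-system

      # Restart the DaemonSet so the pods pick up the new configuration, and wait for the rollout
      kubectl rollout restart ds fluent-bit -n pks-system
      kubectl rollout status ds fluent-bit -n pks-system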

The values to which these limits need to be increased depend on many factors, such as the logging pressure from the monitored applications.
These values are empirical, so fine-tuning them can only be done through trial and error. If the initial values are not enough, keep increasing them until the problem is resolved.