fluent-bit pod in CrashLoopBackOff state

Article ID: 415350

Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

The fluent-bit pod in the TKGI namespace pks-system is in a CrashLoopBackOff state on one or more worker nodes, with error logs similar to:

[error] [/tmp/fluent-bit-xyz/plugins/in_tail/tail_fs_inotify.c:360 errno=24] Too many open files
[error] failed initialize input tail.0
[error] [engine] input initialization failed
[error] [lib] backend failed
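
To tell this failure apart from other crash causes, the log of the previously crashed container can be searched for the signature above. The following is a minimal sketch, assuming kubectl access to the affected cluster; the helper name has_inotify_error and the pod name placeholder are illustrative and not part of TKGI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 when the log text on stdin contains the
# "Too many open files" message that accompanies errno=24.
has_inotify_error() {
  grep -q 'Too many open files'
}

# Example usage (placeholders must be replaced with real values):
#   kubectl -n pks-system get pods -o wide | grep fluent-bit
#   kubectl -n pks-system logs <fluent-bit-pod-name> --previous \
#     | has_inotify_error && echo "inotify/open-file limit reached"
```

The -o wide output includes the node name, which shows which worker nodes are affected.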

Environment

Tanzu Kubernetes Grid Integrated Edition 

Cause

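The Fluent Bit pod crashes because it exceeds the worker node's inotify or open-file limits while tailing the cluster's log files. The errno=24 (EMFILE) reported from tail_fs_inotify.c indicates that the tail input could not create another inotify instance or open another file descriptor.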
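
Before changing anything, it can help to record the limits currently in effect on an affected worker node. The sketch below reads them from /proc/sys; the function name show_limits and the PROC_SYS override variable are assumptions added for illustration and testing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the current value of each kernel limit
# related to this issue, in sysctl "key=value" form.
# PROC_SYS is an assumed override for testing; leave it unset on a real node.
show_limits() {
  local root="${PROC_SYS:-/proc/sys}"
  local key name
  for key in fs/inotify/max_user_watches fs/inotify/max_user_instances \
             fs/inotify/max_queued_events fs/file-max; do
    # Convert the /proc/sys path into the dotted sysctl name.
    name="$(printf '%s' "$key" | tr '/' '.')"
    if [ -r "$root/$key" ]; then
      printf '%s=%s\n' "$name" "$(cat "$root/$key")"
    else
      printf '%s=unavailable\n' "$name"
    fi
  done
}

# Example usage on a worker node:
#   show_limits
```

A low fs.inotify.max_user_instances value on a node that runs many pods is a plausible trigger for the error, since every process that watches files consumes at least one inotify instance.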

Resolution

Raise the inotify and file-handle sysctl limits on the OS of the worker nodes where the fluent-bit pod runs.

To make the change permanent, create a BOSH runtime config and apply it through the BOSH Director.

vim runtime-config.yaml
-----
releases:
- name: "os-conf"
  version: "23.0.0"
addons:
- name: fluent-bit-os-max-config
  jobs:
  - name: sysctl
    release: os-conf
    properties:
      sysctl:
      - fs.inotify.max_user_watches=524288
      - fs.inotify.max_user_instances=16383
      - fs.inotify.max_queued_events=524288
      - fs.file-max=2097152
  include:
    deployments: [service-instance_xxx-yyy]    # Optional: deployments (TKGI clusters) the runtime config applies to.
    instance_groups: [<master and/or worker, as defined in the deployment manifest>]    # Optional: instance groups (cluster nodes, i.e. masters/workers) the runtime config applies to.
  exclude:
    deployments: [<service-instance_XXXXXXXXXX>]    # Optional: deployments (TKGI clusters) the runtime config does not apply to.
    instance_groups: [<master and/or worker, as defined in the deployment manifest>]    # Optional: instance groups (cluster nodes, i.e. masters/workers) the runtime config does not apply to.

# Upload the runtime config to the BOSH Director:

bosh update-config --type=runtime --name fluent-bit-os-max runtime-config.yaml

# Download the manifest of the service instance (cluster) whose fluent-bit pod is failing, then redeploy it:

bosh -d service-instance_xxx-yyy-zzz manifest > service-instance_xxx-yyy-zzz.yaml
bosh -d service-instance_xxx-yyy-zzz deploy service-instance_xxx-yyy-zzz.yaml
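
Once the deployment finishes, the new values can be checked on a worker node. The following is a rough sketch that compares sysctl output against the values from the runtime config above; the function name check_limits is hypothetical, and the instance name worker/0 in the usage comment is an assumption about the deployment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `sysctl <key>...` output on stdin and report
# every expected setting (taken from the runtime config above) that is
# missing or has a different value. Returns non-zero on any mismatch.
check_limits() {
  local expected="fs.inotify.max_user_watches=524288
fs.inotify.max_user_instances=16383
fs.inotify.max_queued_events=524288
fs.file-max=2097152"
  local actual line rc=0
  # sysctl prints "key = value"; normalise to "key=value".
  actual="$(sed 's/[[:space:]]*=[[:space:]]*/=/')"
  for line in $expected; do
    if ! printf '%s\n' "$actual" | grep -qxF -- "$line"; then
      echo "MISMATCH: expected $line"
      rc=1
    fi
  done
  return "$rc"
}

# Example usage, run on the worker node itself, e.g. after
#   bosh -d service-instance_xxx-yyy-zzz ssh worker/0
# and pasting the function into the shell:
#   sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances \
#          fs.inotify.max_queued_events fs.file-max | check_limits
```

If no mismatch is printed, the fluent-bit pod on that node should leave CrashLoopBackOff after its next restart, which can be confirmed with kubectl -n pks-system get pods.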