Kubelet failing with an error "Failed to allocate directory watch: Too many open files"

Article ID: 375773

Products

Tanzu Kubernetes Grid

Issue/Introduction

Worker nodes in a TKGm management cluster go into NotReady state, and the kubelet logs on those nodes display the error "Failed to allocate directory watch: Too many open files".
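
The symptom can be confirmed from the management cluster and on an affected node. The commands below are standard kubectl and journalctl usage, assuming the kubelet runs as a systemd service on the node:

    kubectl get nodes

    sudo journalctl -u kubelet --no-pager | grep -i "Too many open files"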

Environment

VMware Tanzu Kubernetes Grid v2.3
Telco Cloud Automation 2.1.1

Cause

The /usr/sbin/syslog-ng -F processes on the node are the highest consumers of inotify instances.

These syslog-ng processes open telemetry log files in order to send cluster information to VMware. Because the customer's environment is air-gapped, the information cannot be sent and the log files remain open, eventually exhausting the available inotify instances.
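
To confirm which processes are holding inotify instances, the inotify file descriptors can be counted per PID on the affected node. This is a generic Linux diagnostic, not taken from this article; the largest counts are expected to belong to the syslog-ng processes:

    # count inotify file descriptors per PID
    sudo find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | cut -d/ -f3 | sort | uniq -c | sort -rn | head

    # show the current per-user limit (128 by default)
    cat /proc/sys/fs/inotify/max_user_instances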

Resolution

The workaround for this issue is to opt out of the VMware Customer Experience Improvement Program (CEIP) using the Tanzu CLI.

1. Export the kubeconfig that targets your management cluster as an environment variable:

    export KUBECONFIG=~/.kube/config

2. Run the tanzu telemetry participation update --CEIP-opt-out command.

    tanzu telemetry participation update --CEIP-opt-out
3. Verify that CEIP participation is deactivated by running tanzu telemetry participation status. The ceip level should now be reported as disabled.

    tanzu telemetry participation status

    - ceip: |
        level: disabled
        shared_identifiers: ...

4. Create a ConfigMap named vmware-telemetry-cluster-ceip in the tkg-system-telemetry namespace:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: tkg-system-telemetry
  name: vmware-telemetry-cluster-ceip
data:
  level: disabled
EOF
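
To double-check the result of this step, the ConfigMap can be read back with standard kubectl (not part of the original procedure):

kubectl get configmap vmware-telemetry-cluster-ceip -n tkg-system-telemetry -o yaml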

5. Delete the tkg-telemetry CronJob, taking a backup of it first, and then delete the telemetry jobs it has created. The pods in Error state are deleted automatically along with their jobs.

kubectl get cronjob tkg-telemetry -o yaml -n tkg-system-telemetry > backup_cronjob_tkg-telemetry

kubectl delete cronjob tkg-telemetry -n tkg-system-telemetry

kubectl delete job <job-name> -n tkg-system-telemetry
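
The <job-name> placeholder refers to the telemetry jobs spawned by the CronJob; they can be listed with standard kubectl before deletion:

kubectl get jobs -n tkg-system-telemetry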

6. SSH into the node that reports NotReady and increase the fs.inotify.max_user_instances sysctl parameter from the default of 128 to 256:

sudo sysctl fs.inotify.max_user_instances=256
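
Note that setting the value with sysctl changes only the running kernel, and the reboot in the next step resets it to the default. If the higher limit should survive reboots, it can also be written to a sysctl configuration file; the file name 99-inotify.conf below is only an example:

echo "fs.inotify.max_user_instances=256" | sudo tee /etc/sysctl.d/99-inotify.conf

sudo sysctl --system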

7. Reboot the node so that the changes take effect, then verify that the node returns to the Ready state:

kubectl get nodes
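
Optionally, confirm on the node that the new inotify limit is still in effect after the reboot; if it has reverted to 128, persist it as described under step 6:

sysctl fs.inotify.max_user_instances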