Worker nodes in a TKGm management cluster go into NotReady state, and the kubelet logs display the error "Failed to allocate directory watch: Too many open files".
VMware Tanzu Kubernetes Grid v2.3
Telco Cloud Automation 2.1.1
The /usr/sbin/syslog-ng -F processes on the node are the highest consumers of inotify instances.
These syslog-ng processes open the maximum allowed number of telemetry log files in order to send information to VMware. Because the customer's environment is air-gapped, the log files remain open and the cluster telemetry cannot be delivered to VMware.
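To confirm which processes are holding inotify instances on an affected node, the file descriptors under /proc can be inspected: every fd symlink that points to anon_inode:inotify is one inotify instance. This is a generic Linux diagnostic, not a VMware-specific tool:

```shell
# List the top inotify consumers per process: count the fd symlinks
# that resolve to anon_inode:inotify, grouped by PID.
find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null \
  | cut -d/ -f3 \
  | sort | uniq -c | sort -rn | head
```

The first column is the instance count and the second is the PID; `ps -p <pid> -o comm=` shows the owning command, which in this case is expected to be syslog-ng.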
The workaround for this issue is to opt out of the VMware CEIP using the Tanzu CLI.
1. Export the kubeconfig targeting your management cluster to an environment variable:
export KUBECONFIG=~/.kube/config
2. Run the tanzu telemetry participation update --CEIP-opt-out command:
tanzu telemetry participation update --CEIP-opt-out
3. Verify that CEIP participation is deactivated by running tanzu telemetry participation status. The status should now be disabled:
- ceip: |
    level: disabled
    shared_identifiers: ...
4. Create a ConfigMap named vmware-telemetry-cluster-ceip in the tkg-system-telemetry namespace:
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: tkg-system-telemetry
  name: vmware-telemetry-cluster-ceip
data:
  level: disabled
EOF
5. Take a backup of the tkg-telemetry cronjob, then delete it. After the deletion, all the pods in Error state are cleaned up automatically.
kubectl get cronjob tkg-telemetry -oyaml -n tkg-system-telemetry > backup_cronjob_tkg-telemetry
kubectl delete cronjob tkg-telemetry -n tkg-system-telemetry
6. SSH into the node that reports NotReady and increase the sysctl parameter fs.inotify.max_user_instances from its default of 128 to 256:
sudo sysctl fs.inotify.max_user_instances=256
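Note that this sysctl change takes effect immediately but is runtime-only and does not survive a reboot. To keep the higher limit after rebooting, it can also be persisted in a drop-in file under /etc/sysctl.d (the filename below is an example):

```
# /etc/sysctl.d/90-inotify.conf (example filename)
fs.inotify.max_user_instances = 256
```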
7. Reboot the node for the changes to take effect, then verify node health; the node must be in the Ready state:
kubectl get nodes