bosh -d <SERVICE_INSTANCE> is --ps reports that all Tanzu Kubernetes Grid Integrated Edition (TKGI) cluster jobs are healthy. However, kubectl get nodes reports some worker nodes as "Not Ready".
The kubelet logs under /var/vcap/sys/log/kubelet keep repeating the following errors:
E1208 13:21:33.114542 11981 remote_runtime.go:277] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16848940 vs. 16777216)
E1208 13:21:33.114542 11981 kuberuntime_container.go:395] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16848940 vs. 16777216)
E1208 13:21:33.114542 11981 generic.go:205] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16848940 vs. 16777216)
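To confirm the symptom, you can count these errors directly on the affected worker. A minimal sketch, shown against an inlined sample log line so it runs anywhere (the exact log file name under /var/vcap/sys/log/kubelet is an assumption about the BOSH job layout):

```shell
# Sample kubelet log line (abridged) standing in for the real log:
line='E1208 13:21:33.114542 11981 remote_runtime.go:277] ListContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16848940 vs. 16777216)'

# Count occurrences of the error signature:
echo "$line" | grep -c 'ResourceExhausted'

# On the worker itself, run the same grep against the kubelet log,
# for example (file name is an assumption):
#   grep -c 'ResourceExhausted' /var/vcap/sys/log/kubelet/kubelet.stderr.log
```

A steadily growing count indicates the node is hitting the gRPC message size limit described below.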
This issue usually occurs with the Docker container runtime.
The Kubernetes kubelet is responsible for retrieving the current state of all containers on a node (as part of the Pod Lifecycle Event Generator, PLEG) over gRPC from the container runtime. This gRPC channel has a maximum message size of 16 MB. If the combined metadata of all containers on a node grows beyond this 16 MB limit, kubelet can no longer query the state of its containers, which eventually causes the worker node to report "Not Ready". For more information, refer to Kubernetes issue #63858, "Bug: ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)".
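The numbers in the error message line up with that limit: 16 MiB is exactly the 16777216 bytes in the log, and the ListContainers response was roughly 70 KB over it. A quick arithmetic check:

```shell
# Default gRPC max message size used by kubelet: 16 MiB
echo $((16 * 1024 * 1024))      # 16777216 bytes

# How far the ListContainers response from the log exceeded the limit:
echo $((16848940 - 16777216))   # 71724 bytes over
```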
Restarting kubelet, or even rebooting the node, does not fix this issue, because the containers whose accumulated metadata exceeds the limit are still present and are not removed.
To work around this issue, manually remove unused containers to reduce the total metadata size. Follow these steps:
ubuntu@jumpbox:~$ bosh -d service-instance_*** ssh worker/0
...
worker/180bad63-e295-4d0a-b410-cd51c39cc253:~$ sudo -i
worker/180bad63-e295-4d0a-b410-cd51c39cc253:~# /var/vcap/packages/docker/bin/docker system prune
WARNING! This will remove:
  - all stopped containers
  - all networks not used by at least one container
  - all dangling images
  - all dangling build cache
Are you sure you want to continue? [y/N] y
Deleted Containers:
25adab361e569378a099d544494c9f94fea1fd3c3252aa075a60bfecfe6cb441
...
f613fb850d8ab35dce007120bc0790950b9ace922146d61749569874b8fed89a

Deleted Images:
untagged: registry.tkg.vmware.run/pause@sha256:c2fb43c43c279b1103dae3523b5817dfb48fc9c6001170401447b3da722973a0
deleted: sha256:84ee8339528d282608cc5923a6093277d68478f5137932ed40e6b25480f91f2a
...
deleted: sha256:74ed84873cc71d35e5cf83208bec3365573619b88976742e5096ebfacca89870

Total reclaimed space: 12.45GB

worker/180bad63-e295-4d0a-b410-cd51c39cc253:~# monit restart docker && monit restart kubelet
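After docker and kubelet have been restarted, the node should return to Ready within a few minutes. A minimal sketch of checking this from kubectl output (the node name and version below are made-up sample output so the snippet is self-contained; on a real cluster, pipe `kubectl get nodes --no-headers` into the same awk filter instead):

```shell
# Sample `kubectl get nodes --no-headers` output (hypothetical node):
nodes='180bad63-e295-4d0a-b410-cd51c39cc253   Ready   <none>   212d   v1.17.5'

# Count nodes whose STATUS column is not "Ready"; 0 means all recovered:
echo "$nodes" | awk '$2 != "Ready" {n++} END {print n+0}'
```

If the count stays above 0, re-check the kubelet log on the remaining nodes for the same ResourceExhausted errors.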