bosh -d <SERVICE_INSTANCE> is --ps reports that all Tanzu Kubernetes Grid Integrated Edition (TKGI) cluster jobs are healthy. However, kubectl get nodes reports some worker nodes as "Not Ready".
The kubelet logs under /var/vcap/sys/log/kubelet keep repeating the following errors:
E1208 13:21:33.114542 11981 remote_runtime.go:277] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16848940 vs. 16777216)
E1208 13:21:33.114542 11981 kuberuntime_container.go:395] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16848940 vs. 16777216)
E1208 13:21:33.114542 11981 generic.go:205] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16848940 vs. 16777216)
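To confirm the symptom, you can count these errors directly on the affected worker. A minimal sketch, shown against an inlined sample log line so it runs anywhere (the exact log file name under /var/vcap/sys/log/kubelet is an assumption about the BOSH job layout):

```shell
# Sample kubelet log line (abridged) standing in for the real log:
line='E1208 13:21:33.114542 11981 remote_runtime.go:277] ListContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (16848940 vs. 16777216)'

# Count occurrences of the error signature:
echo "$line" | grep -c 'ResourceExhausted'

# On the worker itself, run the same grep against the kubelet log,
# for example (file name is an assumption):
#   grep -c 'ResourceExhausted' /var/vcap/sys/log/kubelet/kubelet.stderr.log
```

A steadily growing count indicates the node is hitting the gRPC message size limit described below.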
This issue usually occurs with the Docker container runtime.
The Kubernetes kubelet is responsible for retrieving the current state of all containers on a node (as part of the Pod Lifecycle Event Generator, PLEG) over gRPC from the container runtime. This gRPC channel has a maximum message size of 16 MB. If the combined metadata of all containers on a node grows beyond this 16 MB limit, kubelet can no longer query the state of its containers, which eventually causes the worker node to report "Not Ready". For more information, refer to Kubernetes issue #63858, "Bug: ResourceExhausted desc = grpc: received message larger than max (4195017 vs. 4194304)".
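The numbers in the error message line up with that limit: 16 MiB is exactly the 16777216 bytes in the log, and the ListContainers response was roughly 70 KB over it. A quick arithmetic check:

```shell
# Default gRPC max message size used by kubelet: 16 MiB
echo $((16 * 1024 * 1024))      # 16777216 bytes

# How far the ListContainers response from the log exceeded the limit:
echo $((16848940 - 16777216))   # 71724 bytes over
```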
Restarting kubelet, or even rebooting the node, does not fix this issue, because the containers whose accumulated metadata exceeds the limit are still present and are not removed.
To work around this issue, manually remove unused containers to reduce the total metadata size. Follow these steps:
ubuntu@jumpbox:~$ bosh -d service-instance_*** ssh worker/0
...
worker/180bad63-e295-4d0a-b410-cd51c39cc253:~$ sudo -i
worker/180bad63-e295-4d0a-b410-cd51c39cc253:~# /var/vcap/packages/docker/bin/docker system prune
WARNING! This will remove:
  - all stopped containers
  - all networks not used by at least one container
  - all dangling images
  - all dangling build cache
Are you sure you want to continue? [y/N] y
Deleted Containers:
25adab361e569378a099d544494c9f94fea1fd3c3252aa075a60bfecfe6cb441
...
f613fb850d8ab35dce007120bc0790950b9ace922146d61749569874b8fed89a

Deleted Images:
untagged: registry.tkg.vmware.run/pause@sha256:c2fb43c43c279b1103dae3523b5817dfb48fc9c6001170401447b3da722973a0
deleted: sha256:84ee8339528d282608cc5923a6093277d68478f5137932ed40e6b25480f91f2a
...
deleted: sha256:74ed84873cc71d35e5cf83208bec3365573619b88976742e5096ebfacca89870

Total reclaimed space: 12.45GB

worker/180bad63-e295-4d0a-b410-cd51c39cc253:~# monit restart docker && monit restart kubelet
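After docker and kubelet have been restarted, the node should return to Ready within a few minutes. A minimal sketch of checking this from kubectl output (the node name and version below are made-up sample output so the snippet is self-contained; on a real cluster, pipe `kubectl get nodes --no-headers` into the same awk filter instead):

```shell
# Sample `kubectl get nodes --no-headers` output (hypothetical node):
nodes='180bad63-e295-4d0a-b410-cd51c39cc253   Ready   <none>   212d   v1.17.5'

# Count nodes whose STATUS column is not "Ready"; 0 means all recovered:
echo "$nodes" | awk '$2 != "Ready" {n++} END {print n+0}'
```

If the count stays above 0, re-check the kubelet log on the remaining nodes for the same ResourceExhausted errors.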