Control Plane rollout freezes during Guest Cluster upgrade or VM Class change

Article ID: 431501

Products

Tanzu Kubernetes Runtime

Issue/Introduction

A control plane rollout freezes during a Guest Cluster upgrade or VM Class change because the newly deployed control plane nodes are missing their labels and taints.
The cloud-init.log of the stuck control plane node shows a timeout when attempting to patch the Node API object:
[YYYY-MM-DD hh:mm:ss] I#### hh:mm:ss #### patchnode.go:##] [patchnode] Uploading the CRI Socket information "unix:///var/run/containerd/containerd.sock" to the Node API object "<Guest-Cluster-Control-Plane-name>" as an annotation
[YYYY-MM-DD hh:mm:ss] error execution phase kubelet-wait-bootstrap: error writing CRISocket for this node: error patching Node "<Guest-Cluster-Control-Plane-name>": Patch "https://<Control-Plane-IP-address>:6443/api/v1/nodes/<Guest-Cluster-Control-Plane-name>?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
The following label and taint are missing from the newly deployed control plane:
Labels: node-role.kubernetes.io/master=
Taints: node-role.kubernetes.io/master:NoSchedule
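The missing label and taint can be checked from a shell with kubectl. The following is a minimal sketch, not part of the original procedure: the function names are hypothetical, and it assumes kubectl can still read from the API server, which may not be the case while the API server is rejecting requests.

```shell
# Hypothetical helpers; <node-name> is the control plane node to inspect.

# Succeeds (exit 0) when the node carries the label named in this article.
node_has_master_label() {
  kubectl get node "$1" --show-labels --no-headers \
    | grep -q 'node-role.kubernetes.io/master='
}

# Succeeds (exit 0) when the node carries the matching NoSchedule taint.
node_has_master_taint() {
  kubectl get node "$1" \
    -o jsonpath='{range .spec.taints[*]}{.key}:{.effect}{"\n"}{end}' \
    | grep -qx 'node-role.kubernetes.io/master:NoSchedule'
}

# Example use:
#   node_has_master_label <node-name> || echo "label missing"
#   node_has_master_taint <node-name> || echo "taint missing"
```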
The API server stops accepting requests, and the kube-apiserver and ETCD pod logs report database space exceeded errors and a NOSPACE alarm.
/var/log/pods/kube-system_kube-apiserver-######/kube-apiserver/#.log contains the following error:
YYYY-MM-DD hh:mm:ssZ stderr F E#### hh:mm:ss ## status.go:##] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0x#, desc:\"etcdserver: mvcc: database space exceeded\"}: etcdserver: mvcc: database space exceeded" logger="UnhandledError"
/var/log/pods/kube-system_etcd-######/etcd/#.log contains the following warning:
YYYY-MM-DD hh:mm:ssZ stderr F {"level":"warn","ts":"YYYY-MM-DD hh:mm:ssZ","caller":"etcdserver/util.go:###","msg":"apply request took too long","took":"ss.####ms","expected-duration":"100ms","prefix":"","request":"header:<ID:######## username:\"kube-apiserver-etcd-client\" auth_revision:1 > alarm:<action:ACTIVATE memberID:######## alarm:NOSPACE > ","response":"size:##"}
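Besides reading the pod logs, the alarm and database size can be queried directly with etcdctl. This is a sketch under assumptions: the function names are hypothetical, the certificate paths are the ones used by the etcdctl command in the Resolution section, and the endpoint is a placeholder that must be replaced.

```shell
# Connection settings via etcdctl's ETCDCTL_* environment variables.
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
export ETCDCTL_ENDPOINTS='https://<control-plane-ip>:2379'  # placeholder, replace before use

# Succeeds (exit 0) when any member reports an active NOSPACE alarm.
etcd_nospace_alarm_active() {
  etcdctl alarm list | grep -q 'alarm:NOSPACE'
}

# Prints a status table per endpoint, including the DB SIZE column.
etcd_db_size() {
  etcdctl endpoint status --write-out=table
}

# Example use:
#   etcd_nospace_alarm_active && echo "NOSPACE alarm is active"
#   etcd_db_size
```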

Environment

VMware vSphere with Tanzu

Cause

The ETCD database is full, so the API server cannot accept requests and returns a database space exceeded error. The space is consumed by aquasecurity objects created by trivy-system, which fill the ETCD database rapidly and trigger the NOSPACE alarm.

Resolution

Verify the objects consuming ETCD database space by executing the following command on the control plane:
etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key --endpoints https://<control-plane-ip>:2379 get /registry --prefix --keys-only | grep -v ^$ | awk -F '/' '{ h[$3]++ } END {for (k in h) print h[k], k}' | sort -nr
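The command above counts keys by the third path segment, which for custom resources is the API group. To see which resource type inside a group dominates, the count can be extended by one more segment. A minimal sketch, assuming custom resource keys follow the /registry/<group>/<resource>/<namespace>/<name> layout; for built-in resources the second segment printed is the namespace instead. The function name is hypothetical.

```shell
# Reads ETCD keys on stdin and prints "<count> <segment3>/<segment4>",
# highest count first. Blank lines from --keys-only output are dropped.
count_keys_by_prefix() {
  grep -v '^$' \
    | awk -F '/' '{ h[$3 "/" $4]++ } END { for (k in h) print h[k], k }' \
    | sort -nr
}

# Example use (same etcdctl flags as in the command above):
#   etcdctl <flags> get /registry --prefix --keys-only | count_keys_by_prefix | head
```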
Once identified, remove the excessive aquasecurity objects generated by trivy-system to free up database space.
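One possible way to count and remove these objects with kubectl is sketched below. It rests on assumptions not stated in this article: that the objects belong to CRDs in the aquasecurity.github.io API group, that kubectl can reach the API server, and that nothing else depends on the reports. The function names are hypothetical. Review the counts before deleting anything.

```shell
# Lists CRD-backed resource types in the (assumed) aquasecurity.github.io group.
list_aqua_resources() {
  kubectl get crd -o name \
    | sed 's|^customresourcedefinition.apiextensions.k8s.io/||' \
    | grep '\.aquasecurity\.github\.io$'
}

# Prints "<count> <resource>" per resource type, highest count first.
count_aqua_objects() {
  for res in $(list_aqua_resources); do
    n=$(kubectl get "$res" --all-namespaces --no-headers 2>/dev/null | wc -l | tr -d ' ')
    echo "$n $res"
  done | sort -nr
}

# Deletes every object of one resource type without waiting for finalizers.
delete_aqua_objects() {
  kubectl delete "$1" --all --all-namespaces --wait=false
}

# Example use:
#   count_aqua_objects
#   delete_aqua_objects <resource>.aquasecurity.github.io
```

If the component that generates the objects keeps running, it may recreate them, so the counts are worth rechecking after deletion.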
Compact and defragment the ETCD database to restore API server functionality, then complete the upgrade. Follow the steps in this KB article:
Compact and defragment the ETCD DB
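The referenced KB is the authoritative procedure. For orientation only, the general sequence described in the upstream etcd maintenance documentation (compact to the current revision, defragment, disarm the alarm) can be sketched as follows. The function names are hypothetical, and the sketch assumes etcdctl reads its endpoint and certificates from ETCDCTL_* environment variables.

```shell
# Assumed connection settings; paths match the etcdctl command in this article.
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
export ETCDCTL_ENDPOINTS='https://<control-plane-ip>:2379'  # placeholder, replace before use

# Reads `etcdctl endpoint status --write-out=json` on stdin and prints
# the first revision number found.
extract_revision() {
  grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}

# Compacts history up to the current revision, defragments, then clears
# the NOSPACE alarm. Stops at the first failing step.
compact_and_defrag() {
  rev=$(etcdctl endpoint status --write-out=json | extract_revision)
  [ -n "$rev" ] || { echo "could not read etcd revision" >&2; return 1; }
  etcdctl compact "$rev" &&
  etcdctl defrag &&
  etcdctl alarm disarm
}

# Example use:
#   compact_and_defrag
```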
