Control Plane rollout freezes during Guest Cluster upgrade or VM Class change

Article ID: 431501

Products

Tanzu Kubernetes Runtime

Issue/Introduction

  • The control plane rollout freezes during a Guest Cluster upgrade or VM Class change, and the newly deployed control plane nodes are missing their expected labels and taints.

  • The cloud-init.log of the stuck control plane indicates a timeout when attempting to patch the node API object:
    [YYYY-MM-DD hh:mm:ss] I#### hh:mm:ss #### patchnode.go:##] [patchnode] Uploading the CRI Socket information "unix:///var/run/containerd/containerd.sock" to the Node API object "<Guest-Cluster-Control-Plane-name>" as an annotation
    [YYYY-MM-DD hh:mm:ss] error execution phase kubelet-wait-bootstrap: error writing CRISocket for this node: error patching Node "<Guest-Cluster-Control-Plane-name>": Patch "https://<Control-Plane-IP-address>:6443/api/v1/nodes/<Guest-Cluster-Control-Plane-name>?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

  • The following label and taint are missing from the newly deployed control plane:
    Labels: node-role.kubernetes.io/master=
    Taints: node-role.kubernetes.io/master:NoSchedule

  • The API server stops accepting requests, and ETCD pod logs report a NOSPACE alarm and database space exceeded errors.
  • /var/log/pods/kube-system_kube-apiserver-######/kube-apiserver/#.log contains the following error:
    YYYY-MM-DD hh:mm:ssZ stderr F E#### hh:mm:ss  ## status.go:##] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: rpctypes.EtcdError{code:0x#, desc:\"etcdserver: mvcc: database space exceeded\"}: etcdserver: mvcc: database space exceeded" logger="UnhandledError"
  • /var/log/pods/kube-system_etcd-######/etcd/#.log contains the following error:
    YYYY-MM-DD hh:mm:ssZ stderr F {"level":"warn","ts":"YYYY-MM-DD hh:mm:ssZ","caller":"etcdserver/util.go:###","msg":"apply request took too long","took":"ss.####ms","expected-duration":"100ms","prefix":"","request":"header:<ID:######## username:\"kube-apiserver-etcd-client\" auth_revision:1 > alarm:<action:ACTIVATE memberID:######## alarm:NOSPACE > ","response":"size:##"}
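To confirm these signatures quickly on a control plane node, the pod logs can be searched for the two messages quoted above. The following is a minimal sketch; the function name find_nospace_logs is hypothetical and not part of any VMware tooling, and the search patterns are taken directly from the log excerpts in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the log files that contain either of the
# etcd out-of-space signatures quoted in this article.
# Arguments: one or more files or directories (directories are searched recursively).
find_nospace_logs() {
  grep -rlE 'mvcc: database space exceeded|alarm:NOSPACE' "$@"
}
```

A typical invocation would be `find_nospace_logs /var/log/pods`. If etcdctl is reachable, `etcdctl alarm list` with the same certificate and endpoint options used in the Resolution section should also report the NOSPACE alarm.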

Environment

  • VMware vSphere with Tanzu

Cause

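The ETCD database is full: it has reached its storage quota, which raises the NOSPACE alarm. While the alarm is active, ETCD rejects writes with a "database space exceeded" error, so the API server cannot complete requests such as the Node patch that applies the control plane label and taint. The database was filled by aquasecurity objects that trivy-system creates in large numbers.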
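To see which aquasecurity resource types account for the growth, the key listing can be broken down one level further than the API group. This is a sketch only: the function name count_group_resources is hypothetical, the default group name aquasecurity.github.io is an assumption about how the Trivy objects are registered, and the key layout /registry/<group>/<resource>/<namespace>/<name> is assumed for custom resources.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count etcd keys per resource type within one API group.
# Reads key names (one per line) on stdin, in the form printed by
# `etcdctl get /registry --prefix --keys-only`.
# $1: API group to inspect (the default is an assumed name for the Trivy group).
count_group_resources() {
  awk -F '/' -v g="${1:-aquasecurity.github.io}" \
    '$3 == g { n[$4]++ } END { for (k in n) print n[k], k }' | sort -nr
}
```

The etcdctl listing command from the Resolution section can be piped into this function in place of its grep, awk, and sort stages. The output lists one line per resource type, largest count first.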

Resolution

  1. Verify the objects consuming ETCD database space by executing the following command on the control plane:
    etcdctl --cacert /etc/kubernetes/pki/etcd/ca.crt --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key --endpoints https://<control-plane-ip>:2379 get /registry --prefix --keys-only | grep -v ^$ | awk -F '/' '{ h[$3]++ } END {for (k in h) print h[k], k}' | sort -nr

  2. Once identified, remove the excess aquasecurity objects generated by trivy-system to free up database space.
  3. Compact and defragment the ETCD database to restore API server functionality, then allow the upgrade to complete. The procedure is described in the following KB article:
    Compact and defragment the ETCD DB
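
For orientation, step 3 generally follows the sequence described in the upstream etcd maintenance documentation: read the current revision, compact up to it, defragment, then disarm the alarm. The sketch below is not a replacement for the linked KB. The function names and the ETCDCTL_FLAGS variable are hypothetical; ETCDCTL_FLAGS is assumed to hold the --cacert, --cert, --key, and --endpoints options from step 1.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the current revision from the JSON printed by
# `etcdctl endpoint status --write-out=json` (read from stdin).
extract_revision() {
  grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}

# Sketch of the upstream maintenance sequence. Defined only; nothing runs
# until the function is called explicitly.
# ETCDCTL_FLAGS is an assumed variable and is left unquoted on purpose so
# that it splits into separate options.
compact_and_defrag() {
  rev=$(etcdctl $ETCDCTL_FLAGS endpoint status --write-out=json | extract_revision) &&
  etcdctl $ETCDCTL_FLAGS compact "$rev" &&   # discard history older than this revision
  etcdctl $ETCDCTL_FLAGS defrag &&           # release the freed space back to the filesystem
  etcdctl $ETCDCTL_FLAGS alarm disarm        # clear the NOSPACE alarm
}
```

Each command is chained with && so that a failure stops the sequence before the alarm is disarmed.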