Guest Cluster provisioning gets stuck because the Carvel apps aren't deployed inside the Guest Cluster control plane node.
Article ID: 405920

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • The Guest Cluster deployment gets stuck at the stage where only one of the three (or more) control plane nodes gets provisioned.

  • When logged in to the provisioned control plane node, you see that all the core components are present (kube-apiserver, etcd, CoreDNS, etc.). However, the core VMware packages (antrea, kapp-controller, secretgen-controller, etc.) and their respective components are missing.

  • The command "k get pkgi -A" returns "No resource found". On top of this, the vmware-system-tkg/tkg-system namespace inside the node is missing.

  • Because the CNI inside the node is missing, the control plane node is unable to communicate its readiness to the Cluster API even though etcd and the kube-apiserver are up. As a result, CAPI declares the control plane node unhealthy and does not spin up the remaining control plane nodes.

  • The tkg-controller on the Supervisor cluster is unable to find the ClusterBootstrap object for the affected cluster.

      E0730 1 controller.go:329]  "msg"="Reconciler error" "error"="Failed to update auth service addon config: Error updating corresponding clusterbootstrap resource: failed to find ClusterBootstrap resource owned by cluster <namespace>/<cluster-name>" "Cluster"={"name":"<cluster-name>","namespace":"<namespace>"}
      "controller"="cluster" "controllerGroup"="cluster.x-k8s.io""controllerKind"="Cluster" "name"="<cluster-name>" "namespace"="<namespace>" "reconcileID"="<ID>"


  • On checking further, you notice that the ClusterBootstrap object is indeed missing for the affected cluster. The command "kubectl get clusterbootstrap -n <namespace where the cluster is deployed>" returns nothing for the concerned Guest Cluster.

  • The Add-on (ClusterBootstrap) controller is unable to fetch the Antrea package details from the concerned API resource.

      E0731  1 clusterbootstrapclone.go:621] ClusterBootstrapController "msg"="unable to fetch Package.Spec.RefName or Package.Spec.Version from Package ns-sharedservice/antrea.tanzu.vmware.com.1.13.3+vmware.3-tkg.1-vmware" "error"="no matches for kind \"Package\" in version \"data.packaging.carvel.dev/v1alpha1\""
      E0731  1 clusterbootstrapclone.go:564] ClusterBootstrapController "msg"="unable to clone secrets or providers" "error"="no matches for kind \"Package\" in version \"data.packaging.carvel.dev/v1alpha1\""
      E0731  1 controller.go:329]  "msg"="Reconciler error" "error"="no matches for kind \"Package\" in version \"data.packaging.carvel.dev/v1alpha1\"" "Cluster"={"name":"<cluster-name>","namespace":"<namespace>"} "controller"="cluster" "controllerGroup"="cluster.x-k8s.io" "controllerKind"="Cluster" "name"="<cluster-name>" "namespace"="<namespace>" "reconcileID"="<ID>"
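
  • The Add-on controller logs can be checked the same way; the namespace and deployment name below are placeholders to be substituted from your environment:

      kubectl logs deploy/<addons-controller-deployment> -n <addons-controller-namespace>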

Environment

VMware vSphere Kubernetes Service

Cause

The certificate of the packaging APIService has expired; in this case, the APIService is "v1alpha1.data.packaging.carvel.dev". The issuer of these certificates is the kapp-controller on the Supervisor cluster. To confirm this, run the command below and check the certificate's validity dates.

     kubectl get apiservice v1alpha1.data.packaging.carvel.dev -o jsonpath='{.spec.caBundle}' | base64 -d | openssl x509 -text -noout
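
To print only the validity window of the certificate, the same output can be filtered with openssl's -dates option, for example:

     kubectl get apiservice v1alpha1.data.packaging.carvel.dev -o jsonpath='{.spec.caBundle}' | base64 -d | openssl x509 -noout -dates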

The important point to note here is that the kapp-controller APIService does not use cert-manager to create and renew its certificate.

Resolution

Broadcom engineering is working on automatic renewal of the kapp-controller APIService certificate. Meanwhile, to regenerate the kapp-controller APIService certificate, follow the steps below.

  • Connect to the Supervisor cluster.
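
       For example, if the kubectl vSphere plugin is installed on your workstation, you can log in with the command below (server endpoint and username are environment-specific placeholders):

       # placeholder values for illustration; use your Supervisor endpoint and SSO user
       kubectl vsphere login --server=<supervisor-endpoint> --vsphere-username <username>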

  • Perform a rollout restart of the cert-manager deployment using the command below:

       kubectl rollout restart deploy -n <cert-manager-namespace>
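
       If the cert-manager namespace is not known beforehand, the cert-manager deployments can be located first, for example:

       kubectl get deploy -A | grep cert-manager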

  • Confirm that all cert-manager pods are Running before proceeding, as shown below. Once cert-manager is running again, it should automatically renew the expired certificate of the kapp-controller APIService in the Supervisor cluster. However, the kapp-controller pods have to be restarted in order to pick up the renewed certificate.
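
       For example:

       kubectl get pods -n <cert-manager-namespace>

    To confirm the renewal, re-run the openssl command from the Cause section; the certificate should now show a future expiry date.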

  • To complete the certificate renewal, delete the kapp-controller pod using the command below.

       kubectl delete pod <kapp-controller-pod> -n <kapp-controller namespace>
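
       Alternatively, a rollout restart of the kapp-controller deployment should achieve the same effect:

       kubectl rollout restart deploy <kapp-controller-deployment> -n <kapp-controller namespace>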

  • Check that the kapp-controller pod is Running again after the restart. Use the command below to confirm:

       kubectl get pods -n <kapp-controller namespace>
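
  • As an additional verification, confirm that the packaging APIService on the Supervisor cluster now reports Available as True, for example:

       kubectl get apiservice v1alpha1.data.packaging.carvel.dev

    Once the APIService is available again, the Add-on controller should be able to create the missing ClusterBootstrap object, and Guest Cluster provisioning should proceed.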

Additional Information

More details on cert-manager pod issues can be found in the KB article here: https://knowledge.broadcom.com/external/article/390661