Unable to create TKC clusters



Article ID: 385432


Products

Tanzu Kubernetes Runtime
vSphere with Tanzu

Issue/Introduction

  • TKC cluster deployments can hang when the certificate for the v1alpha1.data.packaging.carvel.dev APIService has expired.
  • The deployment remains stuck in the creating state with the following message:

Waiting for control plane to pass preflight checks to continue reconciliation: [machine <name> does not have APIServerPodHealthy condition, machine <name> does not have ControllerManagerPodHealthy condition, machine <name> does not have SchedulerPodHealthy condition, machine <name> does not have EtcdPodHealthy condition, machine <name> does not have EtcdMemberHealthy condition]

  • Running kubectl get machines -n <namespace> shows that the machines are deployed but have no node name assigned.
  • The cluster is still waiting for kapp-controller to deploy the packages. The nodes are marked as Ready only after those packages are installed.
  • After connecting over SSH to a control plane node of the guest cluster, the nodes are shown in the NotReady state:
root@atlp-platform-kpdc-dev2-vz9dt-qrr4p:~# kubectl get nodes
NAME                                 STATUS     ROLES           AGE   VERSION
<cluster name>-1-2c42d-849c649pv67   NotReady   <none>          27m   v1.24.9+vmware.1
<cluster name>-1-2c42d-849c64f96lm   NotReady   <none>          27m   v1.24.9+vmware.1
<cluster name>-1-2c42d-849c64k8998   NotReady   <none>          27m   v1.24.9+vmware.1
<cluster name>-vz9dt-qrr4p           NotReady   control-plane   29m   v1.24.9+vmware.1
 
 
  • None of the addon packages have been installed, so the nodes never move into the Ready state:
 
root@<control plane>:~# kubectl get pods -A
NAMESPACE     NAME                                      READY   STATUS    RESTARTS   AGE
kube-system   coredns-85949b5d7-w5fbl                   0/1     Pending   0          20m
kube-system   coredns-d7b8988cf-bdswc                   0/1     Pending   0          20m
kube-system   coredns-d7b8988cf-tgt7r                   0/1     Pending   0          20m
kube-system   docker-registry-<cluster name>-<pod id>   1/1     Running   0          17m
kube-system   docker-registry-<cluster name>-<pod id>   1/1     Running   0          17m
kube-system   docker-registry-<cluster name>-<pod id>   1/1     Running   0          17m
kube-system   docker-registry-<cluster name>            1/1     Running   0          20m
kube-system   etcd-<cluster name>                       1/1     Running   0          20m
kube-system   kube-apiserver-<cluster name>             1/1     Running   0          20m
kube-system   kube-controller-manager-<cluster name>    1/1     Running   0          20m
kube-system   kube-proxy-dq25v                          1/1     Running   0          18m
kube-system   kube-proxy-kbjhw                          1/1     Running   0          18m
kube-system   kube-proxy-ltxzg                          1/1     Running   0          18m
kube-system   kube-proxy-nvgx4                          1/1     Running   0          20m
kube-system   kube-scheduler-<cluster name>             1/1     Running   0          20m



  • The ClusterBootstrapController logs show that the Package resources cannot be fetched:

yyyy-01-06T10:28:38.785924846Z stderr F I0106 10:28:38.785657 1 clusterbootstrap_controller.go:170] ClusterBootstrapController "msg"="Reconciling cluster" "cluster-name"="<cluster name>" "cluster-ns"="<cluster namespace>"
yyyy-01-06T10:28:38.785954184Z stderr F I0106 10:28:38.785705 1 clusterbootstrap_controller.go:389] ClusterBootstrapController "msg"="ClusterBootstrap for cluster kpdc-atlp-tkg-dev2/atlp-platform-kpdc-dev2 does not exist, creating from template vmware-system-tkg/v1.24.9---vmware.1-tkg.4" "cluster-name"="<cluster name>" "cluster-ns"="<cluster namespace>"
yyyy-01-06T10:28:38.790990703Z stderr F E0106 10:28:38.790869 1 clusterbootstrapclone.go:491] ClusterBootstrapController "msg"="unable to fetch Package.Spec.RefName or Package.Spec.Version from Package kpdc-atlp-tkg-dev2/antrea.tanzu.vmware.com.1.7.2+vmware.1-tkg.1-advanced" "error"="the server is currently unable to handle the request (get packages.data.packaging.carvel.dev antrea.tanzu.vmware.com.1.7.2+vmware.1-tkg.1-advanced)"
yyyy-01-06T10:28:38.791013803Z stderr F E0106 10:28:38.790888 1 clusterbootstrapclone.go:434] ClusterBootstrapController "msg"="unable to clone secrets or providers" "error"="the server is currently unable to handle the request (get packages.data.packaging.carvel.dev antrea.tanzu.vmware.com.1.7.2+vmware.1-tkg.1-advanced)"
yyyy-01-06T10:28:38.791019474Z stderr F E0106 10:28:38.790922 1 controller.go:317] controller/cluster "msg"="Reconciler error" "error"="the server is currently unable to handle the request (get packages.data.packaging.carvel.dev antrea.tanzu.vmware.com.1.7.2+vmware.1-tkg.1-advanced)" "name"="<cluster name>" "namespace"="<cluster namespace>" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Cluster"
 
 
  • These errors show that the ClusterBootstrap object, which is responsible for installing the addon packages on the TKC, has not been created because the Package resources (packages.data.packaging.carvel.dev) could not be fetched.
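
The machine check described in this section can be narrowed to only the affected machines. The following is a minimal sketch, assuming kubectl is pointed at the cluster that holds the Cluster API Machine objects and that each Machine reports its node in .status.nodeRef.name. The function name machines_without_node is hypothetical and not part of any VMware tooling.

```shell
# Sketch: list Machines in a namespace that have no node assigned yet.
# Assumption: an unassigned Machine has an empty .status.nodeRef.name.

# machines_without_node NAMESPACE
# Prints the name of every Machine whose node reference is empty.
machines_without_node() {
  kubectl get machines -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeRef.name}{"\n"}{end}' |
    awk -F'\t' '$1 != "" && $2 == "" { print $1 }'
}

# Example call (the namespace is a placeholder):
#   machines_without_node <namespace>
```

An empty result would suggest that every machine already has a node, in which case the symptoms in this article probably do not apply.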


Environment

vSphere with Tanzu

Cause

  • The API server logs show the following errors:
     
    yyyy-01-02T14:58:28.959901427Z stderr F E0102 14:58:28.959734       1 controller.go:113] loading OpenAPI spec for "v1alpha1.data.packaging.carvel.dev" failed with: Error, could not get list of group versions for APIService
    yyyy-01-02T14:58:28.960514458Z stderr F I0102 14:58:28.960106       1 controller.go:126] OpenAPI AggregationController: action for item v1alpha1.data.packaging.carvel.dev: Rate Limited Requeue.
    yyyy-01-02T14:58:28.960537707Z stderr F E0102 14:58:28.960470       1 controller.go:116] loading OpenAPI spec for "v1alpha1.data.packaging.carvel.dev" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: error trying to reach service: x509: certificate has expired or is not yet valid: current time 2025-01-02T14:58:28Z is after 2024-09-10T08:27:51Z
    yyyy-01-02T14:58:28.96055719Z stderr F , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
     
     
  • Check the state of the kapp-controller pod. It may have been running for more than a year:
    vmware-system-appplatform-operator-system   kapp-controller-5b75b7566d-5fcqt                                  2/2     Running            0                 483d
     
     
     
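  • Run the following command to inspect the certificate's validity period. The certificate has expired if the Not After date is earlier than the current time: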
     
    kubectl get apiservice v1alpha1.data.packaging.carvel.dev -o jsonpath='{.spec.caBundle}' | base64 -d | openssl x509 -text -noout
     
    Serial Number: 2 (0x2)
        Signature Algorithm: sha256WithRSAEncryption
            Issuer: CN=kapp-controller-ca@1736330980
            Validity
                Not Before: Jan  8 09:09:40 2025 GMT
                Not After : Jan  8 09:09:40 2026 GMT
            Subject: CN=kapp-controller@1736330980
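
The checks in this section can also be scripted instead of reading the dates by eye. The following is a minimal sketch, assuming kubectl targets the same cluster as the command above and that base64 and openssl are installed. The function names are hypothetical. The sketch relies on openssl x509 -checkend 0, which exits non-zero when the certificate has already expired.

```shell
# Sketch: report the state of the APIService and of its caBundle certificate.
APISERVICE="v1alpha1.data.packaging.carvel.dev"

# apiservice_available NAME
# Prints the status (True/False/Unknown) of the APIService "Available" condition.
apiservice_available() {
  kubectl get apiservice "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
}

# apiservice_cert_status NAME
# Prints "valid", "expired" or "unknown" for the first certificate in .spec.caBundle.
# Note: openssl also exits non-zero when the data cannot be parsed, so "expired"
# should be confirmed with the -text output shown in this article.
apiservice_cert_status() {
  bundle="$(kubectl get apiservice "$1" -o jsonpath='{.spec.caBundle}')" ||
    { echo "unknown"; return 1; }
  [ -n "$bundle" ] || { echo "unknown"; return 1; }
  if printf '%s' "$bundle" | base64 -d |
       openssl x509 -noout -checkend 0 >/dev/null 2>&1; then
    echo "valid"
  else
    echo "expired"
  fi
}

# Example calls:
#   apiservice_available "$APISERVICE"
#   apiservice_cert_status "$APISERVICE"
```

An Available status of False together with an expired result would be consistent with the 503 errors in the API server logs.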
     
     

Resolution

  • Restart the kapp-controller pod to renew the certificate. List the pods, then delete the kapp-controller pod by name:
    kubectl -n vmware-system-appplatform-operator-system get pods
    kubectl -n vmware-system-appplatform-operator-system delete pod kapp-controller-XXX
     
     
  • Confirm that the certificate has been renewed after the restart:
    kubectl get apiservice v1alpha1.data.packaging.carvel.dev -o jsonpath='{.spec.caBundle}' | base64 -d | openssl x509 -text -noout
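
The restart and a follow-up availability check can be combined into one helper. The following is a minimal sketch, assuming the pod name starts with kapp-controller- as in the output shown earlier in this article. The function names and the 300 second timeout are hypothetical choices, not values taken from VMware documentation.

```shell
# Sketch: restart kapp-controller and wait for the APIService to become Available.
NS="vmware-system-appplatform-operator-system"
APISERVICE="v1alpha1.data.packaging.carvel.dev"

# kapp_controller_pods
# Prints pod/<name> for every pod in $NS whose name starts with kapp-controller-.
kapp_controller_pods() {
  kubectl -n "$NS" get pods -o name | grep '^pod/kapp-controller-'
}

# restart_kapp_controller
# Deletes the kapp-controller pod(s) so they are recreated, then waits for the
# APIService "Available" condition. The timeout is an arbitrary assumption.
restart_kapp_controller() {
  pods="$(kapp_controller_pods)"
  [ -n "$pods" ] || { echo "no kapp-controller pod found in $NS" >&2; return 1; }
  # Word splitting is intended here: one argument per pod.
  kubectl -n "$NS" delete $pods || return 1
  kubectl wait --for=condition=Available "apiservice/$APISERVICE" --timeout=300s
}

# Example call:
#   restart_kapp_controller
```

If the wait times out, the certificate dates can be inspected again with the openssl command above before retrying the cluster deployment.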