Management cluster Upgrade from 1.25.x to 1.26.x failure in TCA

Article ID: 376830

Products

Tanzu Kubernetes Grid, VMware Tanzu Kubernetes Grid 1.x, VMware Tanzu Kubernetes Grid Management, VMware Tanzu Kubernetes Grid Plus, VMware Tanzu Kubernetes Grid Plus 1.x

Issue/Introduction

During the upgrade, the control plane and worker nodes are upgraded to 1.26.x and roll out successfully, but the upgrade then fails while upgrading the core packages.

Checking the package status shows several TKG packages in the Reconcile failed state:

========
kubectl get pkgi -A
NAMESPACE    NAME                            PACKAGE NAME                                PACKAGE VERSION                  DESCRIPTION                                                            AGE
tca-system   istio                           istio.telco.vmware.com                      2.3.0-23226971                   Reconcile failed: Error (see .status.usefulErrorMessage for details)   49d
tca-system   nodeconfig-operator             nodeconfig-operator.telco.vmware.com        2.3.0-23477874                   Reconcile succeeded                                                    49d
tca-system   tca-diagnosis-operator          tca-diagnosis-operator.telco.vmware.com     2.3.0-23573090                   Reconcile succeeded                                                    49d
tca-system   tca-kubecluster-operator        tca-kubecluster-operator.telco.vmware.com   2.3.0-23437142                   Reconciling                                                            49d
tca-system   test-controller                 test-controller.telco.vmware.com            2.3.0-22712580                   Reconcile succeeded                                                    49d
tca-system   vmconfig-operator               vmconfig-operator.telco.vmware.com          2.3.0-23473338                   Reconcile succeeded                                                    49d
tkg-system   mgmt-antrea                     antrea.tanzu.vmware.com                     1.11.2+vmware.1-tkg.1-advanced   Reconcile succeeded                                                    49d
tkg-system   mgmt-capabilities               capabilities.tanzu.vmware.com               0.30.2+vmware.1                  Reconcile succeeded                                                    49d
tkg-system   mgmt-metrics-server             metrics-server.tanzu.vmware.com             0.6.2+vmware.1-tkg.3             Reconcile succeeded                                                    49d
tkg-system   mgmt-pinniped                   pinniped.tanzu.vmware.com                   0.24.0+vmware.1-tkg.1            Reconcile succeeded                                                    49d
tkg-system   mgmt-secretgen-controller       secretgen-controller.tanzu.vmware.com       0.14.2+vmware.2-tkg.2            Reconcile succeeded                                                    49d
tkg-system   mgmt-tkg-storageclass           tkg-storageclass.tanzu.vmware.com           0.30.2+vmware.1                  Reconcile succeeded                                                    49d
tkg-system   mgmt-vsphere-cpi                vsphere-cpi.tanzu.vmware.com                1.26.2+vmware.1-tkg.1            Reconcile succeeded                                                    49d
tkg-system   mgmt-vsphere-csi                vsphere-csi.tanzu.vmware.com                3.0.1+vmware.4-tkg.1             Reconcile succeeded                                                    49d
tkg-system   tanzu-addons-manager            addons-manager.tanzu.vmware.com             0.30.2+vmware.1                  Reconcile succeeded                                                    49d
tkg-system   tanzu-auth                      tanzu-auth.tanzu.vmware.com                 0.30.2+vmware.1                  Reconcile succeeded                                                    49d
tkg-system   tanzu-cliplugins                cliplugins.tanzu.vmware.com                 0.30.2+vmware.1                  Reconcile succeeded                                                    49d
tkg-system   tanzu-core-management-plugins   core-management-plugins.tanzu.vmware.com    0.30.2+vmware.1                  Reconcile succeeded                                                    49d
tkg-system   tanzu-featuregates              featuregates.tanzu.vmware.com               0.30.2+vmware.1                  Reconcile succeeded                                                    49d
tkg-system   tanzu-framework                 framework.tanzu.vmware.com                  0.30.2+vmware.1                  Reconcile succeeded                                                    49d
tkg-system   tkg-clusterclass                tkg-clusterclass.tanzu.vmware.com           0.30.2+vmware.1                  Reconcile failed: Error (see .status.usefulErrorMessage for details)   49d
tkg-system   tkg-clusterclass-vsphere        tkg-clusterclass-vsphere.tanzu.vmware.com   0.30.2+vmware.1                  Reconcile failed: Error (see .status.usefulErrorMessage for details)   49d
tkg-system   tkg-pkg                         tkg.tanzu.vmware.com                        0.30.2+vmware.1                  Reconcile failed: Error (see .status.usefulErrorMessage for details)   49d
tkg-system   tkr-service                     tkr-service.tanzu.vmware.com                0.30.2+vmware.1                  Reconcile succeeded                                                    49d
tkg-system   tkr-source-controller           tkr-source-controller.tanzu.vmware.com      0.30.2+vmware.1                  Reconciling                                                            49d
tkg-system   tkr-vsphere-resolver            tkr-vsphere-resolver.tanzu.vmware.com       0.30.2+vmware.1  
========
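
On a cluster with many packages it can help to reduce the listing to the packages that are not healthy. The following is a minimal sketch, assuming healthy rows contain the text "Reconcile succeeded" exactly as in the output above; the function name unhealthy_packages is hypothetical.

```shell
# Hypothetical helper: reads "kubectl get pkgi -A" output on stdin and keeps
# the header row plus every row that does not say "Reconcile succeeded".
unhealthy_packages() {
  awk 'NR == 1 || !/Reconcile succeeded/'
}

# Usage (requires access to the management cluster):
#   kubectl get pkgi -A | unhealthy_packages
```

Rows in the Reconciling state are kept as well, since they have not finished successfully.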


Checking the App status of a failed package shows the error:

========
kubectl -n tkg-system get apps tkg-clusterclass -o yaml
apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
metadata:
  annotations:
    packaging.carvel.dev/package-ref-name: tkg-clusterclass.tanzu.vmware.com
    packaging.carvel.dev/package-version: 0.30.2+vmware.1
  creationTimestamp: "2024-07-02T15:48:40Z"
  finalizers:
  - finalizers.kapp-ctrl.k14s.io/delete
  generation: 4
  name: tkg-clusterclass
  namespace: tkg-system
<O/P Redacted>
    stderr: |-
      I0821 06:28:38.733639   29330 request.go:690] Waited for 1.046921765s due to client-side throttling, not priority and fairness, request: GET:https://100.64.0.1:443/apis/cli.tanzu.vmware.com/v1alpha1
      kapp: Error: waiting on reconcile packageinstall/tkg-clusterclass-vsphere (packaging.carvel.dev/v1alpha1) namespace: tkg-system:
        Finished unsuccessfully (Reconcile failed:  (message: I0821 06:26:29.178450   29288 request.go:690) Waited for 1.035634738s due to client-side throttling, not priority and fairness, request: GET:https://100.64.0.1:443/apis/storage.k8s.io/v1
      kapp: Error: Listing schema.GroupVersionResource{Group:"cluster.x-k8s.io", Version:"v1beta1", Resource:"clusterclasses"}, namespaced: true:
        Internal error occurred: error resolving resource))
    stdout: |-
      Target cluster 'https://100.64.0.1:443' (nodes: mgmt-s2hk6-grrgd, 3+)
      Changes
      Namespace   Name                                   Kind            Age  Op  Op st.  Wait to    Rs       Ri
      tkg-system  object-propagation-controller-manager  Deployment      49d  -   -       reconcile  ongoing  Waiting for 1 unavailable replicas
      ^           tkg-clusterclass-vsphere               PackageInstall  49d  -   -       reconcile  fail     Reconcile failed:  (message: I0821
                                                                                                              06:26:29.178450   29288
                                                                                                              request.go:690] Waited for
                                                                                                              1.035634738s due to client-side
                                                                                                              throttling, not priority and
                                                                                                              fairness, request:
                                                                                                              GET:https://100.64.0.1:443/apis/storage.k8s.io/v1
                                                                                                              kapp: Error: Listing
                                                                                                              schema.GroupVersionResource{Group:"cluster.x-k8s.io",
                                                                                                              Version:"v1beta1",
                                                                                                              Resource:"clusterclasses"},
                                                                                                              namespaced: true:
                                                                                                                Internal error occurred: error
                                                                                                              resolving resource)
      Op:      0 create, 0 delete, 0 update, 2 noop, 0 exists
      Wait to: 2 reconcile, 0 delete, 0 noop
      6:28:40AM: ---- applying 2 changes [0/2 done] ----
      6:28:40AM: noop packageinstall/tkg-clusterclass-vsphere (packaging.carvel.dev/v1alpha1) namespace: tkg-system
      6:28:40AM: noop deployment/object-propagation-controller-manager (apps/v1) namespace: tkg-system
      6:28:40AM: ---- waiting on 2 changes [0/2 done] ----
      6:28:40AM: fail: reconcile packageinstall/tkg-clusterclass-vsphere (packaging.carvel.dev/v1alpha1) namespace: tkg-system
      6:28:40AM:  ^ Reconcile failed:  (message: I0821 06:26:29.178450   29288 request.go:690] Waited for 1.035634738s due to client-side throttling, not priority and fairness, request: GET:https://100.64.0.1:443/apis/storage.k8s.io/v1
      kapp: Error: Listing schema.GroupVersionResource{Group:"cluster.x-k8s.io", Version:"v1beta1", Resource:"clusterclasses"}, namespaced: true:
        Internal error occurred: error resolving resource)
      Deleted 1 older app changes
    updatedAt: "2024-08-21T06:28:41Z"
  fetch:
    exitCode: 0
    startedAt: "2024-08-21T06:28:37Z"
    stdout: |
      apiVersion: vendir.k14s.io/v1alpha1
      directories:
      - contents:
        - imgpkgBundle:
            image: airgap4.inspur.com/registry/packages/management/repo@sha256:2f4d7078d01c9cbca3b6e8ca67b1661a9d00ee27e5330ebd4ab6f7737b531507
          path: .
        path: "0"
      kind: LockConfig
    updatedAt: "2024-08-21T06:28:37Z"
  friendlyDescription: 'Reconcile failed: Deploying: Error (see .status.usefulErrorMessage
    for details)'
  observedGeneration: 4
  template:
    exitCode: 0
    stderr: |
      resolve | final: object-propagation-controller:latest -> airgap4.inspur.com/registry/packages/management/repo@sha256:a4919dd644cec59d46cb8ec07fd504020996045ee781ae918174de01177d67dc
    updatedAt: "2024-08-21T06:28:37Z"
  usefulErrorMessage: |-
    I0821 06:28:38.733639   29330 request.go:690] Waited for 1.046921765s due to client-side throttling, not priority and fairness, request: GET:https://100.64.0.1:443/apis/cli.tanzu.vmware.com/v1alpha1
    kapp: Error: waiting on reconcile packageinstall/tkg-clusterclass-vsphere (packaging.carvel.dev/v1alpha1) namespace: tkg-system:
      Finished unsuccessfully (Reconcile failed:  (message: I0821 06:26:29.178450   29288 request.go:690) Waited for 1.035634738s due to client-side throttling, not priority and fairness, request: GET:https://100.64.0.1:443/apis/storage.k8s.io/v1
    kapp: Error: Listing schema.GroupVersionResource{Group:"cluster.x-k8s.io", Version:"v1beta1", Resource:"clusterclasses"}, namespaced: true:
      Internal error occurred: error resolving resource))
========

This error indicates that kapp cannot list the ClusterClass custom resources. The other failing apps hit the same error when they try to list other Cluster API custom resources, such as Cluster objects.
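
The full App YAML is long, and the relevant part is the .status.usefulErrorMessage field. A minimal sketch for printing only that field, assuming the App names match the package install names shown earlier; the function name app_error is hypothetical.

```shell
# Hypothetical helper: prints only the error summary of a kapp-controller App.
# $1 = namespace, $2 = App name
app_error() {
  kubectl -n "$1" get app "$2" -o jsonpath='{.status.usefulErrorMessage}'
}

# Usage (requires access to the management cluster):
#   app_error tkg-system tkg-clusterclass
#   app_error tkg-system tkg-clusterclass-vsphere
```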

All the CRDs are still present, but any attempt to get objects of those CRDs fails with the error below:

========
kubectl get clusters -A 
Internal error occurred: error resolving resource
========

System pods that list these Cluster API custom resources also fail:

========
kubectl get po -A  | grep -v Running
NAMESPACE              NAME                                                     READY   STATUS             RESTARTS           AGE
capv-system            capv-controller-manager-7c75cb48d4-582wj                 0/1     CrashLoopBackOff   1798 (4m10s ago)   7d21h
nodehealthchecker      nodehealthchecker-node-87bds-jmwpd                       0/1     Terminating        0                  8d
nodehealthchecker      nodehealthchecker-node-87bds-r9mv5                       0/1     Terminating        0                  8d
nodehealthchecker      nodehealthchecker-node-87bds-z2qrx                       0/1     Terminating        0                  8d
tca-system             vmconfig-operator-796b4788ff-gm927                       1/2     ImagePullBackOff   1605 (9m52s ago)   8d
tkg-system             tanzu-addons-controller-manager-745c96484c-z9mzv         0/1     CrashLoopBackOff   2036 (3m39s ago)   14d
tkg-system             tkr-source-controller-manager-5fbcbcb5bd-4z94r           0/1     CrashLoopBackOff   1796 (76s ago)     8d
========

Cause

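This happens when the capi-serving-cert Certificate is missing from the capi-system namespace. It is the serving certificate for capi-webhook-service, so without it requests for Cluster API resources that depend on that webhook service are likely to fail.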
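
To confirm the cause, check whether the Certificate, its Issuer and the resulting Secret exist. A minimal sketch, assuming the object names from the sample manifest in the Resolution section; the function name check_capi_cert is hypothetical.

```shell
# Hypothetical helper: reports which of the three objects exist in capi-system.
# Object names are taken from the sample manifest; adjust them if yours differ.
check_capi_cert() {
  for res in \
    certificates.cert-manager.io/capi-serving-cert \
    issuers.cert-manager.io/capi-selfsigned-issuer \
    secret/capi-webhook-service-cert
  do
    if kubectl -n capi-system get "$res" >/dev/null 2>&1; then
      echo "present: $res"
    else
      echo "MISSING: $res"
    fi
  done
}

# Usage (requires access to the management cluster):
#   check_capi_cert
```

A "MISSING" line is also printed when kubectl cannot reach the cluster, so run a plain kubectl get against the object before drawing conclusions.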

Resolution

After confirming all the symptoms above and that capi-serving-cert is indeed missing, recreate it using the Certificate and Issuer manifests from ~/.config/tanzu/tkg/providers/cluster-api/v1.4.2/core-components.yaml
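
One way to pull the two manifests out of core-components.yaml is sketched below. It assumes the file is a multi-document YAML stream separated by "---" lines with an unindented "kind:" line in each document. The function name extract_cert_manifests and the output file name are hypothetical.

```shell
# Hypothetical helper: prints every YAML document in file $1 whose top-level
# kind is Certificate or Issuer, separated by "---".
extract_cert_manifests() {
  awk '
    function flush() {
      if (keep) {
        if (printed) print "---"
        printf "%s", doc
        printed = 1
      }
      doc = ""
      keep = 0
    }
    /^---[ \t]*$/ { flush(); next }                       # document boundary
    /^kind: (Certificate|Issuer)[ \t]*$/ { keep = 1 }     # mark wanted kinds
    { doc = doc $0 "\n" }
    END { flush() }
  ' "$1"
}

# Usage (output file name chosen for illustration):
#   extract_cert_manifests ~/.config/tanzu/tkg/providers/cluster-api/v1.4.2/core-components.yaml > capi-serving-cert.yaml
```

The helper prints every Certificate and Issuer it finds, so review the output and compare it with the sample manifest before applying it.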


Sample manifest for TKG 2.3.x. The manifest is not expected to change between TKG versions, but it is better to take it from the core-components.yaml of your own TKG version because the apiVersion may change:

========
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  labels:
    cluster.x-k8s.io/provider: cluster-api
  name: capi-serving-cert
  namespace: capi-system
spec:
  dnsNames:
  - capi-webhook-service.capi-system.svc
  - capi-webhook-service.capi-system.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: capi-selfsigned-issuer
  secretName: capi-webhook-service-cert
  subject:
    organizations:
    - k8s-sig-cluster-lifecycle
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  labels:
    cluster.x-k8s.io/provider: cluster-api
  name: capi-selfsigned-issuer
  namespace: capi-system
spec:
  selfSigned: {}
========

 

After applying the manifest, run "kubectl get certificates -A"; all certificates should show READY as True.
At this point, queries for Cluster API objects such as clusters, clusterclasses and machines should start working again.
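
The apply-and-verify sequence can be sketched as follows. This is a sketch under assumptions: the manifest has been saved to a local file (the path is passed as an argument), and the 120-second timeout is an arbitrary choice. The function name recreate_capi_cert is hypothetical.

```shell
# Hypothetical helper: applies the manifest, waits for the Certificate to
# become Ready, then lists the certificates and queries a Cluster API object.
# $1 = path to the file holding the Certificate and Issuer manifest
recreate_capi_cert() {
  kubectl apply -f "$1" || return 1
  # The timeout value is an assumption; adjust it for your environment.
  kubectl -n capi-system wait --for=condition=Ready \
    certificate/capi-serving-cert --timeout=120s || return 1
  kubectl -n capi-system get certificates
  kubectl get clusters -A
}

# Usage (file name chosen for illustration):
#   recreate_capi_cert capi-serving-cert.yaml
```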

The system pods should return to the Running state and the TKG packages should reconcile successfully. You can then retry the upgrade procedure.