You would notice that the control plane and worker nodes were upgraded to Kubernetes 1.26.x and rolled out successfully, but the upgrade process failed while upgrading the core packages.
If you check the package status, you will notice that several TKG packages are in a Reconcile failed state:
========
kubectl get pkgi -A
NAMESPACE NAME PACKAGE NAME PACKAGE VERSION DESCRIPTION AGE
tca-system istio istio.telco.vmware.com 2.3.0-23226971 Reconcile failed: Error (see .status.usefulErrorMessage for details) 49d
tca-system nodeconfig-operator nodeconfig-operator.telco.vmware.com 2.3.0-23477874 Reconcile succeeded 49d
tca-system tca-diagnosis-operator tca-diagnosis-operator.telco.vmware.com 2.3.0-23573090 Reconcile succeeded 49d
tca-system tca-kubecluster-operator tca-kubecluster-operator.telco.vmware.com 2.3.0-23437142 Reconciling 49d
tca-system test-controller test-controller.telco.vmware.com 2.3.0-22712580 Reconcile succeeded 49d
tca-system vmconfig-operator vmconfig-operator.telco.vmware.com 2.3.0-23473338 Reconcile succeeded 49d
tkg-system mgmt-antrea antrea.tanzu.vmware.com 1.11.2+vmware.1-tkg.1-advanced Reconcile succeeded 49d
tkg-system mgmt-capabilities capabilities.tanzu.vmware.com 0.30.2+vmware.1 Reconcile succeeded 49d
tkg-system mgmt-metrics-server metrics-server.tanzu.vmware.com 0.6.2+vmware.1-tkg.3 Reconcile succeeded 49d
tkg-system mgmt-pinniped pinniped.tanzu.vmware.com 0.24.0+vmware.1-tkg.1 Reconcile succeeded 49d
tkg-system mgmt-secretgen-controller secretgen-controller.tanzu.vmware.com 0.14.2+vmware.2-tkg.2 Reconcile succeeded 49d
tkg-system mgmt-tkg-storageclass tkg-storageclass.tanzu.vmware.com 0.30.2+vmware.1 Reconcile succeeded 49d
tkg-system mgmt-vsphere-cpi vsphere-cpi.tanzu.vmware.com 1.26.2+vmware.1-tkg.1 Reconcile succeeded 49d
tkg-system mgmt-vsphere-csi vsphere-csi.tanzu.vmware.com 3.0.1+vmware.4-tkg.1 Reconcile succeeded 49d
tkg-system tanzu-addons-manager addons-manager.tanzu.vmware.com 0.30.2+vmware.1 Reconcile succeeded 49d
tkg-system tanzu-auth tanzu-auth.tanzu.vmware.com 0.30.2+vmware.1 Reconcile succeeded 49d
tkg-system tanzu-cliplugins cliplugins.tanzu.vmware.com 0.30.2+vmware.1 Reconcile succeeded 49d
tkg-system tanzu-core-management-plugins core-management-plugins.tanzu.vmware.com 0.30.2+vmware.1 Reconcile succeeded 49d
tkg-system tanzu-featuregates featuregates.tanzu.vmware.com 0.30.2+vmware.1 Reconcile succeeded 49d
tkg-system tanzu-framework framework.tanzu.vmware.com 0.30.2+vmware.1 Reconcile succeeded 49d
tkg-system tkg-clusterclass tkg-clusterclass.tanzu.vmware.com 0.30.2+vmware.1 Reconcile failed: Error (see .status.usefulErrorMessage for details) 49d
tkg-system tkg-clusterclass-vsphere tkg-clusterclass-vsphere.tanzu.vmware.com 0.30.2+vmware.1 Reconcile failed: Error (see .status.usefulErrorMessage for details) 49d
tkg-system tkg-pkg tkg.tanzu.vmware.com 0.30.2+vmware.1 Reconcile failed: Error (see .status.usefulErrorMessage for details) 49d
tkg-system tkr-service tkr-service.tanzu.vmware.com 0.30.2+vmware.1 Reconcile succeeded 49d
tkg-system tkr-source-controller tkr-source-controller.tanzu.vmware.com 0.30.2+vmware.1 Reconciling 49d
tkg-system tkr-vsphere-resolver tkr-vsphere-resolver.tanzu.vmware.com 0.30.2+vmware.1
========
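To see the detailed error for a failed package, you can read the .status.usefulErrorMessage field directly; for example, for the tkg-clusterclass package shown above (substitute the name of whichever package is failing):

```shell
# Print the detailed reconcile error for a failed PackageInstall
kubectl -n tkg-system get pkgi tkg-clusterclass \
  -o jsonpath='{.status.usefulErrorMessage}'
```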
Checking the corresponding App resource reveals the failure details:
========
kubectl -n tkg-system get apps tkg-clusterclass -o yaml
apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
metadata:
  annotations:
    packaging.carvel.dev/package-ref-name: tkg-clusterclass.tanzu.vmware.com
    packaging.carvel.dev/package-version: 0.30.2+vmware.1
  creationTimestamp: "2024-07-02T15:48:40Z"
  finalizers:
  - finalizers.kapp-ctrl.k14s.io/delete
  generation: 4
  name: tkg-clusterclass
  namespace: tkg-system
<O/P Redacted>
stderr: |-
I0821 06:28:38.733639 29330 request.go:690] Waited for 1.046921765s due to client-side throttling, not priority and fairness, request: GET:https://100.64.0.1:443/apis/cli.tanzu.vmware.com/v1alpha1
kapp: Error: waiting on reconcile packageinstall/tkg-clusterclass-vsphere (packaging.carvel.dev/v1alpha1) namespace: tkg-system:
Finished unsuccessfully (Reconcile failed: (message: I0821 06:26:29.178450 29288 request.go:690) Waited for 1.035634738s due to client-side throttling, not priority and fairness, request: GET:https://100.64.0.1:443/apis/storage.k8s.io/v1
kapp: Error: Listing schema.GroupVersionResource{Group:"cluster.x-k8s.io", Version:"v1beta1", Resource:"clusterclasses"}, namespaced: true:
Internal error occurred: error resolving resource))
stdout: |-
Target cluster 'https://100.64.0.1:443' (nodes: mgmt-s2hk6-grrgd, 3+)
Changes
Namespace Name Kind Age Op Op st. Wait to Rs Ri
tkg-system object-propagation-controller-manager Deployment 49d - - reconcile ongoing Waiting for 1 unavailable replicas
^ tkg-clusterclass-vsphere PackageInstall 49d - - reconcile fail Reconcile failed: (message: I0821
06:26:29.178450 29288
request.go:690] Waited for
1.035634738s due to client-side
throttling, not priority and
fairness, request:
GET:https://100.64.0.1:443/apis/storage.k8s.io/v1
kapp: Error: Listing
schema.GroupVersionResource{Group:"cluster.x-k8s.io",
Version:"v1beta1",
Resource:"clusterclasses"},
namespaced: true:
Internal error occurred: error
resolving resource)
Op: 0 create, 0 delete, 0 update, 2 noop, 0 exists
Wait to: 2 reconcile, 0 delete, 0 noop
6:28:40AM: ---- applying 2 changes [0/2 done] ----
6:28:40AM: noop packageinstall/tkg-clusterclass-vsphere (packaging.carvel.dev/v1alpha1) namespace: tkg-system
6:28:40AM: noop deployment/object-propagation-controller-manager (apps/v1) namespace: tkg-system
6:28:40AM: ---- waiting on 2 changes [0/2 done] ----
6:28:40AM: fail: reconcile packageinstall/tkg-clusterclass-vsphere (packaging.carvel.dev/v1alpha1) namespace: tkg-system
6:28:40AM: ^ Reconcile failed: (message: I0821 06:26:29.178450 29288 request.go:690] Waited for 1.035634738s due to client-side throttling, not priority and fairness, request: GET:https://100.64.0.1:443/apis/storage.k8s.io/v1
kapp: Error: Listing schema.GroupVersionResource{Group:"cluster.x-k8s.io", Version:"v1beta1", Resource:"clusterclasses"}, namespaced: true:
Internal error occurred: error resolving resource)
Deleted 1 older app changes
updatedAt: "2024-08-21T06:28:41Z"
fetch:
exitCode: 0
startedAt: "2024-08-21T06:28:37Z"
stdout: |
apiVersion: vendir.k14s.io/v1alpha1
directories:
- contents:
- imgpkgBundle:
image: airgap4.inspur.com/registry/packages/management/repo@sha256:2f4d7078d01c9cbca3b6e8ca67b1661a9d00ee27e5330ebd4ab6f7737b531507
path: .
path: "0"
kind: LockConfig
updatedAt: "2024-08-21T06:28:37Z"
friendlyDescription: 'Reconcile failed: Deploying: Error (see .status.usefulErrorMessage
for details)'
observedGeneration: 4
template:
exitCode: 0
stderr: |
resolve | final: object-propagation-controller:latest -> airgap4.inspur.com/registry/packages/management/repo@sha256:a4919dd644cec59d46cb8ec07fd504020996045ee781ae918174de01177d67dc
updatedAt: "2024-08-21T06:28:37Z"
usefulErrorMessage: |-
I0821 06:28:38.733639 29330 request.go:690] Waited for 1.046921765s due to client-side throttling, not priority and fairness, request: GET:https://100.64.0.1:443/apis/cli.tanzu.vmware.com/v1alpha1
kapp: Error: waiting on reconcile packageinstall/tkg-clusterclass-vsphere (packaging.carvel.dev/v1alpha1) namespace: tkg-system:
Finished unsuccessfully (Reconcile failed: (message: I0821 06:26:29.178450 29288 request.go:690) Waited for 1.035634738s due to client-side throttling, not priority and fairness, request: GET:https://100.64.0.1:443/apis/storage.k8s.io/v1
kapp: Error: Listing schema.GroupVersionResource{Group:"cluster.x-k8s.io", Version:"v1beta1", Resource:"clusterclasses"}, namespaced: true:
Internal error occurred: error resolving resource))
========
This error indicates that kapp cannot list the ClusterClass CRs; the other failing apps fail in the same way while trying to list other Cluster API CRs, such as Cluster objects.
If you check, you will notice that all the CRDs are present; however, attempting to get objects of those CRDs fails with the following error:
========
kubectl get clusters -A
Internal error occurred: error resolving resource
========
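To confirm that the Cluster API CRDs themselves are still registered even though listing their objects fails, you can check for them by API group; the group and resource names below are the standard Cluster API ones:

```shell
# The CRDs should all be present even though object queries fail
kubectl get crd | grep cluster.x-k8s.io
# Check the two resources involved in the errors above explicitly
kubectl get crd clusters.cluster.x-k8s.io clusterclasses.cluster.x-k8s.io
```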
System pods also fail while trying to list these Cluster API CRs:
========
kubectl get po -A | grep -v Running
NAMESPACE NAME READY STATUS RESTARTS AGE
capv-system capv-controller-manager-7c75cb48d4-582wj 0/1 CrashLoopBackOff 1798 (4m10s ago) 7d21h
nodehealthchecker nodehealthchecker-node-87bds-jmwpd 0/1 Terminating 0 8d
nodehealthchecker nodehealthchecker-node-87bds-r9mv5 0/1 Terminating 0 8d
nodehealthchecker nodehealthchecker-node-87bds-z2qrx 0/1 Terminating 0 8d
tca-system vmconfig-operator-796b4788ff-gm927 1/2 ImagePullBackOff 1605 (9m52s ago) 8d
tkg-system tanzu-addons-controller-manager-745c96484c-z9mzv 0/1 CrashLoopBackOff 2036 (3m39s ago) 14d
tkg-system tkr-source-controller-manager-5fbcbcb5bd-4z94r 0/1 CrashLoopBackOff 1796 (76s ago) 8d
========
This happens when the capi-serving-cert Certificate in the capi-system namespace is missing.
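A quick way to confirm this symptom is to check whether the Certificate and the Secret it produces exist in the capi-system namespace (the resource names come from the sample manifest in this article):

```shell
# In the failure scenario, the Certificate will be missing (NotFound)
kubectl -n capi-system get certificate capi-serving-cert
# The webhook serving Secret it backs may also be absent
kubectl -n capi-system get secret capi-webhook-service-cert
```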
Once you have reviewed and confirmed all the symptoms, and the issue is indeed due to the missing capi-serving-cert, you can recreate it by taking the Certificate and Issuer manifests from ~/.config/tanzu/tkg/providers/cluster-api/v1.4.2/core-components.yaml.
Below is a sample manifest for TKG 2.3.x. Although this rarely changes between TKG versions, it is better to take the manifest from the corresponding TKG version, as the apiVersion may change.
========
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  labels:
    cluster.x-k8s.io/provider: cluster-api
  name: capi-serving-cert
  namespace: capi-system
spec:
  dnsNames:
  - capi-webhook-service.capi-system.svc
  - capi-webhook-service.capi-system.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: capi-selfsigned-issuer
  secretName: capi-webhook-service-cert
  subject:
    organizations:
    - k8s-sig-cluster-lifecycle
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  labels:
    cluster.x-k8s.io/provider: cluster-api
  name: capi-selfsigned-issuer
  namespace: capi-system
spec:
  selfSigned: {}
========
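Assuming you have saved the two manifests above to a file (the filename capi-serving-cert.yaml here is just an example), apply them and wait for cert-manager to issue the certificate:

```shell
# Recreate the missing Certificate and Issuer
kubectl apply -f capi-serving-cert.yaml
# Wait until cert-manager reports the Certificate as Ready
kubectl -n capi-system wait certificate/capi-serving-cert \
  --for=condition=Ready --timeout=120s
```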
Once the above manifest is created, you can verify with "kubectl get certificates -n capi-system"; all the certificates should report Ready as True.
At this point, queries against Cluster API CR objects such as clusters, clusterclasses, and machines should start working.
The system pods and TKG packages should return to a Running / Reconcile succeeded state, and you can retry the upgrade procedure at this stage.
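If a failed package does not re-reconcile on its own after the certificate is restored, one way to trigger a fresh reconcile (a sketch using the PackageInstall spec.paused field, in case the kctrl CLI is not available in your environment; shown for tkg-clusterclass) is to pause and unpause it:

```shell
# Pause, then unpause, the PackageInstall to force a new reconcile attempt
kubectl -n tkg-system patch pkgi tkg-clusterclass \
  --type merge -p '{"spec":{"paused":true}}'
kubectl -n tkg-system patch pkgi tkg-clusterclass \
  --type merge -p '{"spec":{"paused":false}}'
```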