Issue Summary:
In this scenario, an upgrade was performed on a TKG Management Cluster from version 2.3.1 to 2.4.1.
NOTE: Similar Management Cluster upgrade failures were reported in older TKG versions as well.
Errors:
The main error:
tanzu mc upgrade: management cluster upgrade fails with the below error:
Error: upgrade version compatibility validation failed: unable to get tkg version of management cluster "CLUSTER_NAME" in namespace "": unable to get the cluster object: Internal error occurred: error resolving resource
Other errors you may also see:
kube-apiserver log may show:
unable to load root certificates: unable to parse bytes as PEM block
cert-manager-cainjector pod may show:
cert-manager/secret-for-certificate-mapper "msg"="unable to fetch certificate that owns the secret" "error"="Certificate.cert-manager.io \"capi-serving-cert\" not found" "certificate"={"Namespace":"capi-system","Name":"capi-serving-cert"} "secret"={"Namespace":"capi-system","Name":"capi-webhook-service-cert"}
ako-operator may not be able to communicate with the kube-apiserver and may report:
"error"="Certificate.cert-manager.io \"capi-serving-cert\" not found"
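The error signatures above can be searched for in saved log output. The following is a minimal sketch, not an official tool: the function name scan_for_capi_cert_errors and its log file argument are hypothetical, and the patterns are fixed strings copied from the messages above, so they assume your logs use the same quoting and escaping.

```shell
# Hypothetical helper: scan a saved log file for the error signatures
# listed in this article. Returns 0 if any signature is found, 1 otherwise.
scan_for_capi_cert_errors() {
  logfile="$1"
  status=1
  for pattern in \
    'error resolving resource' \
    'unable to load root certificates: unable to parse bytes as PEM block' \
    'capi-serving-cert\" not found'
  do
    # -F treats the pattern as a fixed string, not a regular expression.
    if grep -F -q -- "$pattern" "$logfile"; then
      echo "FOUND: $pattern"
      status=0
    fi
  done
  return "$status"
}
```

For example, after saving the kube-apiserver or cert-manager-cainjector log to a file, call scan_for_capi_cert_errors with the path to that file. A zero exit status means at least one signature was found.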
Validation:
Run the following commands against your Management Cluster to verify if this is the same issue.
If the following Issuer and Certificate are missing from the output, then you likely have the same issue.
kubectl config use-context MANAGEMENT_CLUSTER_CONTEXT
capi-serving-cert exists in the capi-system Namespace:
kubectl get certificate -n capi-system
You should see at least the following certificate:
NAMESPACE     NAME                READY   SECRET                      AGE
capi-system   capi-serving-cert   True    capi-webhook-service-cert   34m
capi-selfsigned-issuer exists in the capi-system Namespace:
kubectl get issuer -n capi-system
You should see at least the following issuer:
NAMESPACE     NAME                     READY   AGE
capi-system   capi-selfsigned-issuer   True    39m
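The two checks above can be combined into one pass. This is a minimal sketch that assumes kubectl already points at the Management Cluster context; the function name check_capi_cert_resources and the KUBECTL override variable are hypothetical. A non-zero kubectl exit can also mean the API server was unreachable, so treat a MISSING result as a prompt to look further rather than as proof.

```shell
# Assumption: kubectl is on PATH and the Management Cluster context is selected.
# KUBECTL can be overridden, for example to point at a specific binary.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: report whether the Certificate and Issuer named in
# this article can be fetched from the capi-system Namespace.
# Returns 0 if both are present, 1 if either lookup fails.
check_capi_cert_resources() {
  missing=0
  if ! $KUBECTL get certificate capi-serving-cert -n capi-system >/dev/null 2>&1; then
    echo "MISSING: Certificate capi-system/capi-serving-cert"
    missing=1
  fi
  if ! $KUBECTL get issuer capi-selfsigned-issuer -n capi-system >/dev/null 2>&1; then
    echo "MISSING: Issuer capi-system/capi-selfsigned-issuer"
    missing=1
  fi
  if [ "$missing" -eq 0 ]; then
    echo "OK: Certificate and Issuer are present in capi-system"
  fi
  return "$missing"
}
```

Call check_capi_cert_resources after switching to the Management Cluster context. Any MISSING line corresponds to the condition described in this article.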
Tanzu Kubernetes Grid (TKG): 2.3.1
Tanzu Kubernetes Grid (TKG): 2.4.1
Summary:
The failure results when the capi-serving-cert Certificate in the capi-system Namespace goes missing.
Details:
The tanzu CLI calls the clusterctl API (open source API). It then performs a clusterctl upgrade.
This is an upstream component that takes care of upgrading the version of the Cluster API providers (CRDs, controllers) installed into a management cluster.
The tanzu CLI then waits for the clusterctl upgrade to succeed. This is where the failure occurs.
Although the clusterctl upgrade is designed to be "idempotent", it is possible that it is unable to recover in this scenario.
In this scenario, the capi-system/capi-serving-cert Certificate is already missing, a state that the clusterctl API design does not account for.
The clusterctl upgrade also requires networking in the cluster to be working for the duration of the upgrade.
Given this, the actual cause of the missing Certificate and Issuer is not clear. It may be a symptom of an infrastructure or network failure occurring during the clusterctl upgrade.
Open a Tanzu Support request. A Tanzu Engineer will assess your system further before applying the Issuer and Certificate manifest back to your Cluster API.
Steps:
Collect a tanzu diagnostics bundle from your Management Cluster.