Issue Summary:
In this scenario, an upgrade was performed on a TKG Management Cluster from version 2.3.1 to 2.4.1.
NOTE: Similar Management Cluster upgrade failures were reported in older TKG versions as well.
Errors:
The main error:
tanzu mc upgrade fails with the following error:
Error: upgrade version compatibility validation failed: unable to get tkg version of management cluster "CLUSTER_NAME" in namespace "": unable to get the cluster object: Internal error occurred: error resolving resource
Other errors you may also see:
The kube-apiserver log may show:
unable to load root certificates: unable to parse bytes as PEM block
The cert-manager-cainjector pod log may show:
cert-manager/secret-for-certificate-mapper "msg"="unable to fetch certificate that owns the secret" "error"="Certificate.cert-manager.io \"capi-serving-cert\" not found" "certificate"={"Namespace":"capi-system","Name":"capi-serving-cert"} "secret"={"Namespace":"capi-system","Name":"capi-webhook-service-cert"}
The ako-operator may be unable to communicate with the kube-apiserver and may report:
"error"="Certificate.cert-manager.io \"capi-serving-cert\" not found"
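The error strings above can be searched for directly in the component logs. This is a minimal sketch that assumes cert-manager-cainjector runs as a Deployment of that name in the cert-manager namespace, and that kube-apiserver runs as kubeadm-style static pods labeled component=kube-apiserver in kube-system; adjust both if your cluster differs. The scan_capi_cert_errors function name is hypothetical, and ako-operator is not covered because this sketch makes no assumption about where it is deployed.

```shell
# Hypothetical helper: print log lines matching the error signatures above.
# ASSUMPTIONS: cert-manager-cainjector is a Deployment in the "cert-manager"
# namespace; kube-apiserver pods carry the kubeadm label component=kube-apiserver.
scan_capi_cert_errors() {
  # Two of the error strings quoted in this article, as one extended regex.
  local signatures='unable to parse bytes as PEM block|capi-serving-cert.*not found'

  echo "== cert-manager-cainjector =="
  kubectl logs -n cert-manager deployment/cert-manager-cainjector --tail=2000 2>/dev/null \
    | grep -E "$signatures"

  echo "== kube-apiserver =="
  kubectl logs -n kube-system -l component=kube-apiserver --tail=2000 2>/dev/null \
    | grep -E "$signatures"

  # grep exits 1 when nothing matches; that is not an error for this scan.
  return 0
}
```

After sourcing the file, run scan_capi_cert_errors against the Management Cluster context. A header with no lines under it means no match was found in the last 2000 log lines of that component.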
Validation:
Run the following commands against your Management Cluster to verify whether you are hitting the same issue.
If the Certificate and Issuer shown below are not present, you likely have the same issue.
Switch to your Management Cluster context:
kubectl config use-context MANAGEMENT_CLUSTER_CONTEXT
Confirm that the capi-serving-cert Certificate exists in the capi-system Namespace:
kubectl get certificate -n capi-system
You should see at least the following certificate:
NAMESPACE     NAME                READY   SECRET                      AGE
capi-system   capi-serving-cert   True    capi-webhook-service-cert   34m
Confirm that the capi-selfsigned-issuer Issuer exists in the capi-system Namespace:
kubectl get issuer -n capi-system
You should see at least the following issuer:
NAMESPACE     NAME                     READY   AGE
capi-system   capi-selfsigned-issuer   True    39m
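The two checks above can be combined into a single pass/fail test. This is a minimal sketch that asks kubectl for each object by TYPE/NAME and relies on kubectl exiting non-zero when the object is not found. The check_capi_cert_objects name is hypothetical. Note that a connection or authentication failure is also reported as MISSING, so confirm the context is correct first.

```shell
# Hypothetical helper: check that the cert-manager objects Cluster API relies on
# exist in capi-system. Returns 0 if both exist, 1 if either is missing.
# Caveat: any kubectl failure (including a connection error) counts as missing.
check_capi_cert_objects() {
  local missing=0 obj
  for obj in certificate/capi-serving-cert issuer/capi-selfsigned-issuer; do
    # 'kubectl get TYPE/NAME' exits non-zero when the object is not found.
    if kubectl get "$obj" -n capi-system >/dev/null 2>&1; then
      echo "present: $obj"
    else
      echo "MISSING: $obj"
      missing=1
    fi
  done
  return "$missing"
}
```

Usage, after switching to the Management Cluster context: check_capi_cert_objects || echo "likely the missing capi-serving-cert issue". This only checks existence; the READY column from the commands above is still worth reviewing by hand.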
Environment:
Tanzu Kubernetes Grid (TKG): 2.3.1
Tanzu Kubernetes Grid (TKG): 2.4.1
Summary:
The failure occurs when the capi-serving-cert Certificate in the capi-system Namespace goes missing (typically along with the capi-selfsigned-issuer Issuer).
Details:
The tanzu CLI calls the clusterctl API (an open-source, upstream Cluster API component) to perform a clusterctl upgrade, which upgrades the Cluster API providers (CRDs and controllers) installed in a management cluster. The tanzu CLI then waits for the upgrade to succeed; this is where the failure occurs.
Although clusterctl upgrade is designed to be idempotent, it may be unable to recover in this scenario: the capi-system/capi-serving-cert Certificate is already missing, which is a state the clusterctl API is not designed to handle.
The upgrade also requires in-cluster networking to be working. Given this, the root cause of the missing Certificate and Issuer is not clear. It may be a symptom of an infrastructure or network failure that occurred during the clusterctl upgrade.
Open a Tanzu Support case. A Tanzu engineer will assess your system further before reapplying the Issuer and Certificate manifest to restore them in the capi-system Namespace.
Steps:
Collect a tanzu diagnostics bundle from your Management Cluster.
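When opening the support case, it may help to attach the validation output alongside the diagnostics bundle. This is a minimal sketch that saves the two kubectl get listings to a directory and then runs tanzu diagnostics collect, assuming the Tanzu CLI diagnostics plugin is installed. The collect_support_data name and its default directory name are hypothetical, and any flags (for example, to target a specific management cluster) should be checked with tanzu diagnostics collect --help for your CLI version.

```shell
# Hypothetical helper: save the validation output, then collect the bundle.
# ASSUMPTION: the Tanzu CLI diagnostics plugin is installed and the current
# kubectl context points at the Management Cluster.
collect_support_data() {
  # Hypothetical default directory name; pass your own as the first argument.
  local outdir="${1:-capi-support-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$outdir" || return 1

  # Snapshot the validation commands (errors included) so the case shows
  # which of the two objects is missing.
  kubectl get certificate -n capi-system >"$outdir/certificates.txt" 2>&1
  kubectl get issuer -n capi-system >"$outdir/issuers.txt" 2>&1
  echo "Validation output saved in $outdir"

  # Collect the diagnostics bundle; the function returns this command's status.
  tanzu diagnostics collect
}
```

Attach both the saved text files and the resulting diagnostics bundle to the support case.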