Unable to create workload clusters in an air-gapped environment after upgrading TKGm from v1.2.1 to v1.3.1.

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

This article provides troubleshooting steps to remediate an x509 certificate error that appears after a cluster upgrade from v1.2.1 to v1.3.1 and blocks the creation of new clusters on v1.3.1.

Symptoms:

You may encounter the following symptoms when you face this issue:

1. When you create a workload cluster on v1.3.1 after the upgrade, the cluster is stuck in the "createStalled" state.

2. The coredns pod creation fails when you run kubectl get all -A on the workload control plane node.

3. The following errors appear in the logs of specific pods running on the management cluster:

"tkr-controller-manager" pod logs

2021-09-16T18:00:29.786Z INFO Failed to complete initial TKR discovery {"error": "failed to sync up TKRs with the BOM repository: failed to reconcile the BOM ConfigMap: failed to list current available BOM image tags: Get \"https://vslharbor.jura.ch/v2/\": x509: certificate signed by unknown authority"}

"tanzu-addons-controller-manager" pod logs

I1101 14:03:17.778895 1 addon_controller.go:101] controllers/Addon "msg"="Reconciling cluster" "cluster-name"="vsltkg-c01p" "cluster-ns"="production"
I1101 14:03:17.779509 1 addon_controller.go:237] controllers/Addon "msg"="Bom not found" "cluster-name"="vsltkg-c04t" "cluster-ns"="development"
I1101 14:03:17.779663 1 addon_controller.go:237] controllers/Addon "msg"="Bom not found" "cluster-name"="vsltkg-c01p" "cluster-ns"="production"

"cert-manager-webhook" pod logs

I0902 06:52:51.519009 1 dynamic_source.go:191] cert-manager/webhook "msg"="Signed new serving certificate"
I0902 06:52:51.524019 1 dynamic_source.go:197] cert-manager/webhook "msg"="Updated serving TLS certificate"
I0902 06:52:58.732030 1 logs.go:52] http: TLS handshake error from 100.96.4.1:14002: remote error: tls: bad certificate
I0902 06:52:58.732927 1 logs.go:52] http: TLS handshake error from 100.96.4.1:41608: remote error: tls: bad certificate

"kapp-controller" pod logs

{"level":"info","ts":1635634095.1287181,"logger":"kc.controller.ar","msg":"Reconcile noop","request":"tkg-system/tanzu-addons-manager"}
{"level":"info","ts":1635634095.1287382,"logger":"kc.controller.pr","msg":"Requeue after given time","request":"tkg-system/tanzu-addons-manager","after":30.049378625}
{"level":"info","ts":1635634125.1848357,"logger":"kc.controller.ar","msg":"Started deploy","request":"tkg-system/tanzu-addons-manager"}
{"level":"info","ts":1635634126.0959315,"logger":"kc.controller.ar","msg":"Completed deploy","request":"tkg-system/tanzu-addons-manager"}

Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values will vary depending on your environment.
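
The log checks above can be sketched as a small shell helper that counts the lines matching the error signatures quoted in this article. The function name count_symptom_lines is hypothetical, and the patterns come only from the excerpts above, so differently worded variants of these errors would not be matched.

```shell
# Hypothetical helper: reads log text on stdin and prints the number of lines
# that contain one of the error signatures quoted in this article.
count_symptom_lines() {
  # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
  # keeps the helper usable in scripts that run with "set -e".
  grep -E -c 'x509: certificate signed by unknown authority|Bom not found|tls: bad certificate' || true
}
```

For example, assuming the controller runs as a Deployment named tkr-controller-manager in the tkr-system namespace, the helper could be fed with: kubectl logs deployment/tkr-controller-manager -n tkr-system | count_symptom_lines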

Cause

This issue occurs because the CA certificate (ca.crt) of the private registry is missing from the configuration of the kapp controller and the tkr controller.
To avoid this issue, add the ca.crt to the kapp-controller-config and tkr-controller-config ConfigMaps.

Resolution

There is no resolution for this issue; however, a workaround is available.

Workaround:
Follow the steps below to work around this issue:

Switch to the management cluster context and retrieve the tkr-controller-config and kapp-controller-config ConfigMaps. Note that the caCerts field is empty in both ConfigMaps.

kubectl config use-context management-context
kubectl get cm tkr-controller-config -n tkr-system -o yaml

apiVersion: v1
data:
  caCerts: ""
  imageRepository: ""
kind: ConfigMap
metadata:

kubectl get cm kapp-controller-config -n tkg-system -o yaml

apiVersion: v1
data:
  caCerts: ""
  dangerousSkipTLSVerify: ""
  httpProxy: ""
  httpsProxy: ""
  noProxy: ""
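
The inspection above can also be scripted. The following is a minimal sketch that reports whether the caCerts key of a ConfigMap is empty; the function name ca_certs_state is hypothetical, and it assumes kubectl is already pointed at the management cluster context.

```shell
# Hypothetical helper: prints "empty" or "set" for the caCerts key of a ConfigMap.
# Usage: ca_certs_state <configmap-name> <namespace>
ca_certs_state() {
  # Read only the caCerts value; stop if the ConfigMap cannot be fetched,
  # so a lookup failure is not mistaken for an empty value.
  _ca_value=$(kubectl get cm "$1" -n "$2" -o jsonpath='{.data.caCerts}') || return 1
  if [ -z "$_ca_value" ]; then
    echo "empty"
  else
    echo "set"
  fi
}
```

Running ca_certs_state tkr-controller-config tkr-system and ca_certs_state kapp-controller-config tkg-system before and after the workaround gives a quick confirmation that the certificate was saved.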

As a workaround, add the ca.crt of the private registry to the caCerts field of the kapp-controller-config and tkr-controller-config ConfigMaps, then restart the kapp-controller and tkr-controller-manager pods so that they pick up the valid certificate.

kubectl edit cm tkr-controller-config -n tkr-system
kubectl edit cm kapp-controller-config -n tkg-system

caCerts: |
  -----BEGIN CERTIFICATE-----
  <Existing Certificate>
  -----END CERTIFICATE-----
  -----BEGIN CERTIFICATE-----
  <New Certificate>
  -----END CERTIFICATE-----
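
Before pasting a certificate bundle into the caCerts field, it can help to sanity-check the file and indent it to match the YAML block shown above. The sketch below assumes the registry CA is saved locally as a PEM file; both function names are hypothetical, and the check only counts BEGIN and END markers without validating the certificates themselves.

```shell
# Hypothetical helper: checks that a PEM bundle contains at least one
# certificate and that BEGIN and END markers are balanced.
# Usage: check_pem_bundle <file>
check_pem_bundle() {
  _begin=$(grep -c -- '-----BEGIN CERTIFICATE-----' "$1" || true)
  _end=$(grep -c -- '-----END CERTIFICATE-----' "$1" || true)
  if [ "$_begin" -gt 0 ] && [ "$_begin" -eq "$_end" ]; then
    echo "ok: $_begin certificate(s)"
  else
    echo "invalid bundle: $_begin BEGIN, $_end END"
    return 1
  fi
}

# Hypothetical helper: prints the bundle indented by two spaces,
# ready to paste under "caCerts: |".
# Usage: indent_pem_bundle <file>
indent_pem_bundle() {
  sed 's/^/  /' "$1"
}
```

For example, check_pem_bundle ca.crt && indent_pem_bundle ca.crt prints the indented bundle only when the marker count looks consistent.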

kubectl get pods -n tkr-system
kubectl delete pod <tkr-controller-manager-pod-name> -n tkr-system
kubectl get pods -n tkg-system
kubectl delete pod <kapp-controller-pod-name> -n tkg-system
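
As an alternative to deleting the pods by name, the restart can be sketched with kubectl rollout. This assumes that both controllers run as Deployments named tkr-controller-manager and kapp-controller, which this article does not confirm, and that the installed kubectl supports rollout restart. The function name restart_controller and the 120-second timeout are illustrative choices.

```shell
# Hypothetical helper: restarts a Deployment and waits for the rollout to finish.
# Usage: restart_controller <deployment-name> <namespace>
restart_controller() {
  # Trigger a rolling restart, then block until the new pod is ready
  # or the (arbitrarily chosen) timeout expires.
  kubectl rollout restart "deployment/$1" -n "$2" &&
    kubectl rollout status "deployment/$1" -n "$2" --timeout=120s
}
```

Under those assumptions, restart_controller tkr-controller-manager tkr-system followed by restart_controller kapp-controller tkg-system would restart both controllers without looking up the generated pod names.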

Note: After completing the steps above, the pod logs should no longer show these errors, and you can proceed with workload cluster creation by following the documentation below:

https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/1.3/vmware-tanzu-kubernetes-grid-13/GUID-tanzu-k8s-clusters-deploy.html