Unable to create workload clusters in an air-gapped environment after upgrading TKGm from v1.2.1 to v1.3.1.

Article ID: 327446

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

This article provides troubleshooting steps to remediate an x509 certificate error encountered after upgrading from v1.2.1 to v1.3.1. The error blocks creation of new workload clusters on v1.3.1.


Symptoms:

You may encounter the following symptoms with this issue:

1. When you create a workload cluster on v1.3.1 after the upgrade, the cluster is stuck in the "createStalled" state.

2. The coredns pod fails to be created, as shown when you run kubectl get all -A on the workload cluster control plane node.


3. You also see the following errors in the logs of specific pods running on the management cluster:
  • "tkr-controller-manager" pod logs
2021-09-16T18:00:29.786Z INFO Failed to complete initial TKR discovery {"error": "failed to sync up TKRs with the BOM repository: failed to reconcile the BOM ConfigMap: failed to list current available BOM image tags: Get \"https://vslharbor.jura.ch/v2/\": x509: certificate signed by unknown authority"}
  • "tanzu-addons-controller-manager" pod logs
I1101 14:03:17.778895    1 addon_controller.go:101] controllers/Addon "msg"="Reconciling cluster" "cluster-name"="vsltkg-c01p" "cluster-ns"="production"
I1101 14:03:17.779509    1 addon_controller.go:237] controllers/Addon "msg"="Bom not found" "cluster-name"="vsltkg-c04t" "cluster-ns"="development"
I1101 14:03:17.779663    1 addon_controller.go:237] controllers/Addon "msg"="Bom not found" "cluster-name"="vsltkg-c01p" "cluster-ns"="production"
  • "cert-manager-webhook" pod logs
I0902 06:52:51.519009 1 dynamic_source.go:191] cert-manager/webhook "msg"="Signed new serving certificate"
I0902 06:52:51.524019 1 dynamic_source.go:197] cert-manager/webhook "msg"="Updated serving TLS certificate"
I0902 06:52:58.732030  1 logs.go:52] http: TLS handshake error from 100.96.4.1:14002: remote error: tls: bad certificate
I0902 06:52:58.732927 1 logs.go:52] http: TLS handshake error from 100.96.4.1:41608: remote error: tls: bad certificate
  • "kapp-controller" pod logs
{"level":"info","ts":1635634095.1287181,"logger":"kc.controller.ar","msg":"Reconcile noop","request":"tkg-system/tanzu-addons-manager"}
{"level":"info","ts":1635634095.1287382,"logger":"kc.controller.pr","msg":"Requeue after given time","request":"tkg-system/tanzu-addons-manager","after":30.049378625}
{"level":"info","ts":1635634125.1848357,"logger":"kc.controller.ar","msg":"Started deploy","request":"tkg-system/tanzu-addons-manager"}
{"level":"info","ts":1635634126.0959315,"logger":"kc.controller.ar","msg":"Completed deploy","request":"tkg-system/tanzu-addons-manager"}
 
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
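
To confirm the symptom quickly, the controller logs can be filtered for the trust error shown in the tkr-controller-manager excerpt. The following is a minimal sketch, not part of any TKG tooling: count_x509_errors is a hypothetical helper name, and the pod name in the usage comment is a placeholder for the name reported in your environment.

```shell
# Sketch only: count_x509_errors is a hypothetical helper name.
# Reads log text on stdin and prints the number of lines that contain
# the "unknown authority" x509 error.
count_x509_errors() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # helper usable in scripts that run with "set -e".
  grep -c 'x509: certificate signed by unknown authority' || true
}

# Example usage against the management cluster (placeholder pod name):
#   kubectl logs <tkr-controller-manager-pod-name> -n tkr-system | count_x509_errors
```

A non-zero count suggests that the controller does not trust the private registry certificate.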

Cause

This issue occurs because the x509 CA certificate (ca.crt) of the private registry is missing from the kapp-controller and tkr-controller configuration.
To avoid this issue, add the ca.crt to the kapp-controller-config and tkr-controller-config ConfigMaps.

Resolution

There is currently no resolution for this issue; however, a workaround is available.

Workaround:
Follow the steps below to work around this issue:
  • Switch to the management cluster context and retrieve the tkr-controller-config and kapp-controller-config ConfigMaps. Notice that neither ConfigMap contains a ca.crt (caCerts is empty).
kubectl config use-context management-context
kubectl get cm tkr-controller-config -n tkr-system -o yaml
apiVersion: v1
data:
  caCerts: ""
  imageRepository: ""
kind: ConfigMap
metadata:

kubectl get cm kapp-controller-config -n tkg-system -o yaml

apiVersion: v1
data:
  caCerts: ""
  dangerousSkipTLSVerify: ""
  httpProxy: ""
  httpsProxy: ""
  noProxy: ""
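
The empty caCerts field can also be checked without reading the full YAML output. The sketch below assumes kubectl is already pointed at the management cluster; ca_certs_configured is a hypothetical helper name introduced only for illustration.

```shell
# Sketch only: ca_certs_configured is a hypothetical helper name.
# Usage: ca_certs_configured <configmap-name> <namespace>
# Returns 0 if data.caCerts is non-empty, 1 if it is empty,
# and 2 if the kubectl call itself fails.
ca_certs_configured() {
  local value
  # -o jsonpath extracts only the data.caCerts field of the ConfigMap.
  value=$(kubectl get cm "$1" -n "$2" -o jsonpath='{.data.caCerts}') || return 2
  if [ -n "$value" ]; then
    echo "$1: caCerts is set"
  else
    echo "$1: caCerts is empty"
    return 1
  fi
}

# Example usage:
#   ca_certs_configured tkr-controller-config tkr-system
#   ca_certs_configured kapp-controller-config tkg-system
```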
  • Inject the ca.crt of the private registry into the caCerts field of the kapp-controller-config and tkr-controller-config ConfigMaps, then restart the kapp-controller and tkr-controller-manager pods so that they pick up the valid certificate.

kubectl edit cm tkr-controller-config -n tkr-system
kubectl edit cm kapp-controller-config -n tkg-system
caCerts: |
  -----BEGIN CERTIFICATE-----
  <Existing Certificate>
  -----END CERTIFICATE-----
  -----BEGIN CERTIFICATE-----
  <New Certificate>
  -----END CERTIFICATE-----
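
Because caCerts is a YAML block scalar, every certificate line must be indented consistently or the edit is rejected. The following sketch prints a certificate file in the same layout as the snippet above, ready to paste into the editor. indent_pem is a hypothetical helper name, and the file path in the usage comment is an assumption. When pasting under the ConfigMap's data: key, keep the certificate lines indented deeper than the caCerts key.

```shell
# Sketch only: indent_pem is a hypothetical helper name.
# Usage: indent_pem <path-to-ca.crt> [<path-to-another.crt> ...]
# Prints a "caCerts: |" header followed by every line of the given
# PEM files, each prefixed with two spaces.
indent_pem() {
  echo "caCerts: |"
  # sed adds two leading spaces to each line of each input file.
  sed 's/^/  /' "$@"
}

# Example usage (path is a placeholder):
#   indent_pem /path/to/ca.crt
```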

kubectl get pods -n tkr-system
kubectl delete pod <tkr-controller-manager-pod-name> -n tkr-system
kubectl get pods -n tkg-system
kubectl delete pod <kapp-controller-pod-name> -n tkg-system


NOTE: After you complete the steps above, the errors no longer appear in the pod logs, and you can proceed with workload cluster creation by following the documentation below:

https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/1.3/vmware-tanzu-kubernetes-grid-13/GUID-tanzu-k8s-clusters-deploy.html