vSphere with Tanzu Guest Cluster Fails to Upgrade to TKR Version v1.23.8---vmware.3-tkg.1

Article ID: 323417

Updated On:

Products

VMware vSphere ESXi, VMware vSphere Kubernetes Service

Issue/Introduction

Symptoms: 

  • After upgrading to vCenter version 7.0 U3F Build 20051473, or when using vCenter 7.0 U3E Build 19717403, AND when the Supervisor Cluster is on a build prior to vsc0.0.17, guest clusters cannot be upgraded to Tanzu Kubernetes Release (TKR) version v1.23.8---vmware.3-tkg.1.
  • When attempting to upgrade the TKC, kubectl get tkc displays the target version v1.23.8---vmware.3-tkg.1; however, no new cluster nodes are created to replace the existing nodes, which remain on the version they were running before the attempted upgrade.
  • The TKC remains functional and reports the status READY as True.
  • The vmware-system-tkg-controller-manager logs repeatedly show the following error while trying to migrate the CoreDNS component to the version included in the TKR:
# kubectl logs -c manager -n vmware-system-tkg vmware-system-tkg-controller-manager-57bb4d68f6-g7tjw
...
I0211 23:23:09.233662       1 control_plane_sync.go:200] vmware-system-tkg-controller-manager/tanzukubernetescluster-spec-controller/ns01/clusterv1a2 "msg"="Executing rolling update for KubeadmControlPlane"  "cluster"="clusterv1a2"
E0211 23:23:09.404734       1 tanzukubernetescluster_controller.go:418] vmware-system-tkg-controller-manager/tanzukubernetescluster-spec-controller/ns01/clusterv1a2 "msg"="Unable to reconcile control plane for cluster" "error"="Unable to sync KubeadmControlPlane for cluster \"clusterv1a2\": admission webhook \"validation.kubeadmcontrolplane.controlplane.cluster.x-k8s.io\" denied the request: KubeadmControlPlane.controlplane.cluster.x-k8s.io \"clusterv1a2-control-plane\" is invalid: spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag: Forbidden: cannot migrate CoreDNS up to '1.8.6' from '1.8.4': cannot migrate up to '1.8.6' from '1.8.4'"  "cluster"="clusterv1a2"
...
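The controller-manager pod name suffix differs per environment. As an illustrative check (commands are a sketch, run from the Supervisor Cluster context), the pod can be located and its logs filtered for the CoreDNS error:

# kubectl get pods -n vmware-system-tkg | grep controller-manager     --------------------> Find the exact pod name; the suffix varies per environment
# kubectl logs -c manager -n vmware-system-tkg vmware-system-tkg-controller-manager-57bb4d68f6-g7tjw | grep -i coredns     --------------------> Filter the logs for the CoreDNS migration error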
  • New wcpmachinetemplate objects for the cluster control plane are created without the older ones being deleted, and the number of these objects increases over time:
# kubectl get wcpmachinetemplate
NAME                                  AGE
clusterv1a2-control-plane-4fgtx       43m
clusterv1a2-control-plane-85jr4       42m
clusterv1a2-control-plane-89bxk       15m
clusterv1a2-control-plane-99fxq       5m49s
clusterv1a2-control-plane-gc75d       15m
clusterv1a2-control-plane-hcrtj       32m
clusterv1a2-control-plane-jczz4       40m
clusterv1a2-control-plane-rcbht       2d18h
clusterv1a2-control-plane-t9ccj       35m
clusterv1a2-control-plane-tkdxq       25m
clusterv1a2-control-plane-zbwrz       42m
clusterv1a2-control-plane-zl99m       37m

clusterv1a2-worker-nodepool-1-5f2cp   2d18h
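To confirm that the templates are accumulating, the control plane entries can be counted periodically; a minimal example (cluster name clusterv1a2 as above, adjust to your environment):

# kubectl get wcpmachinetemplate | grep clusterv1a2-control-plane | wc -l     --------------------> Re-run after a few minutes; the count keeps growing while the upgrade is stuck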
  • Example of the issue in the CLI:
# kubectl edit tkc     --------------------> Upgrading TKC version in this edit
tanzukubernetescluster.run.tanzu.vmware.com/clusterv1a2 edited

# kubectl get tkc     --------------------> TKC version reports update
NAME         CONTROL PLANE  WORKER  TKR NAME                  AGE    READY  TKR COMPATIBLE  UPDATES AVAILABLE
clusterv1a2  3              2       v1.23.8---vmware.3-tkg.1  2d17h  True   True 

# kubectl get machine     --------------------> Machine objects are not updated
NAME                                                  CLUSTER       NODENAME                                              PROVIDERID                                       PHASE     AGE     VERSION
clusterv1a2-control-plane-rfzh9                       clusterv1a2   clusterv1a2-control-plane-rfzh9                       vsphere://423c1849-dbc6-b8c8-4bbd-fdf5810d7ec0   Running   2d17h   v1.22.9+vmware.1
clusterv1a2-control-plane-tlclm                       clusterv1a2   clusterv1a2-control-plane-tlclm                       vsphere://423c687f-d025-6eb4-210d-58876f37c971   Running   2d17h   v1.22.9+vmware.1
clusterv1a2-control-plane-zjgg8                       clusterv1a2   clusterv1a2-control-plane-zjgg8                       vsphere://423c7301-fb2f-1795-e07e-2e07581db88b   Running   2d17h   v1.22.9+vmware.1
clusterv1a2-worker-nodepool-1-mjfsl-79dd67c94-6j7fx   clusterv1a2   clusterv1a2-worker-nodepool-1-mjfsl-79dd67c94-6j7fx   vsphere://423c7956-9048-a600-3049-70f4a4f65f22   Running   2d17h   v1.22.9+vmware.1
clusterv1a2-worker-nodepool-1-mjfsl-79dd67c94-8rx4s   clusterv1a2   clusterv1a2-worker-nodepool-1-mjfsl-79dd67c94-8rx4s   vsphere://423c43d4-f09c-e039-be7a-155dafc4b70b   Running   2d17h   v1.22.9+vmware.1
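The webhook error shown earlier refers to the CoreDNS image tag set on the cluster's KubeadmControlPlane object. As an illustrative check from the Supervisor Cluster context (object name clusterv1a2-control-plane and namespace ns01 are taken from the example above; adjust them to your environment), the rejected field can be inspected directly:

# kubectl get kubeadmcontrolplane clusterv1a2-control-plane -n ns01 -o jsonpath='{.spec.kubeadmConfigSpec.clusterConfiguration.dns.imageTag}'     --------------------> Still shows the old CoreDNS tag (1.8.4 above) because the webhook rejects the update to 1.8.6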



Environment

VMware vSphere 7.0 with Tanzu

Cause


The CAPI webhook validation for CoreDNS only supports versions 1.8.5 and higher, and TKR versions prior to v1.23.8---vmware.3-tkg.1 run an older CoreDNS version (1.8.4 in the example above) that does not meet this requirement. When the validation fails after the upgrade is kicked off, the controller cannot build and deploy the new objects, producing the behavior described above.

This might also prevent users from creating new guest clusters using TKR version v1.23.8---vmware.3-tkg.1.
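For reference, the CoreDNS version currently running inside the guest cluster can be checked from the guest cluster context; a minimal example, assuming the default CoreDNS deployment name and namespace:

# kubectl get deployment coredns -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'     --------------------> The image tag is the CoreDNS version the webhook compares against (1.8.4 in the example above)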

Resolution


After vCenter is upgraded to version 7.0 U3F Build 20051473, upgrade the Supervisor Control Plane to version v1.22.6+vmware.1-vsc0.0.17-20026652 or later.
 
Once the Supervisor Cluster is on version v1.22.6+vmware.1-vsc0.0.17-20026652 or later, retry the TKC upgrade to version v1.23.8---vmware.3-tkg.1.
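After re-triggering the upgrade, the rollout can be verified by watching the Machine objects from the vSphere Namespace context; new machines should be created on the v1.23.8 Kubernetes version while the old v1.22.9+vmware.1 machines are drained and removed. For example:

# kubectl get machine -w     --------------------> New control plane and worker machines should appear and replace the old ones one by one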

Note:
The vSphere with Tanzu version should be 0.0.17 or later. This can be verified from the current Supervisor Cluster version string, specifically the numbers after vsc.
Example: 
v1.22.6+vmware.1-vsc0.0.17-20026652

 

  • Before the Supervisor Control Plane update: (screenshot)

  • After the Supervisor Control Plane update: (screenshot)

To resolve this issue, please contact Broadcom Support and open a support request.