Error: worker.count fields are not specified in the variable schema of variable \"worker\", spec

Article ID: 375685

Updated On:

Products

VMware Tanzu Kubernetes Grid Management

Issue/Introduction

After upgrading the TKG management cluster to TKG 2.3, workload clusters cannot be upgraded with the Tanzu CLI, and the CLI output does not surface the underlying error.

Environment

  • TKG workload clusters deployed with a custom clusterClass in TKG 2.2.
  • Upgraded TKG management cluster to TKG 2.3.
  • An attempted upgrade of a TKG workload cluster to TKG 2.3 does not start, and the command returns the following:

tanzu cluster upgrade -n <WorkloadClusterNamespace> <WorkloadClusterName> 

compatibility file (/root/.config/tanzu/tkg/compatibility/tkg-compatibility.yaml) already exists, skipping download
BOM files inside /root/.config/tanzu/tkg/bom already exists, skipping download
Error: no available upgrades for cluster '<WorkloadClusterName>', namespace '<WorkloadClusterNamespace>'

  • The tanzu-addons-controller-manager-<PodID> pod in the management cluster context shows errors similar to the following:

    E0826 14:48:11.747858       1 controller.go:329] "msg"="Reconciler error" "error"="admission webhook \"default.cluster.cluster.x-k8s.io\" denied the request: Cluster.cluster.x-k8s.io \"CLUSTER_NAME\" is invalid: [spec.topology.workers.machineDeployments[0].variables.overrides: Invalid value: ...: failed validation: \"worker.count\" fields are not specified in the variable schema of variable \"worker\", spec.topology.workers.machineDeployments[1].variables.overrides: Invalid value: ...: failed validation: \"worker.count\" fields are not specified in the variable schema of variable \"worker\"]" "Cluster"={"name":"CLUSTER_NAME","namespace":"CLUSTER_NAMESPACE"} "controller"="cluster" "controllerGroup"="cluster.x-k8s.io" "controllerKind"="Cluster" "name"="CLUSTER_NAME" "namespace"="CLUSTER_NAMESPACE"
    E0826 14:48:11.840707       1 controller.go:329] "msg"="Reconciler error" "error"="cannot get the bom configuration: ..." "reconcileID"="OBJECT_RECONCILE_ID"
  • The running workload cluster YAML contains a worker.count spec similar to the following, alongside the replicas spec:

    workers:
      machineDeployments:
      - class: tkgm-worker
        metadata:
          annotations:
            run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu,os-arch=amd64
        name: pool-0
        replicas: 2 <-------------- Correct CAPI 1.3 spec for machine counts
        variables:
          overrides:
          - name: nodePoolLabels
            value:
            - key: pool
              value: wkld
          - name: worker
            value:
              count: 1 <------------------------ Problem Spec
              machine:
                customVMXKeys: {}
                diskGiB: ###
                memoryMiB: ####
                numCPUs: ###

Cause

After the TKG 2.3 upgrade of the management cluster, the Cluster API (CAPI) components on the management cluster are upgraded to CAPI 1.3, which also updates all cluster objects to the new CAPI version. CAPI 1.3 has stricter validation for the variables defined in ClusterClass objects. The upgrade writes a new replicas spec that replaces worker.count, but it does not remove the legacy variable from the cluster object, so the admission webhook rejects subsequent updates. This is noted in the TKG 2.3 release notes as a known issue for previously provisioned clusters and is resolved in a later release of TKG.

Resolution

  1. Switch context to the TKG management cluster using kubectl:

    kubectl config use-context <MANAGEMENT_CLUSTER_CONTEXT>
  2. Get the name and namespace of the cluster that is failing to upgrade using kubectl:

    kubectl get clusters -A | grep <CLUSTER_NAME>
  3. Backup the cluster object with kubectl using the information retrieved from the previous command:

    kubectl get cluster -n <CLUSTER_NAMESPACE> <CLUSTER_NAME> -o yaml > cluster.backup.yaml
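
With a backup on disk, you can confirm offline that the cluster object still carries the legacy override before touching anything live. The sketch below uses a hypothetical sample manifest for illustration; the same grep works against the real cluster.backup.yaml:

```shell
# Hypothetical fragment standing in for cluster.backup.yaml; the real
# file comes from the kubectl backup command above.
cat > /tmp/cluster-sample.yaml <<'EOF'
workers:
  machineDeployments:
  - class: tkgm-worker
    name: pool-0
    replicas: 2
    variables:
      overrides:
      - name: worker
        value:
          count: 1
EOF

# Print the lines following the worker override; a stale "count:" key
# here is the field the admission webhook rejects.
grep -A 2 'name: worker' /tmp/cluster-sample.yaml
```

If a count: line appears under the worker override, the cluster object still needs the cleanup described in the following steps.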

  4. Edit the cluster object with kubectl:

    kubectl edit cluster -n <CLUSTER_NAMESPACE> <CLUSTER_NAME>
  5. Locate the lines in the YAML that contain the worker.count spec and validate that a replicas spec is also present:

workers:
  machineDeployments:
  - class: tkgm-worker
    metadata:
      annotations:
        run.tanzu.vmware.com/resolve-os-image: os-name=ubuntu,os-arch=amd64
    name: pool-0
    replicas: 2 <-------------- Correct CAPI 1.3 spec for machine counts
    variables:
      overrides:
      - name: nodePoolLabels
        value:
        - key: pool
          value: wkld
      - name: worker
        value:
          count: 1 <------------------------ Problem Spec
          machine:
            customVMXKeys: {}
            diskGiB: ###
            memoryMiB: ####
            numCPUs: ###

  6. Remove any line under workers.machineDeployments that contains count, then save the cluster configuration.
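
If you prefer to script the change rather than edit the object interactively, the count line can be stripped from the exported manifest and the result re-applied with kubectl apply. This is a sketch only; the sample file and the sed expression are illustrative, so review the output before applying anything back to the cluster:

```shell
# Hypothetical fragment standing in for the exported cluster manifest.
cat > /tmp/cluster-edit.yaml <<'EOF'
    variables:
      overrides:
      - name: worker
        value:
          count: 1
          machine:
            numCPUs: 4
EOF

# Delete only indented "count:" lines (the legacy worker.count value);
# the CAPI 1.3 "replicas" field elsewhere in the spec does not match
# this pattern and is left untouched.
sed -i '/^ \+count:/d' /tmp/cluster-edit.yaml

cat /tmp/cluster-edit.yaml
```

Diff the edited file against cluster.backup.yaml to confirm that only the count line was removed before re-applying it.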

  7. Retry the TKG workload cluster upgrade command:

tanzu cluster upgrade -n <WorkloadClusterNamespace> <WorkloadClusterName>