Upgraded Management Clusters Contain Antrea Deprecated APIServices

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:

This issue affects management clusters that were created in TKG 1.6 or earlier and upgraded to TKG 2.1 (and possibly on to 2.2 or 2.3). On these clusters, creating new workload clusters fails:

The new workload cluster is stuck in creation. The corresponding KubeadmControlPlane reports the “Available” condition as True but is still waiting for the control plane to scale up.

Status:
  Conditions:
    Last Transition Time:  2023-08-22T13:32:38Z
    Message:               Scaling up control plane to 3 replicas (actual 1)
    Reason:                ScalingUp
    Severity:              Warning
    Status:                False
    Type:                  Ready
    Last Transition Time:  2023-08-22T13:34:28Z
    Status:                True
    Type:                  Available

However, after connecting to the control plane node over SSH, the node condition shows the error “cni plugin not initialized”.

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
…
 KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

The kapp-controller is also not installed on the workload cluster.
On the management cluster, the workload cluster’s ClusterBootstrap does not have a status field. Inside the ClusterBootstrap, the spec.kapp.valuesFrom field (along with the other package information) is empty. Only the spec.kapp.refName field exists:

 "kapp": {
      "refName": "kapp-controller.tanzu.vmware.com.0.41.7+vmware.1-tkg.1"
    },
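
The ClusterBootstrap state described above can be checked from the command line. The following is a minimal sketch, not an official procedure. CLUSTER and NAMESPACE are hypothetical variables for the workload cluster's name and namespace, report_field is an illustrative helper, and the sketch assumes the ClusterBootstrap object has the same name as the workload cluster.

```shell
#!/bin/sh
# report_field NAME VALUE
# Prints whether a field extracted with jsonpath came back populated or empty.
report_field() {
  if [ -n "$2" ]; then
    echo "$1: present"
  else
    echo "$1: missing"
  fi
}

# CLUSTER and NAMESPACE are hypothetical variables; set them before running.
# The check is skipped when they are unset or kubectl is not installed.
if [ -n "${CLUSTER:-}" ] && [ -n "${NAMESPACE:-}" ] && command -v kubectl >/dev/null 2>&1; then
  for path in .status .spec.kapp.refName .spec.kapp.valuesFrom; do
    # Missing keys are returned as an empty string by kubectl's jsonpath output.
    value=$(kubectl get clusterbootstrap "$CLUSTER" -n "$NAMESPACE" -o jsonpath="{$path}")
    report_field "$path" "$value"
  done
fi
```

On an affected cluster, this would be expected to report .status and .spec.kapp.valuesFrom as missing and .spec.kapp.refName as present.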

On the management cluster, one or more of the deprecated Antrea APIServices below appear in the output of “kubectl get apiservice”. (Their AVAILABLE column is expected to be False.)
v1beta1.networking.antrea.tanzu.vmware.com
v1beta1.controlplane.antrea.tanzu.vmware.com
v1alpha1.stats.antrea.tanzu.vmware.com
v1beta1.system.antrea.tanzu.vmware.com
v1beta2.controlplane.antrea.tanzu.vmware.com
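
To narrow the “kubectl get apiservice” output down to these five names, a filter along the following lines may help. It is a sketch only: filter_deprecated_antrea is a hypothetical helper name, and the pattern is built directly from the list above.

```shell
#!/bin/sh
# filter_deprecated_antrea: reads `kubectl get apiservice` output on stdin and
# keeps only the rows for the five deprecated Antrea APIServices listed above.
filter_deprecated_antrea() {
  grep -E '^(v1beta1\.networking|v1beta1\.controlplane|v1alpha1\.stats|v1beta1\.system|v1beta2\.controlplane)\.antrea\.tanzu\.vmware\.com([[:space:]]|$)'
}

# Query the current kubectl context only when kubectl is installed and the
# listing succeeds; otherwise do nothing.
if command -v kubectl >/dev/null 2>&1 && listing=$(kubectl get apiservice); then
  printf '%s\n' "$listing" | filter_deprecated_antrea ||
    echo "No deprecated Antrea APIServices found"
fi
```

Any row that is printed belongs to a deprecated APIService; check its AVAILABLE column.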

In the tanzu-addons-controller-manager pod’s log, an “unable to retrieve the complete list of server APIs” error like the one below can be found:

E0822 13:32:35.420973       1 clusterbootstrapclone.go:789] ClusterBootstrapController "msg"="failed to getGVR" "error"="unable to retrieve the complete list of server APIs: controlplane.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, controlplane.antrea.tanzu.vmware.com/v1beta2: the server is currently unable to handle the request, stats.antrea.tanzu.vmware.com/v1alpha1: the server is currently unable to handle the request, system.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request"

In the API server pod’s log, a failed response for a deprecated APIService like the one below can be found:

E0823 00:59:15.953455       1 available_controller.go:524] v1beta1.system.antrea.tanzu.vmware.com failed with: failing or missing response from https://100.67.143.140:443/apis/system.antrea.tanzu.vmware.com/v1beta1: bad status from https://100.67.143.140:443/apis/system.antrea.tanzu.vmware.com/v1beta1: 404

Cause

The Antrea ClusterResourceSet “<management-cluster-name>-antrea” restores the deprecated APIServices on the management cluster. While these APIServices are unavailable, API discovery fails with the “failed to getGVR” error shown above, which can prevent the tanzu-addons-controller-manager from bootstrapping kapp-controller and other addons, such as the CNI, on the workload cluster.

Resolution

ClusterResourceSets are no longer used on management clusters as of TKG 2.1. They will be cleaned up in future releases.
This issue is fixed in TKG 2.4.0.

Workaround:

Save a copy of the ClusterResourceSet <management-cluster-name>-antrea, then delete it. The referenced secret <management-cluster-name>-antrea-crs can be copied and deleted in the same way.
ClusterResourceSets are no longer needed on the management cluster in TKG 2.x, where addons such as Antrea are managed by Carvel packages. The copies are saved only for reference.

kubectl get clusterresourceset <management-cluster-name>-antrea -n tkg-system -oyaml > antrea-crs.yaml
kubectl get secret <management-cluster-name>-antrea-crs -n tkg-system -oyaml > antrea-crs-secret.yaml

kubectl delete clusterresourceset <management-cluster-name>-antrea -n tkg-system
kubectl delete secret <management-cluster-name>-antrea-crs -n tkg-system
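
The save-then-delete steps can also be scripted so that nothing is deleted unless its backup was actually written. The following is a minimal sketch under stated assumptions: MGMT_CLUSTER is a hypothetical variable holding the management cluster name, and crs_name, crs_secret_name, and backup_then_delete are illustrative helpers that are not part of TKG.

```shell
#!/bin/sh
# Derive the object names from the management cluster name.
crs_name()        { printf '%s-antrea\n' "$1"; }
crs_secret_name() { printf '%s-antrea-crs\n' "$1"; }

# backup_then_delete KIND NAME FILE
# Saves the object as YAML, then deletes it only if the backup file is non-empty.
backup_then_delete() {
  kubectl get "$1" "$2" -n tkg-system -o yaml > "$3" || return 1
  [ -s "$3" ] || return 1
  kubectl delete "$1" "$2" -n tkg-system
}

# MGMT_CLUSTER is a hypothetical variable; nothing runs unless it is set
# and kubectl is installed.
if [ -n "${MGMT_CLUSTER:-}" ] && command -v kubectl >/dev/null 2>&1; then
  backup_then_delete clusterresourceset "$(crs_name "$MGMT_CLUSTER")" antrea-crs.yaml &&
    backup_then_delete secret "$(crs_secret_name "$MGMT_CLUSTER")" antrea-crs-secret.yaml
fi
```

Chaining the two calls with && means the secret is left untouched if the ClusterResourceSet could not be saved or deleted.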

Delete any deprecated APIServices that are present:

kubectl delete apiservice v1beta1.networking.antrea.tanzu.vmware.com
kubectl delete apiservice v1beta1.controlplane.antrea.tanzu.vmware.com
kubectl delete apiservice v1alpha1.stats.antrea.tanzu.vmware.com
kubectl delete apiservice v1beta1.system.antrea.tanzu.vmware.com
kubectl delete apiservice v1beta2.controlplane.antrea.tanzu.vmware.com
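
Because only some of these APIServices may exist on a given cluster, the deletions can be wrapped in a loop that tolerates missing names. This is a sketch rather than a required procedure. delete_deprecated_apiservices is a hypothetical helper, and its optional argument exists only so that the commands can be previewed with echo before anything is deleted.

```shell
#!/bin/sh
# delete_deprecated_apiservices [RUNNER]
# Calls RUNNER (default: kubectl) once per deprecated Antrea APIService.
# --ignore-not-found lets the delete succeed for names that are already absent.
delete_deprecated_apiservices() {
  runner=${1:-kubectl}
  for name in \
    v1beta1.networking.antrea.tanzu.vmware.com \
    v1beta1.controlplane.antrea.tanzu.vmware.com \
    v1alpha1.stats.antrea.tanzu.vmware.com \
    v1beta1.system.antrea.tanzu.vmware.com \
    v1beta2.controlplane.antrea.tanzu.vmware.com
  do
    "$runner" delete apiservice "$name" --ignore-not-found
  done
}

# Preview only: with `echo` as the runner, the kubectl arguments are printed
# and nothing is deleted.
delete_deprecated_apiservices echo
```

Calling delete_deprecated_apiservices with no argument runs the deletions through kubectl.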

The tanzu-addons-controller-manager reconciliation should then proceed after some time. Exponential back-off caused by the earlier errors can delay this; manually deleting the addon manager pod speeds up the recovery. On the workload clusters, kapp-controller and the CNI should then be installed, the nodes should become Ready, and the control plane should start scaling out.
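
Deleting the addon manager pod could look like the sketch below. It assumes that the pod runs in the tkg-system namespace and that its name starts with tanzu-addons-controller-manager-; verify both on your cluster first. addon_manager_pods is a hypothetical helper name.

```shell
#!/bin/sh
# Assumption: the addon manager runs in the tkg-system namespace.
NAMESPACE=tkg-system

# addon_manager_pods: reads `kubectl get pods -o name` output on stdin and
# keeps only the tanzu-addons-controller-manager pods.
addon_manager_pods() {
  grep '^pod/tanzu-addons-controller-manager-'
}

# Delete each matching pod; skipped entirely when kubectl is not installed.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pods -n "$NAMESPACE" -o name | addon_manager_pods |
    while read -r pod; do
      kubectl delete "$pod" -n "$NAMESPACE"
    done
fi
```

The pod is expected to be recreated by its controller and to start reconciling again without the accumulated back-off delay.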

This workaround is a one-time task. Once the steps above are completed, the problem will not recur in future TKG upgrades.

To avoid encountering this problem, before upgrading the management cluster from TKG 1.6 to TKG 2.1, save and delete the related ClusterResourceSet and secret as shown below if their status is failing.

kubectl get clusterresourceset <management-cluster-name>-antrea -n tkg-system -oyaml > antrea-crs.yaml
kubectl get secret <management-cluster-name>-antrea-crs -n tkg-system -oyaml > antrea-crs-secret.yaml

kubectl delete clusterresourceset <management-cluster-name>-antrea -n tkg-system
kubectl delete secret <management-cluster-name>-antrea-crs -n tkg-system

If the management cluster was created in TKG 1.6 or earlier and has already been upgraded to TKG 2.1, 2.2, or 2.3, perform the workaround steps above before the next upgrade to avoid this issue.

Additional Information

Impact/Risks:
Workload clusters cannot be created, and control plane nodes remain stuck in the Provisioning state.