After upgrading a workload cluster to a higher vSphere Kubernetes Release (VKR) version, the cluster's conditions report Antrea as unhealthy.
If the corresponding Tanzu Kubernetes Cluster (TKC) object has not yet been retired in the environment, the TKC shows a Ready condition of False.
However, aside from the unhealthy Antrea conditions, the workload cluster's nodes deployed successfully and all pods are healthy.
This issue can be identified most readily through the symptoms below.
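From the Supervisor cluster context, the relevant conditions can be inspected on the TKC and Cluster objects; for example (the cluster name and namespace here are placeholders):

kubectl get tkc <cluster-name> -n <cluster-namespace> -o yaml
kubectl get cluster <cluster-name> -n <cluster-namespace> -o yaml

Conditions similar to the following appear in the status: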
- message: ClusterBootstrap conditions Antrea-ReconcileFailed indicate reconcile has failed
  reason: ClusterBootstrapFailed
- reason: ContainerNetworkingNotInstalled
  severity: Warning
  status: "False"
  type: NetworkProviderReconciled
- message: |-
    kapp: Error: Timed out waiting after 30s for resources:
    - apiservice/v1beta1.system.antrea.io (apiregistration.k8s.io/v1) cluster
    - apiservice/v1beta2.controlplane.antrea.io (apiregistration.k8s.io/v1) cluster
    - apiservice/v1alpha1.stats.antrea.io (apiregistration.k8s.io/v1) cluster
  status: "True"
  type: Antrea-ReconcileFailed
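Inside the workload cluster, any request that reaches the Antrea aggregated APIs reproduces the failure. As one illustrative probe (any of the three failing API groups can be substituted):

kubectl get --raw /apis/system.antrea.io/v1beta1

This returns the error below: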
Error from server (ServiceUnavailable): the server is currently unable to handle the request

Listing the APIServices inside the workload cluster shows the Antrea aggregated APIServices failing their discovery check:

kubectl get apiservice | grep antrea
v1alpha1.crd.antrea.io                 Local                True
v1alpha1.crd.antrea.tanzu.vmware.com   Local                True
v1alpha1.stats.antrea.io               kube-system/antrea   False (FailedDiscoveryCheck)
v1alpha2.crd.antrea.io                 Local                True
v1alpha3.crd.antrea.io                 Local                True
v1beta1.crd.antrea.io                  Local                True
v1beta1.system.antrea.io               kube-system/antrea   False (FailedDiscoveryCheck)
v1beta2.controlplane.antrea.io         kube-system/antrea   False (FailedDiscoveryCheck)
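Despite the failing APIServices, the antrea pods themselves are typically running normally, consistent with the healthy pods noted above:

kubectl get pods -n kube-system | grep antrea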
The Antrea kapp App in the workload cluster reports a failed reconcile:

kubectl get app -n vmware-system-tkg | grep antrea
kubectl describe app -n vmware-system-tkg <antrea app name>
conditions:
- message: 'Deploying: Error (see .status.usefulErrorMessage for details)'
  status: "True"
  type: ReconcileFailed
usefulErrorMessage: |
  kapp: Error: Timed out waiting after 30s for resources:
  - apiservice/v1alpha1.stats.antrea.io (apiregistration.k8s.io/v1) cluster
  - apiservice/v1beta2.controlplane.antrea.io (apiregistration.k8s.io/v1) cluster
  - apiservice/v1beta1.system.antrea.io (apiregistration.k8s.io/v1) cluster
Namespace  Name                             Kind        Age  Op  Op st.  Wait to    Rs       Ri
(cluster)  v1alpha1.stats.antrea.io         APIService  1y   -   -       reconcile  ongoing  Condition Available is not True (False)
^          v1beta1.system.antrea.io         APIService  1y   -   -       reconcile  ongoing  Condition Available is not True (False)
^          v1beta2.controlplane.antrea.io   APIService  1y   -   -       reconcile  ongoing  Condition Available is not True (False)
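Each failing APIService is backed by the antrea service in kube-system, which can be confirmed with, for example:

kubectl get apiservice v1beta2.controlplane.antrea.io -o jsonpath='{.spec.service}'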
Environment: vSphere Supervisor
Cause:
When Antrea API resources are requested through the kube-apiserver, the request is sent to a service or virtual machine (VM) that has the same IP address as the antrea service inside the workload cluster.
This happens because kube-proxy started later than the kube-apiserver system pod on the node. When the kube-apiserver first dialed the antrea service IP, kube-proxy had not yet programmed the service routing rules, so the connection was established to whatever endpoint actually owned that IP at the time. Because that endpoint was reachable and did not refuse the connection, the connection stayed established, and requests over it return the error status messages above since they are querying a service that is not the antrea service itself.
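The stale connection can be confirmed on a control plane node. A minimal check, assuming SSH access to the node and the ss and conntrack utilities available there (the variable name and exact commands are illustrative):

# From within the workload cluster: the ClusterIP of the antrea service
ANTREA_IP=$(kubectl get svc -n kube-system antrea -o jsonpath='{.spec.clusterIP}')

# On each control plane node: established kube-apiserver connections to that IP
ss -tnp | grep "$ANTREA_IP"

# Conntrack entries whose reply source is still the ClusterIP itself, rather
# than an antrea pod IP, were established before kube-proxy programmed its rules
conntrack -L -d "$ANTREA_IP"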
The stale connection must be cleared on all control plane nodes in the affected workload cluster. This is because the control plane endpoint (also known as the VIP) for the workload cluster load balances kube-apiserver requests across all control plane nodes. If the connection is faulty on even one node, requests will fail intermittently whenever they are load balanced to the control plane node holding the faulty connection.
This issue has been flagged to upstream Kubernetes: https://github.com/kubernetes/kubernetes/issues/135883
Resolution:
The stale connection from the kube-apiserver to the antrea service must be cleared. Restart the kube-apiserver pod on each control plane node in the workload cluster:

kubectl get pods -n kube-system | grep "kube-apiserver"
kubectl delete pod -n kube-system <kube-apiserver pod name>

After the kube-apiserver pods have restarted, verify that the Antrea daemonset, package install, and app are healthy:

kubectl get ds,pkgi,app -n kube-system | grep antrea
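If the restart cleared the stale connection, the Antrea APIServices from the symptoms above should now report as available, and the earlier checks can be rerun to confirm:

kubectl get apiservice | grep antrea
kubectl get app -n vmware-system-tkg | grep antrea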