vSphere Supervisor Workload Cluster Unhealthy due to Antrea ServiceUnavailable Bad Status 404



Article ID: 422948


Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

After upgrading a workload cluster to a higher vSphere Kubernetes Release (VKR) version, the cluster's conditions report Antrea as unhealthy.

If the corresponding Tanzu Kubernetes Cluster (TKC) object has not yet been retired for the workload cluster, the TKC will also show a Ready False state.

However, aside from the unhealthy Antrea conditions, the workload cluster deployed its nodes successfully and all pods are healthy.

This issue can be identified most notably by the following symptoms:

  • From the Supervisor cluster context, describing the cluster or TKC returns either of the below unhealthy conditions:
    message: ClusterBootstrap conditions Antrea-ReconcileFailed indicate reconcile has failed
    reason: ClusterBootstrapFailed

    reason: ContainerNetworkingNotInstalled
    severity: Warning
    status: "False"
    type: NetworkProviderReconciled

     

  • From the Supervisor cluster context, describing the clusterbootstrap for the affected workload cluster returns kapp errors similar to the below:
    message: |-
      kapp: Error: Timed out waiting after 30s for resources:
        - apiservice/v1beta1.system.antrea.io (apiregistration.k8s.io/v1) cluster
        - apiservice/v1beta2.controlplane.antrea.io (apiregistration.k8s.io/v1) cluster
        - apiservice/v1alpha1.stats.antrea.io (apiregistration.k8s.io/v1) cluster
    status: "True"
    type: Antrea-ReconcileFailed

     

  • Within the affected workload cluster's context, running any kubectl command against the antrea CustomResourceDefinitions (CRDs) associated with the above affected antrea APIServices returns an error message similar to the below:
    Error from server (ServiceUnavailable): the server is currently unable to handle the request


  • Within the affected workload cluster's context, listing the antrea APIServices and describing the antrea app show that one or more antrea APIServices are in a Condition Available False state:
    kubectl get apiservice | grep antrea
    v1alpha1.crd.antrea.io                   Local                        True
    v1alpha1.crd.antrea.tanzu.vmware.com     Local                        True
    v1alpha1.stats.antrea.io                 kube-system/antrea           False (FailedDiscoveryCheck)
    v1alpha2.crd.antrea.io                   Local                        True
    v1alpha3.crd.antrea.io                   Local                        True
    v1beta1.crd.antrea.io                    Local                        True
    v1beta1.system.antrea.io                 kube-system/antrea           False (FailedDiscoveryCheck)
    v1beta2.controlplane.antrea.io           kube-system/antrea           False (FailedDiscoveryCheck)
    
    kubectl get app -n vmware-system-tkg | grep antrea
    
    kubectl describe app -n vmware-system-tkg <antrea app>
    
     conditions:
      - message: 'Deploying: Error (see .status.usefulErrorMessage for details)'
        status: "True"
        type: ReconcileFailed
    
     kapp: Error: Timed out waiting after 30s for resources:
            - apiservice/v1alpha1.stats.antrea.io (apiregistration.k8s.io/v1) cluster
            - apiservice/v1beta2.controlplane.antrea.io (apiregistration.k8s.io/v1) cluster
            - apiservice/v1beta1.system.antrea.io (apiregistration.k8s.io/v1) cluster
    
    Namespace  Name                            Kind        Age  Op  Op st.  Wait to    Rs       Ri
    (cluster)  v1alpha1.stats.antrea.io        APIService  1y   -   -       reconcile  ongoing  Condition Available is not True (False)
    ^          v1beta1.system.antrea.io        APIService  1y   -   -       reconcile  ongoing  Condition Available is not True (False)
    ^          v1beta2.controlplane.antrea.io  APIService  1y   -   -       reconcile  ongoing  Condition Available is not True (False)
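
As an additional check from within the workload cluster's context, the affected aggregated API groups can be queried directly through the kube-apiserver; while the issue is present, these requests return the same ServiceUnavailable error (the group/version paths below are taken from the affected APIService names above):

```shell
# Query each affected aggregated API group directly through the kube-apiserver.
# While the issue is present, each request fails with:
#   Error from server (ServiceUnavailable): the server is currently unable to handle the request
kubectl get --raw /apis/stats.antrea.io/v1alpha1
kubectl get --raw /apis/system.antrea.io/v1beta1
kubectl get --raw /apis/controlplane.antrea.io/v1beta2
```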

 

Environment

vSphere Supervisor

Cause

When antrea API resources are requested through the kube-apiserver, the request is sent to a service or virtual machine (VM) with an IP identical to that of the antrea service within the workload cluster.

Kube-proxy started later than the kube-apiserver system pod on the node, so the kube-apiserver's established connection predates the antrea service's routing rules. Because the destination with the identical IP was reachable and did not refuse the connection, the connection remained established; requests over it query a service that is not the antrea service itself, returning the error statuses above.

The erroneous connection must be corrected on all control plane nodes in the affected workload cluster, because the cluster's control plane endpoint (also known as the VIP) load balances kube-apiserver requests across all control plane nodes. If the established connection is faulty on even one node, requests will intermittently fail whenever they are load balanced to that node.

This issue has been flagged to upstream Kubernetes: https://github.com/kubernetes/kubernetes/issues/135883
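
One way to observe the faulty connection is the sketch below, which assumes SSH access to the control plane nodes and that the `ss` utility is available on them; the affected APIServices all point at the kube-system/antrea service, so its ClusterIP is the address to look for:

```shell
# From the workload cluster context, record the ClusterIP of the antrea service
# that backs the affected APIServices (service name from the APIService output above):
kubectl get svc -n kube-system antrea -o jsonpath='{.spec.clusterIP}'

# Then, on each control plane node (via SSH), list established connections to
# that ClusterIP; substitute the IP printed above. A connection established
# before kube-proxy started is the one routed to the wrong destination.
ss -tnp state established | grep <antrea-clusterip>
```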

Resolution

The stale connection from the kube-apiserver to the antrea service needs to be reset.

  1. Connect to the affected workload cluster's context.


  2. Retrieve the list of all kube-apiserver pods in the workload cluster:
    kubectl get pods -n kube-system | grep "kube-apiserver"

     

  3. Restart each kube-apiserver pod one by one, waiting for the previous pod to come back up before moving on to the next:
    kubectl delete pod -n kube-system <kube-apiserver pod name>


  4. Confirm the status of the antrea daemonset, PackageInstall (pkgi), and app, which should shortly report healthy True and Reconcile succeeded states:
    kubectl get ds -n kube-system | grep antrea
    kubectl get pkgi,app -n vmware-system-tkg | grep antrea

     

  5. If the above steps do not correct the issue, reach out to VMware by Broadcom Technical Support and reference this KB article.
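
Steps 2 through 4 above can be sketched as a single sequence; this assumes the kube-apiserver static pods carry the standard component=kube-apiserver label (pod names and labels may vary by cluster):

```shell
# Restart each kube-apiserver pod one at a time. Static pods are recreated
# automatically by the kubelet under the same name.
for pod in $(kubectl get pods -n kube-system -l component=kube-apiserver -o name); do
  kubectl delete -n kube-system "$pod"
  # Give the kubelet a moment to recreate the mirror pod, then wait for
  # readiness before moving on to the next one.
  sleep 15
  kubectl wait --for=condition=Ready -n kube-system "$pod" --timeout=5m
done

# Confirm the antrea APIServices report Available=True again:
kubectl get apiservice | grep antrea
```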