vSphere Supervisor Kubernetes Clusters Fail to Upgrade due to Gatekeeper - failed calling webhook "check-ignore-label.gatekeeper.sh"

Article ID: 323447

Updated On: 02-28-2025

Products

  • VMware vSphere with Tanzu
  • VMware vSphere 7.0 with Tanzu
  • vSphere with Tanzu
  • Tanzu Kubernetes Runtime

Issue/Introduction

This KB describes how to allow a vSphere Kubernetes Cluster to complete its upgrade when a gatekeeper deployment in the cluster has caused the upgrade to become stuck.

Symptoms:

If gatekeeper is installed on the affected vSphere Kubernetes Cluster, the upgrade process may become stuck and exhibit the following symptoms, observable while connected to that cluster's context:
  • Control plane nodes will deploy to a new version but will not reach Ready state:
    • kubectl get nodes
  • The following pods are running on the affected cluster:
    • kubectl get pods -A
    • docker-registry
    • etcd
    • kube-apiserver
    • kube-controller-manager
    • kube-proxy (but no proxy is in use within this environment)
    • kube-scheduler
    • kube-prometheus-node-exporter

  • The following pods are not Running and are stuck in an ImagePullBackOff error state:
    • kubectl get pods -A | grep -v Run
    • antrea-agent
    • vsphere-csi-node

  • When describing the above pods stuck in ImagePullBackOff status, an error message similar to the following is returned:
    • kubectl describe -n <ImagePullBackOff pod namespace> <ImagePullBackOff pod name>
     waiting:
       message: 'rpc error: code = NotFound desc = failed to pull and unpack image
         "localhost:5000/vmware.io/antrea/antrea:v#.#.#_vmware.#": failed to resolve reference
         "localhost:5000/vmware.io/antrea/antrea:v#.#.#_vmware.#": localhost:5000/vmware.io/antrea/antrea:v#.#.#_vmware.#: not found'
       reason: ErrImagePull
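
To list every pod stuck in an image pull error across all namespaces in one pass, a jsonpath query along these lines can help (output columns are namespace, pod name, and container waiting reason):

  # Print namespace, pod name, and container waiting reason, then filter for image pull errors
  kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' | grep -i imagepull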


Environment

vSphere with Tanzu 8.0
 
This KB focuses on Tanzu Mission Control's installation of gatekeeper, but the issue can also occur on clusters where gatekeeper was installed manually.

Cause

In vSphere 8.0 and higher, the Supervisor Cluster creates a kapp-controller PackageInstall on the upgrading workload cluster and relies on it to create any new add-ons, including the CNI (antrea/calico). The kapp-controller PackageInstall fails due to the following errors:
 ---- applying 2 changes [0/6 done] ----
 noop apiservice/v1alpha1.data.packaging.carvel.dev (apiregistration.k8s.io/v1) cluster
 create namespace/tkg-system (v1) cluster
 ^ Retryable error: Creating resource namespace/tkg-system (v1) cluster: API server says: Internal error occurred: failed calling webhook "check-ignore-label.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.gatekeeper.svc:443/v1/admitlabel?timeout=3s": dial tcp ###.###.###.###:443: i/o timeout (reason: InternalError)

 
The gatekeeper ValidatingWebhookConfiguration does not allow the kapp-controller to complete the install of necessary CNI pod images, thus the upgrade stalls.
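
To confirm this cause on an affected cluster, the stalled PackageInstall/App reconciliation and the webhook failure policy can be inspected with standard kubectl queries along these lines (object names may vary by environment):

  # Look for PackageInstalls/Apps stuck in a reconcile-failed state
  kubectl get packageinstalls -A
  kubectl get apps -A

  # Show the failure policy of each gatekeeper webhook entry
  kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'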

Resolution

Workaround A

The above-mentioned webhook can be edited to fail open as below (a non-interactive alternative is sketched after these steps):

  1. kubectl edit validatingwebhookconfiguration gatekeeper-validating-webhook-configuration
  2. Search for "failurePolicy: Fail" and change it to "failurePolicy: Ignore"

  3. Allow for the affected cluster's upgrade to complete.

    Note: Tanzu Mission Control (TMC) operators periodically overwrite the above configuration to the desired state (back to "Fail").
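
If a non-interactive change is preferred, the same edit can be applied with kubectl patch. The array index of the "check-ignore-label.gatekeeper.sh" entry can differ between environments, so list the entries first; the patch below assumes it is at index 0:

  # List the webhook entries to find the index of check-ignore-label.gatekeeper.sh
  kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration -o jsonpath='{range .webhooks[*]}{.name}{"\n"}{end}'

  # Set that entry to fail open (adjust /webhooks/0/ to the correct index)
  kubectl patch validatingwebhookconfiguration gatekeeper-validating-webhook-configuration --type=json -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'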

 

Workaround B

If the above steps do not work, proceed with temporarily removing the webhooks from the workload cluster so the upgrade can complete. Follow the steps below in order; a consolidated command sketch of the whole procedure appears after the numbered steps:

  1. Log into the affected cluster's context

  2. Find the gatekeeper namespace:
    • kubectl get pods -A |grep gatekeeper
  3. Back up the existing validatingwebhookconfiguration and mutatingwebhookconfiguration objects to a file (webhook configurations are cluster-scoped, so no namespace flag is needed):
    • kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration -o yaml > gatekeeper-validating-webhook-backup.yaml
    • kubectl get mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration -o yaml > gatekeeper-mutating-webhook-backup.yaml
  4. IMPORTANT: If you are directly connected to the control plane node through SSH, copy the backup yamls created in the previous step off the node, for example to the Supervisor cluster or the VCSA.
    • During the rolling upgrade process, nodes are expected to be replaced and recreated. The backup yamls will be lost if the current machine is replaced before they are copied off of the node.
  5. Scale down the gatekeeper deployments:
    • kubectl get deploy -n <gatekeeper namespace>
    • Note down the current deployment replica values in order to scale them back up properly after the upgrade completes.

    • kubectl scale deploy -n <gatekeeper namespace> gatekeeper-controller-manager --replicas=0
    • kubectl scale deploy -n <gatekeeper namespace> gatekeeper-operator-manager --replicas=0
  6. Delete the validatingwebhookconfiguration and mutatingwebhookconfiguration:

    • CAUTION: Only delete the webhookconfigurations associated with gatekeeper and that were backed up in the previous step.
      • Deleting webhooks other than the gatekeeper webhooks will render the cluster unsupported.
    • kubectl delete validatingwebhookconfiguration gatekeeper-validating-webhook-configuration
    • kubectl delete mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration
  7. Allow for the affected cluster's upgrade to finish:
    • watch kubectl get nodes
  8. Apply the backup yaml file to reinstate the gatekeeper validatingwebhookconfiguration and mutatingwebhookconfiguration:
    • These may need to be copied back from the location the backup yamls were saved to in Step 4.
    • kubectl apply -f gatekeeper-validating-webhook-backup.yaml
    • kubectl apply -f gatekeeper-mutating-webhook-backup.yaml
  9. Check that the webhooks were restored successfully:
    • kubectl get validatingwebhookconfiguration
    • kubectl get mutatingwebhookconfiguration
  10. If need be, scale the gatekeeper deployments back up where X is the original replica count noted down in Step 5:
    • kubectl get deploy -n <gatekeeper namespace>
    • kubectl scale deploy -n <gatekeeper namespace> gatekeeper-controller-manager --replicas=X
    • kubectl scale deploy -n <gatekeeper namespace> gatekeeper-operator-manager --replicas=X
  11. Confirm gatekeeper is running after scaling the replicas back up:
    • kubectl get pods -n <gatekeeper namespace>
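
For reference, the numbered steps above are consolidated into the command sketch below. Replace <gatekeeper namespace>, <jump-host>, and the replica count X with the values from your environment; the deployment names reflect the TMC installation and may differ for a manual gatekeeper install:

  # Steps 2-3: locate gatekeeper and back up both webhook configurations
  kubectl get pods -A | grep gatekeeper
  kubectl get validatingwebhookconfiguration gatekeeper-validating-webhook-configuration -o yaml > gatekeeper-validating-webhook-backup.yaml
  kubectl get mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration -o yaml > gatekeeper-mutating-webhook-backup.yaml

  # Step 4: if working on a control plane VM, copy the backups off the node
  # (<jump-host> is a placeholder for any machine that survives the upgrade)
  scp gatekeeper-*-backup.yaml user@<jump-host>:/tmp/

  # Step 5: note the current replica counts, then scale gatekeeper down
  kubectl get deploy -n <gatekeeper namespace>
  kubectl scale deploy -n <gatekeeper namespace> gatekeeper-controller-manager --replicas=0
  kubectl scale deploy -n <gatekeeper namespace> gatekeeper-operator-manager --replicas=0

  # Step 6: delete ONLY the gatekeeper webhook configurations backed up above
  kubectl delete validatingwebhookconfiguration gatekeeper-validating-webhook-configuration
  kubectl delete mutatingwebhookconfiguration gatekeeper-mutating-webhook-configuration

  # Step 7: wait for all nodes to reach Ready on the new version
  watch kubectl get nodes

  # Steps 8-11: restore the webhooks and scale gatekeeper back up
  kubectl apply -f gatekeeper-validating-webhook-backup.yaml
  kubectl apply -f gatekeeper-mutating-webhook-backup.yaml
  kubectl scale deploy -n <gatekeeper namespace> gatekeeper-controller-manager --replicas=X
  kubectl scale deploy -n <gatekeeper namespace> gatekeeper-operator-manager --replicas=X
  kubectl get pods -n <gatekeeper namespace>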



Additional Information

Impact/Risks:
Workload clusters affected by this issue are unable to complete their upgrade process, remaining stuck mid-upgrade.