vSphere Supervisor Kubernetes Clusters Fail to Upgrade due to Gatekeeper - failed calling webhook "check-ignore-label.gatekeeper.sh"

Article ID: 323447

Products

  • VMware vSphere Kubernetes Service
  • VMware vSphere 7.0 with Tanzu
  • vSphere with Tanzu
  • Tanzu Kubernetes Runtime

Issue/Introduction

This KB covers vSphere Kubernetes cluster upgrades that become stuck because of a gatekeeper deployment in the cluster, and how to get the upgrade completed.

Symptoms:

If gatekeeper is installed on the vSphere Kubernetes cluster, the upgrade process may become stuck. While connected to the affected cluster's context, the following symptoms are observed:
  • Control plane nodes will deploy to a new version but will not reach Ready state:
    • kubectl get nodes
  • The following pods are running on the affected cluster:
    • kubectl get pods -A
    • docker-registry
    • etcd
    • kube-apiserver
    • kube-controller-manager
    • kube-proxy (but no proxy is in use within this environment)
    • kube-scheduler
    • kube-prometheus-node-exporter

  • The following pods are not Running and stuck in ImagePullBackOff error state:
    • kubectl get pods -A | grep -v Run
    • antrea-agent
    • vsphere-csi-node

  • Describing the pods stuck in ImagePullBackOff status returns an error message similar to the following:
    • kubectl describe pod -n <ImagePullBackOff pod namespace> <ImagePullBackOff pod name>
      waiting:
        message: 'rpc error: code = NotFound desc = failed to pull and unpack image "localhost:5000/vmware.io/antrea/antrea:v#.#.#_vmware.#": failed to resolve reference "localhost:5000/vmware.io/antrea/antrea:v#.#.#_vmware.#": localhost:5000/vmware.io/antrea/antrea:v#.#.#_vmware.#: not found'
        reason: ErrImagePull
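
The node and pod checks above can be combined into a small triage helper. The following is a minimal sketch and is not part of the original procedure: the function names list_not_ready_nodes and list_not_running_pods are hypothetical, and the filters assume the default column layout of kubectl get nodes (STATUS in column 2) and kubectl get pods -A (STATUS in column 4).

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the KB).
# Both read kubectl's default table output on stdin and skip the header row.

# Print nodes whose STATUS column is not exactly "Ready".
# Note: a cordoned node ("Ready,SchedulingDisabled") is also listed.
list_not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}

# Print namespace/name and STATUS for pods that are neither Running nor Completed.
list_not_running_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage, while connected to the affected cluster's context:
#   kubectl get nodes   | list_not_ready_nodes
#   kubectl get pods -A | list_not_running_pods
```

Unlike grep -v Run, the pod filter matches on the STATUS column only, so a pod whose name happens to contain "Run" is not hidden by accident.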


Environment

vSphere with Tanzu 8.0
 
This KB is written for gatekeeper installed by Tanzu Mission Control, but the issue can also occur on clusters where gatekeeper was installed manually.

Cause

In vSphere 8.0 and higher, the Supervisor Cluster creates a kapp-controller PackageInstall on the upgrading workload cluster and relies on it to create any new add-ons, including the CNI (antrea/calico). The kapp-controller PackageInstall fails with errors similar to the following:
 ---- applying 2 changes [0/6 done] ----
 noop apiservice/v1alpha1.data.packaging.carvel.dev (apiregistration.k8s.io/v1) cluster
 create namespace/tkg-system (v1) cluster
 ^ Retryable error: Creating resource namespace/tkg-system (v1) cluster: API server says: Internal error occurred: failed calling webhook "check-ignore-label.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.gatekeeper.svc:443/v1/admitlabel?timeout=3s": dial tcp ###.###.###.###:443: i/o timeout (reason: InternalError)

 
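Because of the gatekeeper ValidatingWebhookConfiguration, the API server calls the gatekeeper webhook when kapp-controller creates the tkg-system namespace. That call times out and the request is rejected, so kapp-controller cannot complete the install of the necessary CNI pod images and the upgrade stalls.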
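
To confirm that gatekeeper is the webhook involved, the webhook name can be pulled out of the error text. The following is a minimal sketch, not part of the original KB: extract_failed_webhooks is a hypothetical helper name, and piping the PackageInstall YAML into it assumes the error message is surfaced in that resource's status, which may not hold in every environment.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative, not from the KB).
# Reads error text on stdin and prints each distinct webhook name that
# appears in a 'failed calling webhook "<name>"' message.
extract_failed_webhooks() {
  grep -o 'failed calling webhook "[^"]*"' | cut -d '"' -f 2 | sort -u
}

# Usage, while connected to the affected cluster's context (assumes the
# error text appears in the PackageInstall status):
#   kubectl get packageinstalls.packaging.carvel.dev -A -o yaml | extract_failed_webhooks
# The registered webhook configurations can then be listed for comparison:
#   kubectl get validatingwebhookconfigurations
```

An output of check-ignore-label.gatekeeper.sh matches the error shown above and indicates this issue.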

Resolution

Contact Broadcom support for assistance completing the upgrade.

Additional Information

Impact/Risks:
Workload clusters affected by this issue are unable to complete their upgrade process, remaining stuck mid-upgrade.