apply-addons job fails with exit code 1 and "Internal error occurred: failed calling webhook" and Gatekeeper service "not found"



Article ID: 298700


Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

During a BOSH deployment, such as a cluster upgrade or another deployment update, the apply-addons job fails with exit code 1.


The BOSH task debug output shows the following.

EXAMPLE COMMAND:
bosh task XXXX --debug

In the "result_output" section of the output you see:
"errand_name":"apply-addons","exit_code":1
and:
Internal error occurred: failed calling webhook
and:
service \"<GATEKEEPER_WEBHOOK_SERVICE_NAME>\" not found"
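
Because the debug output is long, it can help to filter it down to the strings quoted above. The following is a minimal sketch, not part of the product tooling: filter_addon_errors is a hypothetical helper name, and the patterns are taken only from the strings shown in this article, so they may need adjusting for your output.

```shell
# Sketch: keep only lines that mention the apply-addons errand result or the
# webhook failure. filter_addon_errors is a hypothetical helper; it reads the
# text to filter on stdin.
filter_addon_errors() {
  grep -E 'errand_name.*apply-addons|failed calling webhook|service .*not found'
}

# Usage (replace XXXX with the ID of the failed task):
#   bosh task XXXX --debug | filter_addon_errors
```

Lines in the debug output can be very long, so the matching lines may still need to be read carefully.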



If you run the apply-addons errand manually and keep the errand VM alive, you see a more human-readable error similar to the one below:


EXAMPLE COMMAND to run the apply-addons errand and keep its VM afterwards:
bosh -d CLUSTER_SERVICE_INSTANCE run-errand apply-addons --keep-alive


OUTPUT:
Instance   apply-addons/<BOSH_INSTANCE_ID>
Exit Code  1
Stdout     
           failed to start all system specs after 1200 with exit code 1

Stderr     Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
           Error from server (InternalError): error when applying patch:
           {"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"v1\",\"kind\":\"Namespace\",\"metadata\":{\"annotations\":{},\"name\":\"pks-system\"}}\n"},"labels":null}}
           to:
           Resource: "/v1, Resource=namespaces", GroupVersionKind: "/v1, Kind=Namespace"
           Name: "pks-system", Namespace: ""
           for: "/var/vcap/jobs/apply-specs/specs/addon-spec.yml": Internal error occurred: failed calling webhook "check-ignore-label.gatekeeper.sh": Post https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admitlabel?timeout=3s: service "gatekeeper-webhook-service" not found




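CAUSE:
This can happen when Gatekeeper is removed from the cluster, for example through an internal Concourse pipeline, but the removal does not delete the additional Gatekeeper objects, such as:
A PodSecurityPolicy for Gatekeeper (example: gatekeeper-admin)
A validatingwebhookconfiguration for Gatekeeper
A clusterrole for Gatekeeper
A clusterrolebinding for Gatekeeper

The leftover validatingwebhookconfiguration still tells the API server to call the Gatekeeper webhook service. That service no longer exists, so requests that trigger the webhook fail, including the pks-system namespace patch made by apply-addons.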
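
To confirm this cause, you can check whether the Service named in the error still exists. This is a minimal sketch under stated assumptions: check_gatekeeper_webhook_service is a hypothetical helper name, and the default service and namespace (gatekeeper-webhook-service, gatekeeper-system) are taken from the example error output in this article and may differ in your cluster.

```shell
# Sketch: report whether the Service that the Gatekeeper webhook points at exists.
# check_gatekeeper_webhook_service is a hypothetical helper name.
# Arguments (both optional): service name, namespace.
# Note: any kubectl failure (including connectivity problems) is reported as
# "missing", so rerun the kubectl command by hand if the result is surprising.
check_gatekeeper_webhook_service() {
  gk_svc="${1:-gatekeeper-webhook-service}"
  gk_ns="${2:-gatekeeper-system}"
  if kubectl get service "$gk_svc" -n "$gk_ns" >/dev/null 2>&1; then
    echo "present: service $gk_svc exists in namespace $gk_ns"
  else
    echo "missing: service $gk_svc not found in namespace $gk_ns"
  fi
}

# Usage:
#   check_gatekeeper_webhook_service
```

A "missing" result while a Gatekeeper validatingwebhookconfiguration still exists is consistent with the orphaned-object situation described above.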

Environment

Product Version: 1.13

Resolution

SOLUTION:
Identify orphaned objects with commands such as the ones below:

kubectl get psp | grep gatekeeper
kubectl get validatingwebhookconfiguration | grep gatekeeper
kubectl get clusterrole | grep gatekeeper
kubectl get clusterrolebinding | grep gatekeeper
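
The same search can be run in one pass over all four resource kinds. This is a sketch, not an official procedure: list_gatekeeper_objects is a hypothetical helper name, and it assumes that every orphaned object has "gatekeeper" somewhere in its name.

```shell
# Sketch: list cluster-scoped objects whose name contains "gatekeeper".
# list_gatekeeper_objects is a hypothetical helper name.
# "-o name" prints each object as kind/name, a form that kubectl delete accepts.
list_gatekeeper_objects() {
  for gk_kind in podsecuritypolicy validatingwebhookconfiguration clusterrole clusterrolebinding; do
    # "|| true" keeps going when a kind has no match or is not served by the cluster.
    kubectl get "$gk_kind" -o name 2>/dev/null | grep -i gatekeeper || true
  done
}

# Usage:
#   list_gatekeeper_objects
```

Review the list before deleting anything, since a name match alone does not prove that an object is orphaned.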


Remove the orphaned Gatekeeper objects, replacing XXX with the name of each object you identified:

kubectl delete psp XXX
kubectl delete validatingwebhookconfiguration XXX
kubectl delete clusterrole XXX
kubectl delete clusterrolebinding XXX
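
Before deleting, it may be worth previewing the deletion with a client-side dry run. The sketch below is an illustration only: delete_objects is a hypothetical helper name, and it assumes that your kubectl version supports the --dry-run option on delete and that object names arrive on stdin in the kind/name form printed by "kubectl get -o name".

```shell
# Sketch: delete (or preview deleting) objects listed on stdin, one kind/name per line.
# delete_objects is a hypothetical helper name.
# Argument: dry-run mode. "client" (default) only previews; "none" really deletes.
delete_objects() {
  gk_mode="${1:-client}"
  while IFS= read -r gk_obj; do
    [ -n "$gk_obj" ] || continue          # skip empty lines
    kubectl delete "$gk_obj" --dry-run="$gk_mode" </dev/null
  done
}

# Usage: preview first, then delete for real once the list looks right.
#   kubectl get clusterrole -o name | grep gatekeeper | delete_objects
#   kubectl get clusterrole -o name | grep gatekeeper | delete_objects none
```

Defaulting to the preview mode is a deliberate choice here, so that an unexpected name match is visible before anything is removed.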