Unable to create or schedule any pods on a TKGI cluster. The issue persists even though BOSH and the monit summary indicate that the nodes and master components are healthy. Manually creating a pod fails with the following error:
Internal error occurred: failed calling webhook "vault.hashicorp.com": no endpoints available for service "vault-agent-injector-svc"
TKGI Version: 1.22.4
3rd Party Component: HashiCorp Vault Agent Injector
The Vault Agent Injector registers a MutatingWebhookConfiguration in the cluster. This configuration requires the Kubernetes API server to contact the Vault injector service synchronously for every pod creation request to determine if Vault sidecars should be injected.
In this scenario, the injector service (vault-agent-injector-svc) had no healthy backing pods (endpoints), likely due to an improper uninstallation or failure of the Vault agent deployment. Because the webhook was configured with failurePolicy: Fail, the Kubernetes API server was forced to reject all pod creation requests when it could not reach the service, even for pods that do not require Vault.
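The two conditions described above (a webhook with failurePolicy: Fail and a service with no endpoints) can be checked from the command line. The following is a minimal sketch, assuming kubectl is already pointed at the affected cluster with sufficient rights. The function names list_webhook_policies and service_has_endpoints are hypothetical helpers introduced here for illustration; they are not part of kubectl or Vault.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; only the kubectl calls are real.

# Print one line per mutating webhook: name, failurePolicy, and the
# namespace/name of the Service it calls (blank for URL-based webhooks).
list_webhook_policies() {
  kubectl get mutatingwebhookconfigurations \
    -o jsonpath='{range .items[*]}{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\t"}{.clientConfig.service.namespace}{"/"}{.clientConfig.service.name}{"\n"}{end}{end}'
}

# Succeed only if the Service has at least one ready endpoint address.
# Usage: service_has_endpoints <namespace> <service>
service_has_endpoints() {
  local ips
  ips=$(kubectl get endpoints "$2" -n "$1" \
    -o jsonpath='{.subsets[*].addresses[*].ip}') || return 1
  # An empty result means no ready pods back the Service.
  [ -n "$ips" ]
}
```

For example, `service_has_endpoints <namespace> vault-agent-injector-svc || echo "no ready endpoints"` would be expected to print the message on a cluster in the state described here.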
To resolve the issue, restore the health of the Vault Agent Injector service so the webhook has a valid endpoint to communicate with.
Identify the Webhook: Confirm the problematic webhook configuration:
kubectl get mutatingwebhookconfigurations
Verify Service Endpoints: Check if the service referenced in the error has active endpoints:
kubectl get endpoints vault-agent-injector-svc -n <namespace>
Restore the Deployment: Reinstall or repair the Vault Agent Injector deployment to ensure at least one pod is running and ready.
Temporary Workaround (Emergency Only): If the cluster must remain operational while the injector is being repaired, you can temporarily change the failurePolicy from Fail to Ignore in the MutatingWebhookConfiguration.
Warning: This may result in pods being created without required Vault sidecars.
kubectl edit mutatingwebhookconfiguration <webhook-name>
# Change failurePolicy: Fail to failurePolicy: Ignore
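If an interactive editor is impractical during an outage, the same change can probably be made with kubectl patch. This is a sketch under stated assumptions, not a vendor-provided procedure: set_failure_policy is a hypothetical helper, and the index argument assumes you have already checked which entry in the webhooks list belongs to the Vault injector (often 0 when the configuration holds a single webhook).

```shell
#!/usr/bin/env bash
# Hypothetical helper: set failurePolicy on one webhook entry without an editor.
# Usage: set_failure_policy <webhook-config-name> <webhook-index> <Fail|Ignore>
set_failure_policy() {
  local name="$1" index="$2" policy="$3"

  # The API accepts only these two values, so reject anything else early.
  case "$policy" in
    Fail|Ignore) ;;
    *) echo "policy must be Fail or Ignore" >&2; return 2 ;;
  esac

  # JSON Patch "replace" on the chosen entry of the .webhooks list.
  kubectl patch mutatingwebhookconfiguration "$name" --type=json \
    -p "[{\"op\":\"replace\",\"path\":\"/webhooks/${index}/failurePolicy\",\"value\":\"${policy}\"}]"
}
```

Calling `set_failure_policy <webhook-name> 0 Ignore` applies the workaround. Once the injector is healthy again, the same call with `Fail` restores the original behavior.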
Clean Up (If Uninstalling): If the intent was to remove Vault, make sure the MutatingWebhookConfiguration is also deleted manually once the deployment is removed. Otherwise the orphaned webhook will continue to block pod creation requests.
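The clean-up step can be sketched as follows, assuming Vault is being removed for good and the name of the webhook configuration has been taken from the output of kubectl get mutatingwebhookconfigurations. Both function names are hypothetical and exist only for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for removing an orphaned webhook configuration.

# Delete the configuration; --ignore-not-found makes the call safe to repeat.
# Usage: remove_webhook_config <webhook-config-name>
remove_webhook_config() {
  kubectl delete mutatingwebhookconfiguration "$1" --ignore-not-found
}

# Succeed only if the configuration is still present in the cluster.
# Usage: webhook_config_exists <webhook-config-name>
webhook_config_exists() {
  kubectl get mutatingwebhookconfiguration "$1" >/dev/null 2>&1
}
```

After `remove_webhook_config <webhook-name>`, a check such as `webhook_config_exists <webhook-name> || echo "webhook removed"` gives a quick confirmation before retrying pod creation.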