Resolving High Memory Usage and Error Messages Related to Failed Webhook Calls in Kubernetes

Article ID: 380988

Products

  • VMware Tanzu Kubernetes Grid Integrated Edition
  • VMware Tanzu Kubernetes Grid
  • VMware Tanzu Kubernetes Grid 1.x
  • VMware Tanzu Kubernetes Grid Plus 1.x
  • VMware Tanzu Kubernetes Grid Plus

Issue/Introduction

In Kubernetes clusters, the kube-apiserver logs may show repeated warnings about failed webhook calls. The underlying error varies and can include:

  • context deadline exceeded
  • connection refused

An example log message might look like:

Failed calling webhook, failing open validation.gatekeeper.sh: failed calling webhook "validation.gatekeeper.sh": failed to call webhook: Post "https://gatekeeper-webhook-service.gatekeeper-system.svc:443/v1/admit?timeout=3s"

These errors can lead to high memory usage on master node VMs, with utilization nearing 100%, and to related symptoms such as swapping and high CPU usage caused by disk iowait.

Additionally, etcd logs may display warnings such as:

apply request took too long

The durations reported in these warnings extend well beyond the expected limit.
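
To see which webhooks account for the failures, the kube-apiserver log lines can be tallied by webhook name. The following is a minimal Python sketch, assuming each relevant line contains the quoted fragment failed calling webhook "<name>" as in the example log message above. The function name count_webhook_failures is illustrative and not part of any Kubernetes tooling.

```python
import re
from collections import Counter

# Matches the quoted webhook name in fragments such as:
#   failed calling webhook "validation.gatekeeper.sh"
# Assumption: the log lines follow the shape of the example message above.
WEBHOOK_FAILURE = re.compile(r'failed calling webhook "([^"]+)"')


def count_webhook_failures(log_lines):
    """Return a Counter mapping webhook name to the number of failing log lines."""
    counts = Counter()
    for line in log_lines:
        match = WEBHOOK_FAILURE.search(line)
        if match:
            # Count each log line once, keyed by the webhook name it reports.
            counts[match.group(1)] += 1
    return counts
```

Passing the lines of a saved kube-apiserver log to count_webhook_failures and calling most_common() on the result lists the webhooks with the most failed calls first.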

Cause

The errors in the kube-apiserver logs indicate that webhook calls are failing, often because the webhook service is overloaded or unresponsive. As failed calls accumulate, they can increase memory usage and degrade cluster performance. In the log messages, the failures are associated with a specific service URL (https://gatekeeper-webhook-service.gatekeeper-system.svc:443/ in this example).

Resolution

  1. Identify Affected Webhooks: Use kubectl commands to list validating and mutating webhook configurations associated with the failing service URL. Be thorough, as multiple webhook configurations may be in a degraded state.
  2. Temporarily Remove the Webhook Configuration: If memory and performance issues persist, consider temporarily deleting the affected webhook configuration to alleviate immediate pressure. This action is a temporary measure until the root cause is addressed.
  3. Investigate the Root Cause: The administrator responsible for the webhook should investigate and resolve the cause of these failures. Improving the performance and reliability of the webhook service will help prevent recurrence.
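
Step 1 can be sketched in Python as a filter over the output of kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o json. This is a minimal sketch, not a definitive procedure. It assumes that JSON has already been parsed into a dictionary and that the webhooks reference the backend through clientConfig.service; webhooks that use clientConfig.url instead are not matched. The function name find_webhooks_for_service is illustrative. The service name and namespace come from the failing URL; in this article's example they are gatekeeper-webhook-service and gatekeeper-system.

```python
def find_webhooks_for_service(config_list, service_name, namespace):
    """Return the webhooks whose clientConfig.service targets the given Service.

    config_list is the parsed JSON list returned by:
      kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations -o json
    """
    matches = []
    for config in config_list.get("items", []):
        for webhook in config.get("webhooks", []):
            # clientConfig.service is absent when the webhook uses clientConfig.url.
            service = webhook.get("clientConfig", {}).get("service") or {}
            if (service.get("name") == service_name
                    and service.get("namespace") == namespace):
                matches.append({
                    "kind": config.get("kind"),
                    "configuration": config.get("metadata", {}).get("name"),
                    "webhook": webhook.get("name"),
                    "failurePolicy": webhook.get("failurePolicy"),
                    "timeoutSeconds": webhook.get("timeoutSeconds"),
                })
    return matches
```

Each returned entry names a configuration that is a candidate for step 2. Before deleting one with kubectl delete <kind> <name>, it may be prudent to save a copy with kubectl get <kind> <name> -o yaml so the configuration can be restored once the webhook service is healthy again.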