API Gateway: Container Gateway running on the Kubernetes platform is booting in a loop / not completing boot.

Article ID: 198286

Products

CA API Gateway, CA Microgateway

Issue/Introduction

This article discusses the API Gateway container booting in a loop, and therefore never completing a full boot, when running on Kubernetes. When the Gateway container is hosted in Kubernetes with health checks (liveness probe, readiness probe, etc.) and the probes begin checking before the Gateway container has finished booting, Kubernetes treats the container as failed and restarts it. The result is a continuous boot loop in which the Gateway never finishes booting, so it never accepts traffic or works as intended.

Let's consider the following scenario:

  • The Gateway container takes 3 minutes to complete a boot.
  • The Kubernetes liveness and readiness probes are configured to start checking at 2 minutes, with a 5-second check interval, and require 3 consecutive failures before the container is marked down and a new one is started.

In that scenario, the Gateway container ends up in a continuous boot loop: the probes give it only about 2 minutes and 15 seconds to boot (the 2-minute initial delay plus 5 second interval * 3 failures), but it needs at least 3 minutes to complete the boot process.
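
As a rough illustration, the scenario above might correspond to probe settings like the following in the container spec. This is a sketch under stated assumptions, not an excerpt from a Gateway manifest: the tcpSocket handler and port 8443 are placeholders, so keep whichever probe handler your deployment already uses. Only the three timing values come from the scenario.

```yaml
# Sketch of the problematic timings. Handler and port are assumptions.
livenessProbe:
  tcpSocket:
    port: 8443              # placeholder, not taken from this article
  initialDelaySeconds: 120  # probing starts 2 minutes after container start
  periodSeconds: 5          # probe every 5 seconds
  failureThreshold: 3       # container is restarted after 3 consecutive failures
readinessProbe:
  tcpSocket:
    port: 8443              # placeholder, not taken from this article
  initialDelaySeconds: 120
  periodSeconds: 5
  failureThreshold: 3       # for readiness, the Pod is marked Unready instead
# Approximate liveness budget: 120 + (5 * 3) = 135 seconds,
# which is shorter than the 180 seconds the Gateway needs to boot.
```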

Environment

This article applies to all supported Gateway versions where the Kubernetes platform is also supported.

Cause

The root cause is that the probe timings configured in the Kubernetes environment for the container are too short. If, for example, the container Gateway takes 3 minutes to load all of the policies and other data it needs to run, but the Kubernetes health check probes are configured to start checking just 1 minute in, Kubernetes marks the container as down and starts a new one, resulting in the continuous boot loop.

Resolution

At a high level, make sure the YAML file is not configured to start the probe checks until after the expected boot time of the container Gateway.

The recommended solution is to do either one of the following:

  • Remove the probes altogether if they are not required.
  • If probes are required for the environment, temporarily remove them so the container Gateway can complete its boot process, and note how long the boot takes. Once the boot time is known, set the probes to start about 1-2 minutes after that expected boot time. The buffer covers days when the system is slower than normal or when there are many more policies to load (after a GMU import, for example), and it avoids having to update the value constantly if policy counts change frequently.
    • Specifically, if the Gateway takes 3 minutes to boot, the initialDelaySeconds value on all of the probes should be at the very least 180 seconds (3 minutes), but ideally closer to 4 or 5 minutes, meaning an initialDelaySeconds value between 240 and 300 (assuming the boot time is about 3 minutes).
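
For a Gateway that boots in about 3 minutes, the recommendation could be applied along these lines. Treat this as a minimal sketch: the tcpSocket handler and port 8443 are placeholder assumptions rather than values from this article, and 300 is simply the upper end of the suggested buffer. Substitute the boot time you measured plus 1-2 minutes.

```yaml
# Sketch only: delay every probe until after the measured boot time plus a buffer.
livenessProbe:
  tcpSocket:
    port: 8443              # placeholder handler, keep your existing one
  initialDelaySeconds: 300  # 180 s measured boot + 120 s buffer
readinessProbe:
  tcpSocket:
    port: 8443              # placeholder handler, keep your existing one
  initialDelaySeconds: 300  # same delay on all probes
```

Fields that are left out, such as periodSeconds and failureThreshold, fall back to their Kubernetes defaults.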

Additional Information

The following definitions come from the Kubernetes documentation on probes. Make sure these values are understood before configuring them in the YAML file for the container Gateway:

initialDelaySeconds: Number of seconds after the container has started before liveness or readiness probes are initiated. Defaults to 0 seconds. Minimum value is 0.
periodSeconds: How often (in seconds) to perform the probe. Defaults to 10 seconds. Minimum value is 1.
timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
successThreshold: Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness. Minimum value is 1.
failureThreshold: When a probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in case of liveness probe means restarting the container. In case of readiness probe the Pod will be marked Unready. Defaults to 3. Minimum value is 1.
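
To make the defaults concrete, the fragment below writes out every timing field at the default value listed above. Omitting the fields should give the same behavior. The tcpSocket handler and port 8443 are again placeholder assumptions, not values from this article.

```yaml
# Sketch: a liveness probe with all timing fields at their documented defaults.
livenessProbe:
  tcpSocket:
    port: 8443              # placeholder handler
  initialDelaySeconds: 0    # probing may begin as soon as the container starts
  periodSeconds: 10         # probe every 10 seconds
  timeoutSeconds: 1         # each probe attempt times out after 1 second
  successThreshold: 1       # must be 1 for liveness
  failureThreshold: 3       # restart after 3 consecutive failures
# With these defaults the container can be restarted roughly 30 seconds
# (10 * 3) after it starts, well before a multi-minute Gateway boot completes.
```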

For more on Kubernetes probes, see the Kubernetes documentation: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/