While Kubernetes typically performs balanced scheduling, it can face issues under load, particularly following node failures. The problem occurs because the scheduler does not automatically rebalance the pods across nodes when a node goes down and a new one is brought online. This results in resource contention, causing Out of Memory (OOM) errors.
When a worker node goes down in a Kubernetes cluster, the pods that were scheduled on that node become unscheduled. Kubernetes attempts to reschedule them onto the remaining nodes, but this process is often inefficient and leads to resource imbalances. As a result, certain pods, such as the PostgreSQL pod or any other resource-intensive pod, can become stuck in the "CrashLoopBackOff" state due to insufficient resources on the other nodes.
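Because the crashes stem from memory pressure, the last container state of an affected pod will often show an OOMKilled termination. The pod name and output below are illustrative only; substitute the pod seen in your environment:
> k describe pod postgresql-ha-postgresql-0 -n nsxi-platform | grep -A 2 "Last State"
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137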
Symptoms:
There may be several symptoms, depending on which pods are in the "CrashLoopBackOff" state.
For example, in the scenario described below under the "Cause" section, where the PostgreSQL pod was not coming up, you might observe the following symptoms in the SSP UI:
The service/namespace "nsxi-platform" turns red.
The "OVERALL" component shows an "Overall Score" of less than 100.
When you scroll down to the individual components, "NSXI-PLATFORM" shows an "Overall Score" of 0.
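The same condition can also be confirmed from the CLI of the SSP-Installer appliance by listing pods that are not healthy. The output below is illustrative only:
> k get pods -A | grep -v Running | grep -v Completed
NAMESPACE       NAME                         READY   STATUS             RESTARTS   AGE
nsxi-platform   postgresql-ha-postgresql-0   0/1     CrashLoopBackOff   12         45m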
vDefend SSP >= 5.0
The Kubernetes scheduler does not automatically rebalance resources across nodes after a node failure or when a new node is added during infrastructure scale-out. This results in some pods being scheduled on nodes with insufficient resources, causing them to crash.
Example:
In a recent evaluation setup with two worker nodes, the following behavior was observed:
Kubernetes Node Resources Before Scaling:
> k top nodes
NAME                            CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
mr027860-vsx-md-0-xdvrk-npgbq   724m         4%     8171Mi          12%
mr027860-vsx-md-0-xdvrk-zmxm6   8389m        52%    28876Mi         45%
mr027860-vsx-qcdzr              537m         13%    2882Mi          37%
In the output above, mr027860-vsx-qcdzr is the control plane node. Among the worker nodes, mr027860-vsx-md-0-xdvrk-zmxm6 is overutilized, with CPU usage at 52% and memory usage at 45%.
This imbalance occurred because the Kubernetes scheduler does not dynamically rebalance running pods when a new node comes online: existing pods stay where they were originally placed, and only newly created pods can land on the new node. As a result, after a node is deleted and a replacement is added, one node can remain overutilized, causing resource-intensive pods such as PostgreSQL to crash.
While this example describes node scaling, a similar issue can occur when an ESXi host reboots or crashes, affecting the Kubernetes worker VMs running on that host.
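To see which pods are contributing to the load on the over-utilized worker, the pods running on that node can be listed directly. The node name below is taken from the example above; substitute the name reported in your environment:
k get pods -A -o wide --field-selector spec.nodeName=mr027860-vsx-md-0-xdvrk-zmxm6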
To resolve the issue, you first need to identify the over-utilized node. You can do this by logging into SSP and navigating to System > Platform & Services > Resources.
Alternatively, you can log into the CLI of the SSP-Installer appliance with root credentials and run the following command:
k top nodes
This will display CPU and memory utilization across nodes, allowing you to identify which node is consuming the most resources.
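Because the scheduler places pods based on their resource requests rather than live usage, it can also help to review how much of the node's capacity is already committed. The node name below is from the example in the "Cause" section; replace it with the over-utilized node in your environment:
k describe node mr027860-vsx-md-0-xdvrk-zmxm6 | grep -A 8 "Allocated resources"
This shows the total CPU and memory requests and limits already allocated on that node.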
Once you've identified the over-utilized node, you can trigger the Kubernetes scheduler to reprocess the pods on that node. There are a couple of ways to do this, listed below in order of the disruption they cause, from least to most:
Option 1: Scale Down the Impacted Application to 0 Replicas and Then Scale Back Up
Identify the pods that are not in a healthy state:
k get pods -A | grep -v Running | grep -v Completed
Describe the affected pod and note the value of the "Controlled By" field, which identifies the Deployment, StatefulSet, ReplicaSet, or DaemonSet that owns the pod:
k describe pod <pod-name-noted-from-above> -n <namespace-noted-from-above>
Example: Controlled By: StatefulSet/postgresql-ha-postgresql
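For reference, the field can be pulled out of the describe output directly. The pod name below is illustrative:
> k describe pod postgresql-ha-postgresql-0 -n nsxi-platform | grep "Controlled By"
Controlled By:  StatefulSet/postgresql-ha-postgresql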
Edit the object noted in the "Controlled By" field:
k edit <Deployment/StatefulSet/ReplicaSet/DaemonSet-noted-from-controlled-by-output-above> <name-of-the-set-noted-from-controlled-by-field> -n <namespace-of-the-pod-noted-above>
spec:
  progressDeadlineSeconds: 600
  replicas: 2   # <<<< Note this
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: pgpool
      app.kubernetes.io/instance: nsxi-platform
      app.kubernetes.io/name: postgresql-ha
In the output above, "replicas: 2" means the set is configured to run 2 replicas. Note this value so that it can be restored when scaling back up.
Change "replicas" to 0, then save and close the editor. Validate that the pods that were in the "CrashLoopBackOff" state are terminated by running:
k get pods -A | grep "pod-name-that-was-in-crashLoop"
Edit the set again and restore "replicas" to the value noted earlier (2 in this example), then confirm that all pods return to the Running state:
k get pods -A
This approach limits the disruption to the affected service and its dependents, while still triggering Kubernetes to reschedule the pods based on currently available resources.
Option 2: Delete the Worker Node with Higher Utilization
Steps for Deleting the Node:
k delete node <node-name>
For the example utilization shown in the "Cause" section, the command would be:
k delete node mr027860-vsx-md-0-xdvrk-zmxm6
Deleting the node evicts the pods that were running on it and forces the Kubernetes scheduler to place them again based on current resource availability across the remaining nodes, helping balance the load as desired.
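After the node is deleted, you can monitor the cluster to confirm that the previously crashing pods are rescheduled and reach the Running state, for example:
k get nodes
k get pods -A | grep -v Running | grep -v Completed
Once everything has settled, the second command should return only the header line.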
Note:
All commands should be executed from the SSP Installer VM, where the "k" alias is already configured to access kubectl for the SSP cluster:
alias k='/config/clusterctl/k_alias_wrapper.sh'
If the expected replica count is already known (for example, the "replicas" value noted while following Option 1 in the Resolution above), you can scale the set directly instead of editing it:
For scaling down:
k scale <Deployment/StatefulSet/ReplicaSet/DaemonSet-noted-from-controlled-by-output-above> <name-of-the-set-noted-from-controlled-by-field> -n <namespace-of-the-pod-noted-above> --replicas=0
Validate that the pods are terminated by running:
k get pods -A | grep <pod-name-that-was-in-crashLoop>
For scaling back up:
k scale <Deployment/StatefulSet/ReplicaSet/DaemonSet-noted-from-controlled-by-output-above> <name-of-the-set-noted-from-controlled-by-field> -n <namespace-of-the-pod-noted-above> --replicas=<original-value>
This will trigger the Kubernetes scheduler to reprocess the pods based on the updated resource availability, forcing them to be scheduled according to the current node utilization.
For the example described in the "Cause" section above, the following commands achieve the same result:
k scale deployment postgresql-ha-pgpool -n nsxi-platform --replicas=0
Wait until the pgpool pods have terminated, then scale back up:
k scale deployment postgresql-ha-pgpool -n nsxi-platform --replicas=2
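After scaling back up, it is worth confirming that the pods are Running and that the load is now spread more evenly across the worker nodes. The grep pattern below assumes the pgpool example above:
k get pods -n nsxi-platform | grep pgpool
k top nodes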