While Kubernetes typically performs balanced scheduling, it can face issues under load, particularly following node failures. The problem occurs because the scheduler does not automatically rebalance the pods across nodes when a node goes down and a new one is brought online. This results in resource contention, causing Out of Memory (OOM) errors.
When a worker node goes down in a Kubernetes cluster, the pods that were scheduled on that node become unscheduled. Kubernetes attempts to reschedule them onto the remaining nodes, but this process is often inefficient and leads to resource imbalances. As a result, certain pods, such as the PostgreSQL pod or any other resource-intensive pod, can become stuck in the "CrashLoopBackOff" state due to insufficient resources on the other nodes.
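Because the crashes stem from memory pressure, the last container state of an affected pod will often show an OOMKilled termination. The pod name and output below are illustrative only; substitute the pod seen in your environment:
> k describe pod postgresql-ha-postgresql-0 -n nsxi-platform | grep -A 2 "Last State"
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137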
Symptoms:
There may be several symptoms, depending on which pods are in the "CrashLoopBackOff" state.
For example, in the scenario described below under the "Cause" section, where the PostgreSQL pod was not coming up, you might observe the following symptoms in the SSP UI:
The service/namespace "nsxi-platform" turns red.
The "OVERALL" component shows an "Overall Score" of less than 100.
When you scroll down to the individual components, "NSXI-PLATFORM" shows an "Overall Score" of 0.
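The same condition can also be confirmed from the CLI of the SSP-Installer appliance by listing pods that are not healthy. The output below is illustrative only:
> k get pods -A | grep -v Running | grep -v Completed
NAMESPACE       NAME                         READY   STATUS             RESTARTS   AGE
nsxi-platform   postgresql-ha-postgresql-0   0/1     CrashLoopBackOff   12         45m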
vDefend SSP >= 5.0
The Kubernetes scheduler does not automatically rebalance resources across nodes after a node failure or when a new node is added during infrastructure scale-out. This results in some pods being scheduled on nodes with insufficient resources, causing them to crash.
Example:
In a recent evaluation setup with two worker nodes, the following behavior was observed:
Kubernetes Node Resources Before Scaling:
> k top nodes
NAME                            CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
mr027860-vsx-md-0-xdvrk-npgbq   724m         4%     8171Mi          12%
mr027860-vsx-md-0-xdvrk-zmxm6   8389m        52%    28876Mi         45%
mr027860-vsx-qcdzr              537m         13%    2882Mi          37%
In the output above, mr027860-vsx-qcdzr is the control plane node. Among the worker nodes, mr027860-vsx-md-0-xdvrk-zmxm6 is overutilized, with CPU usage at 52% and memory usage at 45%.
This imbalance occurred because the Kubernetes scheduler does not dynamically rebalance running pods when a new node comes online: existing pods stay where they were originally placed, and only newly created pods can land on the new node. As a result, after a node is deleted and a replacement is added, one node can remain overutilized, causing resource-intensive pods such as PostgreSQL to crash.
While this example describes node scaling, a similar issue can occur when an ESXi host reboots or crashes, affecting the Kubernetes worker VMs running on that host.
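To see which pods are contributing to the load on the over-utilized worker, the pods running on that node can be listed directly. The node name below is taken from the example above; substitute the name reported in your environment:
k get pods -A -o wide --field-selector spec.nodeName=mr027860-vsx-md-0-xdvrk-zmxm6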
To resolve the issue, you first need to identify the over-utilized node. You can do this by logging into SSP and navigating to System > Platform & Services > Resources.
Alternatively, you can log into the CLI of the SSP-Installer appliance with root credentials and run the following command:
k top nodes
This will display CPU and memory utilization across nodes, allowing you to identify which node is consuming the most resources.
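Because the scheduler places pods based on their resource requests rather than live usage, it can also help to review how much of the node's capacity is already committed. The node name below is from the example in the "Cause" section; replace it with the over-utilized node in your environment:
k describe node mr027860-vsx-md-0-xdvrk-zmxm6 | grep -A 8 "Allocated resources"
This shows the total CPU and memory requests and limits already allocated on that node.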
Once you've identified the over-utilized node, you can trigger the Kubernetes scheduler to reprocess the pods on that node. There are a couple of ways to do this, listed below in order of the disruption they cause, from least to most:
Option 1: Scale Down the Impacted Application to 0 Replicas and Then Scale Back Up
Identify the pods that are not in a healthy state:
k get pods -A | grep -v Running | grep -v Completed
Describe the affected pod and note the value of the "Controlled By" field, which identifies the Deployment, StatefulSet, ReplicaSet, or DaemonSet that owns the pod:
k describe pod <pod-name-noted-from-above> -n <namespace-noted-from-above>
Example: Controlled By: StatefulSet/postgresql-ha-postgresql
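For reference, the field can be pulled out of the describe output directly. The pod name below is illustrative:
> k describe pod postgresql-ha-postgresql-0 -n nsxi-platform | grep "Controlled By"
Controlled By:  StatefulSet/postgresql-ha-postgresql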
Edit the object noted in the "Controlled By" field:
k edit <Deployment/StatefulSet/ReplicaSet/DaemonSet-noted-from-controlled-by-output-above> <name-of-the-set-noted-from-controlled-by-field> -n <namespace-of-the-pod-noted-above>
spec:
  progressDeadlineSeconds: 600
  replicas: 2   # <<<< Note this
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: pgpool
      app.kubernetes.io/instance: nsxi-platform
      app.kubernetes.io/name: postgresql-ha
In the output above, "replicas: 2" means the set is configured to run 2 replicas. Note this value so that it can be restored when scaling back up.
Change "replicas" to 0, then save and close the editor. Validate that the pods that were in the "CrashLoopBackOff" state are terminated by running:
k get pods -A | grep "pod-name-that-was-in-crashLoop"
Edit the set again and restore "replicas" to the value noted earlier (2 in this example), then confirm that all pods return to the Running state:
k get pods -A
This approach limits the disruption to the affected service and its dependents, while still triggering Kubernetes to reschedule the pods based on currently available resources.
Option 2: Delete the Worker Node with Higher Utilization
Steps for Deleting the Node:
k delete node <node-name>
For the example utilization shown in the "Cause" section, the command would be:
k delete node mr027860-vsx-md-0-xdvrk-zmxm6
Deleting the node evicts the pods that were running on it and forces the Kubernetes scheduler to place them again based on current resource availability across the remaining nodes, helping balance the load as desired.
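After the node is deleted, you can monitor the cluster to confirm that the previously crashing pods are rescheduled and reach the Running state, for example:
k get nodes
k get pods -A | grep -v Running | grep -v Completed
Once everything has settled, the second command should return only the header line.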
Note:
All commands should be executed from the SSP Installer VM, where the "k" alias is already configured to access kubectl for the SSP cluster:
alias k='/config/clusterctl/k_alias_wrapper.sh'
If the expected replica count is already known (for example, the "replicas" value noted while following Option 1 in the Resolution above), you can scale the set directly instead of editing it:
For scaling down:
k scale <Deployment/StatefulSet/ReplicaSet/DaemonSet-noted-from-controlled-by-output-above> <name-of-the-set-noted-from-controlled-by-field> -n <namespace-of-the-pod-noted-above> --replicas=0
Validate that the pods are terminated by running:
k get pods -A | grep <pod-name-that-was-in-crashLoop>
For scaling back up:
k scale <Deployment/StatefulSet/ReplicaSet/DaemonSet-noted-from-controlled-by-output-above> <name-of-the-set-noted-from-controlled-by-field> -n <namespace-of-the-pod-noted-above> --replicas=<original-value>
This will trigger the Kubernetes scheduler to reprocess the pods based on the updated resource availability, forcing them to be scheduled according to the current node utilization.
For the example described in the "Cause" section above, the following commands achieve the same result:
k scale deployment postgresql-ha-pgpool -n nsxi-platform --replicas=0
Wait until the pgpool pods have terminated, then scale back up:
k scale deployment postgresql-ha-pgpool -n nsxi-platform --replicas=2
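After scaling back up, it is worth confirming that the pods are Running and that the load is now spread more evenly across the worker nodes. The grep pattern below assumes the pgpool example above:
k get pods -n nsxi-platform | grep pgpool
k top nodes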