Pod failure with error OOMKilled

Article ID: 407168

Updated On:

Products

VMware vCenter Server

Issue/Introduction

Unexpected OOMKilled event on application pod

Environment

TCA 3.2

Resolution

Note: Collect logs immediately after the issue occurs so that the relevant entries are not lost to log rotation.

For Root Cause Analysis (RCA), open a Broadcom Support case and attach the logs listed below.

  1. Log the output of the terminal session into a file named commands_output.log and upload this log file with the case after running the commands mentioned below.
    Note: If using PuTTY, select the "Printable output" option for session logging.
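    If the commands are run from a Linux or macOS terminal rather than PuTTY, the standard script utility is a simple way to capture everything into the requested file (a sketch; start it once before running the commands and type exit when finished):

    script commands_output.log   # start recording all terminal input and output to the file
    # ... run the commands below ...
    exit                         # stop recording and close commands_output.log
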
    Command Outputs from the Cluster:
    
    kubectl get nodes -A -o wide
    kubectl get pods -A -o wide
    kubectl get pods -A -o wide | grep <problematic pod name>
    kubectl describe pod <problematic pod name> -n <namespace>
    kubectl get pod <problematic pod name> -n <namespace> -o yaml
    kubectl get pod <problematic pod name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
    kubectl logs <problematic pod name> -n <namespace> --previous
    kubectl logs <problematic pod name> -n <namespace>
    kubectl get events -A
    kubectl get pods --all-namespaces -o json | jq -r '.items[] | {pod: .metadata.name, namespace: .metadata.namespace, uid: .metadata.uid, containers: .status.containerStatuses[]?}'
    kubectl top pod -n <problematic pod namespace>
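
    To confirm quickly that the restart was memory related, the last terminated state of the pod's containers normally records the reason and exit code; an OOM kill typically shows OOMKilled with exit code 137 (a sketch using standard kubectl JSONPath, with the same placeholders as above):

    kubectl get pod <problematic pod name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
    kubectl get pod <problematic pod name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'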
    
    Log in to the Node where the Pod was running and collect the following outputs from the Guest OS of the Node:
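    A common way to reach the Guest OS of a TKG node VM is SSH as the capv user from a machine that holds the cluster's node SSH key (a sketch; take the node address from the kubectl get nodes -o wide output above, and note that your environment may use a different user or a jump host):

    ssh capv@<node IP address>
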
    Inside the TKG node VM where the pod was running:
    kubectl top node
    
    Now switch to the root user:
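    For example, assuming the login user has sudo rights (a sketch):

    sudo su -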
    
    crictl ps -a
    crictl pods
    crictl ps -a | grep <problematic pod name>
    crictl ps -q
    crictl ps -q | xargs -n 1 crictl inspect | grep -E "id|pid"
    ctr -n k8s.io containers list
    ctr -n k8s.io containers list | grep <problematic pod name>
    ps -ef
    ps -aux
    ps -e -o pid,ppid,user,args
    dmesg
    dmesg -T
    dmesg -T | grep -i oom
    dmesg -T | grep -i kill
    journalctl
    journalctl -u kubelet
    journalctl -u containerd
    journalctl --since "48 hours ago"
    journalctl --no-pager
    cat /var/log/messages
    free -h
    cat /proc/meminfo
    ps aux
    df -hT
    df -i
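
    To tie an OOM event in the dmesg output back to a specific container before attaching the logs, the victim PID printed in the kernel's "Killed process" line can be matched against the container PIDs inspected above (a sketch; 12345 is a hypothetical PID copied from the dmesg output):

    dmesg -T | grep -i "killed process"
    for c in $(crictl ps -q); do
        crictl inspect "$c" | grep -q '"pid": 12345' && echo "PID 12345 belongs to container $c"
    done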
  2. Log in to the Node where the Pod was scheduled and collect the following files:
    tar -czvf /tmp/var-log-backup-<Replace-with-actual-name-of-the-Node/VM>-$(date +%Y%m%d).tar.gz /var/log
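    Once created, the archive can be copied off the node for upload with the case (a sketch, assuming scp access from a jump host; the same approach applies to the control plane archives collected in the next step):

    scp capv@<node IP address>:/tmp/var-log-backup-*.tar.gz .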
  3. Log in to the control plane Nodes and collect the same files as in the step above:
    tar -czvf /tmp/var-log-backup-<Replace-with-actual-name-of-the-Node/VM>-$(date +%Y%m%d).tar.gz /var/log
  4. Collect the vCenter and relevant ESXi host support bundles for the hosts where the control plane and worker node VMs running the problematic Pod were located.
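    If shell access to an ESXi host is available, the host support bundle can also be generated directly on the host instead of exporting it through the vSphere Client (a sketch; vm-support prints the location of the generated bundle when it completes):

    vm-support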
  5. Collect the TCA-M and TCA-CP support bundles, ensuring DB Dump and Kubernetes Logs are also selected.