Pod failure with error OOMKilled

Article ID: 407168

Updated On:

Products

VMware vCenter Server

Issue/Introduction

Unexpected OOMKilled event on application pod

Environment

TCA 3.2

Resolution

Note: Collect logs immediately after the issue occurs so that the relevant entries are not lost to log rotation.

For Root Cause Analysis (RCA), open a Broadcom Support case and attach the logs listed below.

  1. Log the output of the terminal session into a file named commands_output.log and upload this log file with the case after running the commands mentioned below.
    Note: If using PuTTY, select the "Printable output" option for session logging.
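    If the commands are run from a Linux or macOS terminal rather than PuTTY, the standard script utility is a simple way to capture everything into the requested file (a sketch; start it once before running the commands and type exit when finished):

    script commands_output.log   # start recording all terminal input and output to the file
    # ... run the commands below ...
    exit                         # stop recording and close commands_output.log
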
    Command Outputs from the Cluster:
    
    kubectl get nodes -A -o wide
    kubectl get pods -A -o wide
    kubectl get pods -A -o wide | grep <problematic pod name>
    kubectl describe pod <problematic pod name> -n <namespace>
    kubectl get pod <problematic pod name> -n <namespace> -o yaml
    kubectl get pod <problematic pod name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
    kubectl logs <problematic pod name> -n <namespace> --previous
    kubectl logs <problematic pod name> -n <namespace>
    kubectl get events -A
    kubectl get pods --all-namespaces -o json | jq -r '.items[] | {pod: .metadata.name, namespace: .metadata.namespace, uid: .metadata.uid, containers: .status.containerStatuses[]?}'
    kubectl top pod -n <problematic pod namespace>
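
    To confirm quickly that the restart was memory related, the last terminated state of the pod's containers normally records the reason and exit code; an OOM kill typically shows OOMKilled with exit code 137 (a sketch using standard kubectl JSONPath, with the same placeholders as above):

    kubectl get pod <problematic pod name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
    kubectl get pod <problematic pod name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'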
    
    Log in to the Node where the Pod was running and collect the following outputs from the Guest OS of the Node:
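    A common way to reach the Guest OS of a TKG node VM is SSH as the capv user from a machine that holds the cluster's node SSH key (a sketch; take the node address from the kubectl get nodes -o wide output above, and note that your environment may use a different user or a jump host):

    ssh capv@<node IP address>
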
    Inside the TKG node VM where the pod was running:
    kubectl top node
    
    Now switch to the root user:
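    For example, assuming the login user has sudo rights (a sketch):

    sudo su -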
    
    crictl ps -a
    crictl pods
    crictl ps -a | grep <problematic pod name>
    crictl ps -q
    crictl ps -q | xargs -n 1 crictl inspect | grep -E "id|pid"
    ctr -n k8s.io containers list
    ctr -n k8s.io containers list | grep <problematic pod name>
    ps -ef
    ps -aux
    ps -e -o pid,ppid,user,args
    dmesg
    dmesg -T
    dmesg -T | grep -i oom
    dmesg -T | grep -i kill
    journalctl
    journalctl -u kubelet
    journalctl -u containerd
    journalctl --since "48 hours ago"
    journalctl --no-pager
    cat /var/log/messages
    free -h
    cat /proc/meminfo
    ps aux
    df -hT
    df -i
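
    To tie an OOM event in the dmesg output back to a specific container before attaching the logs, the victim PID printed in the kernel's "Killed process" line can be matched against the container PIDs inspected above (a sketch; 12345 is a hypothetical PID copied from the dmesg output):

    dmesg -T | grep -i "killed process"
    for c in $(crictl ps -q); do
        crictl inspect "$c" | grep -q '"pid": 12345' && echo "PID 12345 belongs to container $c"
    done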
  2. Log in to the Node where the Pod was scheduled and collect the following files:
    tar -czvf /tmp/var-log-backup-<Replace-with-actual-name-of-the-Node/VM>-$(date +%Y%m%d).tar.gz /var/log
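    Once created, the archive can be copied off the node for upload with the case (a sketch, assuming scp access from a jump host; the same approach applies to the control plane archives collected in the next step):

    scp capv@<node IP address>:/tmp/var-log-backup-*.tar.gz .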
  3. Log in to the control plane Nodes and collect the same files as in the step above:
    tar -czvf /tmp/var-log-backup-<Replace-with-actual-name-of-the-Node/VM>-$(date +%Y%m%d).tar.gz /var/log
  4. Collect the vCenter and relevant ESXi host support bundles for the hosts where the control plane and worker node VMs running the problematic Pod were located.
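    If shell access to an ESXi host is available, the host support bundle can also be generated directly on the host instead of exporting it through the vSphere Client (a sketch; vm-support prints the location of the generated bundle when it completes):

    vm-support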
  5. Collect the TCA-M and TCA-CP support bundles, ensuring DB Dump and Kubernetes Logs are also selected.