Kubernetes pods (CNFs) in Telco Cloud Automation (TCA) placed in Evicted state due to Node Disk Pressure

Article ID: 325367

Updated On:

Products

VMware Telco Cloud Automation

Issue/Introduction

The purpose of this article is to provide troubleshooting guidelines for scenarios where Kubernetes pods go into an evicted state due to disk pressure.

Symptoms:
  1. Kubernetes pods terminate and enter an evicted state.
  2. When running kubectl get events -n namespace, the following errors are observed (an example query is shown after this list):
    1. failed to garbage collect required amount of images
    2. disk-pressure warnings for the associated namespace.
  3. When running kubectl describe pod -n namespace podname, the following errors are observed:
    1. NodeHasDiskPressure
    2. Attempting to reclaim ephemeral-storage
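
For example, evicted pods and eviction events in the affected namespace can be listed with the following commands (namespace is a placeholder; field-selector support depends on the kubectl version):

    $kubectl get pods -n namespace | grep Evicted
    $kubectl get events -n namespace --field-selector reason=Evicted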


Environment

VMware Telco Cloud Automation 2.0.1
VMware Telco Cloud Automation 2.1.1
VMware Telco Cloud Automation 1.9
VMware Telco Cloud Automation 1.8
VMware Telco Cloud Automation 2.1
VMware Telco Cloud Automation 2.0
VMware Telco Cloud Automation 1.x
VMware Telco Cloud Automation 1.9.5

Cause

In Kubernetes, Pods can be evicted from a Node due to insufficient resources.

In addition to terminating Pods, whenever a node experiences disk pressure, a process called node-pressure eviction can activate, in which the kubelet performs garbage collection and removes dormant objects (such as unused images and dead containers) so that they no longer consume disk resources.

When a pod is terminated, core* temporary files can be generated on the node; if these are not cleaned up properly, they can lead to disk exhaustion.

While this garbage collection is automated, manual intervention may be required when it cannot reclaim enough disk space.
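
To confirm that a node is reporting disk pressure, its conditions can be inspected from the Master node, for example (replace nodename with the name of the affected worker node):

    $kubectl describe node nodename | grep -i pressure

A node under disk pressure reports the DiskPressure condition as True.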

Resolution

No permanent resolution.

Workaround:

Procedure 1 – Clean up corefiles

  1. SSH into the worker node as the root user.
  2. Obtain the file system disk usage by running the following command:

    $df -kh

     
  3. Confirm that the root (/) partition is highly utilized, e.g. over 85% full.
  4. Navigate to the /data/storage/corefiles directory:

    $cd /data/storage/corefiles

     
  5. Obtain the total size of the directory by running the following command:

    $du -s -h

    Note: This value is the amount of space that will be cleaned up.

     
  6. List the files to confirm there are corefiles present.

     $ls -lrth

     
  7. From the /data/storage/corefiles directory, run the following command to remove all corefiles (a more selective variant is shown at the end of this procedure):

    $rm -rf core*

     
  8. Review the pod status by running the following command:

    $kubectl get pods -A -o wide | grep nodename

    Note: Replace nodename in the example above with a valid node name (CNF name).

     
  9. Confirm the Pods are in a Running state as expected.
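
As a more selective alternative to step 7, corefiles older than a given age can be removed instead of all of them, for example (the 7-day threshold is only an illustrative value):

    $find /data/storage/corefiles -name 'core*' -mtime +7 -delete
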
If the issue is not resolved, proceed to Procedure 2.
 

Procedure 2 – Clean up and re-instantiate CNF(s)

  1. SSH to the worker node as the root user.
  2. Run the following command to confirm the problematic container is in a running state, and note its CONTAINER ID (a filtering example is shown at the end of this procedure):

    $crictl ps -a

     
  3. Obtain the IMAGE ID of the image used by the problematic container by running the following command:

    $crictl images

     
  4. Run the following command to immediately stop and remove the problematic container:

    $crictl stop containerId ; crictl rm containerId

    Note: Replace containerId in the example above with the CONTAINER ID of the problematic container, as shown in the output of crictl ps -a.
    Note: These commands must be run together on a single line so that the container is removed immediately after it is stopped, before a replacement container is created.
     
  5. Once the container has been terminated, terminate the CNF via the TCA UI.
    After the CNF has been terminated, run the following command to remove the image:

    $crictl rmi imageId

    Note: Replace imageId in the example above with the IMAGE ID of the image associated with the problematic container (from step 3).
     
  6. From the Master node, run the following command to confirm that the problematic container(s) have been terminated.
     
    $kubectl get pods -A -o wide | grep nodename

    Note: Replace nodename in the example above with a valid node name (CNF name).
    Note: Results should show only kube-system pods.

     
  7. Re-instantiate CNFs from TCA.
     
  8. Once the instantiation has completed, run the following command to ensure the pods are in a Running state:

    $kubectl get pods -A -o wide | grep nodename

    Note: Replace nodename in the example above with a valid node name (CNF name).
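
If the worker node runs many containers, the crictl output from steps 2 and 3 can be filtered to locate the problematic container and its image, for example (containerName and imageName are placeholder search strings):

    $crictl ps -a | grep containerName
    $crictl images | grep imageName
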
If multiple evicted pods still exist, proceed to Procedure 3.
 

Procedure 3 – Delete all pods in an Evicted state

  1. Run the following command to delete any remaining Pods in an Evicted state:

    $kubectl get pods | grep Evicted | awk '{print $1}' | xargs kubectl delete pod
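
    The command above operates only on the current namespace. To clean up Evicted pods across all namespaces, a variant such as the following can be used (a sketch that assumes the default NAMESPACE and NAME columns of kubectl get pods -A output):

    $kubectl get pods -A | grep Evicted | awk '{print "-n " $1 " " $2}' | xargs -L1 kubectl delete pod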
     


Additional Information

Impact/Risks:
Impacts all versions of Telco Cloud Automation (TCA).