Pod STATUS got Error due to "KubeletHasDiskPressure kubelet has disk pressure"



Article ID: 391806


Updated On:

Products

VMware vSphere Kubernetes Service Tanzu Kubernetes Grid

Issue/Introduction

Disk pressure in Kubernetes worker nodes can lead to pod evictions and degraded cluster performance. One common cause of disk pressure is the accumulation of exited containers that are not automatically cleaned up. Users experiencing frequent disk pressure issues may seek an automated solution, such as a CronJob or DaemonSet, to prune unused images and prevent recurring storage constraints.

kubectl describe node <node-name> | grep True

DiskPressure True Mon, 24 Mar 2025 10:48:06 +0800 Mon, 24 Mar 2025 10:44:11 +0800 KubeletHasDiskPressure kubelet has disk pressure

kubectl get pod -A | grep -v Running

NAME         READY     STATUS
example-pod  0/1       Error # pod stuck in error state
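To quickly check which nodes are reporting disk pressure, the describe output can be filtered. The helper below is a hypothetical convenience wrapper around the same grep shown above; the usage line assumes cluster access, while the filter itself works on any text fed to it.

```shell
# Hypothetical helper: print only DiskPressure condition lines that are currently True.
# Reads `kubectl describe nodes` output (or any text) from stdin.
disk_pressure_lines() {
  awk '/DiskPressure/ && /True/'
}

# Usage (requires cluster access):
# kubectl describe nodes | disk_pressure_lines
```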

Environment

vSphere with Tanzu

Cause

Disk pressure occurs when the available storage space on a worker node falls below a predefined threshold. Kubernetes responds by triggering eviction processes, which may affect pod availability. The primary contributors to disk pressure include:

  • Accumulation of exited containers that are not automatically removed.

  • Large numbers of unused container images.

  • Excessive log file growth consuming disk space.

  • Inefficient resource management leading to excessive storage usage.
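To determine which of these contributors dominates on a given node, a quick du pass over the usual locations helps. The function below is a sketch; the paths in the usage comment are typical for containerd-based nodes and are assumptions to adjust for your runtime.

```shell
# Sketch: report disk usage for each directory passed in that actually exists.
usage_report() {
  for d in "$@"; do
    if [ -d "$d" ]; then
      du -sh "$d"
    fi
  done
}

# Usage on a worker node (paths assumed; adjust for your container runtime):
# usage_report /var/log /var/log/pods /var/lib/containerd /var/lib/kubelet
```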

Resolution

Kubernetes does not currently provide a built-in DaemonSet- or CronJob-based mechanism for removing exited containers and reclaiming disk space. However, the issue can be resolved manually by performing the following actions:

 

## Clean up System Logs under /var/log/ - selectively truncate the logs below if they take up too much space - Not much will be reclaimed.
# cd /var/log
# ls *.log | xargs truncate -s 0
## Note: processes still holding the log files open may delay the return of disk space.

## Clean up Pod Logs - Delete archived Pod logs
# cd /var/log/pods
# find . -name '*.gz' | xargs rm
# find . -name '*.log.*' | xargs rm

## Trimming Journal Logs
# journalctl --disk-usage
  Archived and active journals take up 328.0M in the file system.
# journalctl --vacuum-size=200M   ## Significant space can be reclaimed.
Deleted archived journal /var/log/journal/2b6c5e2d02d4482085a6953ad2e0c450/system@013c56251582472993eb1ee01f4a8622-0000000000000001-0006274c4177688a.journal (128.0M).
Vacuuming done, freed 128.0M of archived journals from /var/log/journal/2b6c5e2d02d4482085a6953ad2e0c450.
Vacuuming done, freed 0B of archived journals from /run/log/journal.
Vacuuming done, freed 0B of archived journals from /var/log/journal.

# journalctl --disk-usage
Archived and active journals take up 200.0M in the file system.

## Clean up Container Images
# crictl images                               ## Identify old container images no longer in use.
# crictl rmi <image-id>; crictl rmi --prune   ## Delete images & reclaim the disk space.

## Delete Exited Containers
# crictl ps -a -q --state exited | xargs crictl rm   ## Not much space will be reclaimed.


Additional Information

Preventive Measures

To mitigate disk pressure issues proactively, consider implementing the following best practices:

  1. Identify the Source of Disk Pressure

    • Use monitoring tools to analyze storage consumption and identify workloads contributing to disk pressure.

  2. Optimize Resource Usage

    • Scale down unnecessary workloads.

    • Adjust resource limits for more efficient storage utilization.

    • Use lightweight container images where possible.

  3. Manage Log Files

    • Ensure logs are properly rotated and archived.

    • Configure log retention policies to prevent excessive disk usage.

  4. Regularly Clean Up Unused Resources

    • Implement periodic cleanup of unused containers, images, and other unnecessary resources.

  5. Increase Disk Capacity

    • If disk pressure persists despite optimizations, consider expanding existing disk volumes or adding additional storage to worker nodes.
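As a sketch of step 4, the log-truncation and archive-deletion actions from the Resolution section can be wrapped in a small script and run on each node from cron or a systemd timer. The function name and the parameterized log directory are assumptions for illustration, not an official tool.

```shell
# Sketch: periodic node-local log cleanup (run via cron or a systemd timer).
# Truncates active *.log files (keeps open file handles valid) and deletes
# rotated .gz archives, mirroring the manual steps in the Resolution section.
cleanup_logs() {
  logdir="$1"
  find "$logdir" -maxdepth 1 -name '*.log' -exec truncate -s 0 {} +
  find "$logdir" -name '*.gz' -delete
}

# Usage on a node (path assumed):
# cleanup_logs /var/log
```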

By following these best practices, users can minimize disk pressure issues and maintain a stable Kubernetes environment.