Upgrade to 1.23.3 or 1.24.2 leads to high disk utilisation due to disk-pressure-watcher service

Article ID: 437011

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

The disk-pressure-watch script reloads images when it detects "missing" components. In versions 1.23.3 and 1.24.2, it looks for deprecated images such as snapshot-validation-webhook, which is no longer part of the CSI job. Because that image can never be found, the check fails on every run and triggers a reload of all related CSI images in a continuous loop.

To confirm the issue, check on a worker node how much data each process has written to disk.

In the example below, watcher.sh has written nearly 65 GB. Given that the worker was updated about 5 hours earlier, that is roughly 13 GB per hour.

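worker:~# printf "%-10s %-15s %s\n" "PID" "WRITTEN_MB" "COMMAND"

# Match only the write_bytes counter (not cancelled_write_bytes) and keep the 10 largest writers
grep -H "^write_bytes" /proc/[0-9]*/io 2>/dev/null | sort -t: -k3 -n | tail -n 10 | while read -r line; do
    # Each line looks like: /proc/<pid>/io:write_bytes: <bytes>
    pid=$(echo "$line" | cut -d'/' -f3)
    bytes=$(echo "$line" | awk '{print $NF}')
    mb=$((bytes / 1024 / 1024))
    # The process may have exited since its counter was read
    cmd=$(ps -p "$pid" -o comm= 2>/dev/null || echo "unknown")
    printf "%-10s %-15s %s\n" "$pid" "${mb}MB" "$cmd"
done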

PID        WRITTEN_MB      COMMAND
......
703        1505MB          bosh-agent
8926       64724MB         watcher.sh
8779       67390MB         containerd
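
The hourly figure is simply the cumulative write_bytes counter divided by the time the process has been running. A minimal sketch of that arithmetic follows. The function name write_rate_mb_per_hour is hypothetical, and the elapsed time is assumed to be supplied in seconds (for example from ps -o etimes= -p PID, where the installed ps supports it).

```shell
# Hypothetical helper: average write rate in MB per hour.
# $1 = cumulative bytes written (write_bytes from /proc/<pid>/io)
# $2 = elapsed seconds since the process started
write_rate_mb_per_hour() {
    if [ "$2" -le 0 ]; then
        echo "elapsed seconds must be positive" >&2
        return 1
    fi
    # Integer arithmetic: bytes -> MB, then scale from seconds to one hour.
    echo $(( $1 / 1024 / 1024 * 3600 / $2 ))
}

# Figures from the sample output: 64724 MB written over about 5 hours.
write_rate_mb_per_hour $((64724 * 1024 * 1024)) $((5 * 3600))   # prints 12944
```

A healthy worker would be expected to show a far lower rate for watcher.sh, so the same calculation can be repeated after applying the mitigation to confirm that the writes have stopped growing.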

In addition, the log file /var/vcap/sys/log/disk-pressure-watch/disk-pressure-watch.stdout.log on the worker nodes contains output like the following:

Missing images detected:
- registry.k8s.io/sig-storage/snapshot-validation-webhook
- cnabu-docker-local.artifactory.eng.vmware.com/pks/cadvisor
- docker.io/antrea/antrea-advanced-agent-debian
- docker.io/antrea/antrea-advanced-controller-debian

Each of these missing images triggers a different job to load images. See KB 382757.

For example, if the snapshot-validation-webhook image is missing, the CSI image load job is executed.

The image is not actually part of that job and is never pushed. The verification therefore keeps failing and keeps triggering the image load, which causes the high disk write utilisation.
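
To see at a glance which images the script keeps reporting, the log can be summarised. The sketch below assumes the log uses exactly the "Missing images detected:" header followed by "- image" lines, as in the excerpt above. The function name count_missing_images is hypothetical.

```shell
# Hypothetical helper: count how often each image is reported as missing.
# Reads log text on stdin; assumes the "Missing images detected:" block format.
count_missing_images() {
    awk '
        /^Missing images detected:/ { in_block = 1; next }
        in_block && /^- /           { count[substr($0, 3)]++; next }
                                    { in_block = 0 }
        END { for (img in count) print count[img], img }
    ' | sort -k1,1nr -k2
}

# Usage on a worker node:
# count_missing_images < /var/vcap/sys/log/disk-pressure-watch/disk-pressure-watch.stdout.log
```

An image whose count keeps rising between two runs of the helper is still being detected as missing and is a candidate for the tagging mitigation.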

Environment

Affected versions

TKGI 1.23.3

TKGI 1.24.2

 

Cause

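In the affected versions, the disk-pressure-watch script checks for images that are no longer shipped, such as snapshot-validation-webhook, which is no longer part of the CSI job. Because the check can never succeed, the script keeps reloading the related images.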

Resolution

This has been fixed in the following versions:

 

Mitigation 

Tag images on the worker nodes

The most common case involves snapshot-validation-webhook. However, if vROps is enabled, other jobs can also trigger an image reload. The example below covers this specific image, but the same approach can be used for other images if needed. The change is persistent: if the VM is recreated, the tag remains.

The following command runs on all worker nodes and tags the pause image (which is pinned and will not be removed) with an additional name:

bosh -d service-instance_ID ssh worker -c 'sudo /var/vcap/packages/containerd/bin/ctr -n k8s.io images tag projects.registry.vmware.com/tkg/pause:3.10 registry.k8s.io/sig-storage/snapshot-validation-webhook:latest'

As a result, the script no longer reports the image as missing:

Missing images detected:
- cnabu-docker-local.artifactory.eng.vmware.com/pks/cadvisor
- docker.io/antrea/antrea-advanced-agent-debian
- docker.io/antrea/antrea-advanced-controller-debian
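
To confirm that the tag is present on a node, the image references known to containerd can be listed and searched. This is a sketch only: it assumes ctr images ls -q prints one reference per line, and the helper name has_image_ref is hypothetical.

```shell
# Hypothetical helper: succeed if the exact image reference appears on stdin
# (one reference per line, fixed-string whole-line match).
has_image_ref() {
    grep -qxF -- "$1"
}

# Usage on a worker node (same ctr path as in the tag command):
# sudo /var/vcap/packages/containerd/bin/ctr -n k8s.io images ls -q \
#     | has_image_ref registry.k8s.io/sig-storage/snapshot-validation-webhook:latest \
#     && echo "tag present" || echo "tag missing"
```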

If the disk-pressure logs still show repeated messages like the one below, engage with support for further assistance:

[Wed Apr 15 07:14:53 UTC 2026] Loading cached container: /var/vcap/packages/.......
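
A quick way to judge whether such messages keep repeating is to count them per cached container path. The sketch below assumes each message has the "[timestamp] Loading cached container: path" form shown above and that it is written to the disk-pressure-watch stdout log. The name count_cached_loads is hypothetical.

```shell
# Hypothetical helper: count "Loading cached container" messages per path.
# Reads log text on stdin; assumes "[timestamp] Loading cached container: <path>".
count_cached_loads() {
    sed -n 's/^\[[^]]*\] Loading cached container: //p' \
        | awk '{ count[$0]++ } END { for (p in count) print count[p], p }' \
        | sort -k1,1nr -k2
}

# Usage on a worker node (log path is an assumption):
# count_cached_loads < /var/vcap/sys/log/disk-pressure-watch/disk-pressure-watch.stdout.log
```

Running the helper twice a few minutes apart shows whether the counts are still increasing, which is useful information to include in a support request.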

To revert the change:

bosh -d service-instance_ID ssh worker -c 'sudo /var/vcap/packages/containerd/bin/ctr -n k8s.io images rm registry.k8s.io/sig-storage/snapshot-validation-webhook:latest'

The tagged image reference is removed and the reload of all images starts again.

Optional step (stopgap measure): the service can be stopped to avoid the disk write utilisation with:

bosh -d service-instance_ID ssh worker -c 'sudo monit stop disk-pressure-watch'

To start the service again:

bosh -d service-instance_ID ssh worker -c 'sudo monit start disk-pressure-watch'