The disk-pressure-watch script attempts to reload images when it detects "missing" components. In TKGI versions 1.23.3 and 1.24.2, it looks for deprecated images such as snapshot-validation-webhook, which is no longer part of the CSI job. Because that image can never be found, the check fails every time and triggers a reload of all related CSI images in a continuous loop.
To confirm the issue, check on a worker node how much data each process has written.
In the output below, watcher.sh wrote nearly 65 GB, which (given that the worker has been up for about 5 hours) is roughly 12 to 13 GB per hour.
worker:~# printf "%-10s %-15s %s\n" "PID" "WRITTEN_MB" "COMMAND"
# Match only the write_bytes counter (not cancelled_write_bytes) and keep the 10 largest writers
grep -H "^write_bytes" /proc/[0-9]*/io 2>/dev/null | sort -t: -k3 -n | tail -n 10 | while read -r line; do
    pid=$(echo "$line" | cut -d'/' -f3)          # path is /proc/<PID>/io
    bytes=$(echo "$line" | awk '{print $NF}')    # last field is the byte count
    mb=$((bytes / 1024 / 1024))
    cmd=$(ps -p "$pid" -o comm= 2>/dev/null || echo "unknown")
    printf "%-10s %-15s %s\n" "$pid" "${mb}MB" "$cmd"
done
PID        WRITTEN_MB      COMMAND
......
703        1505MB          bosh-agent
8926       64724MB         watcher.sh
8779       67390MB         containerd
Also, on the worker nodes, the following output can be found in /var/vcap/sys/log/disk-pressure-watch/disk-pressure-watch.stdout.log:
Missing images detected:
- registry.k8s.io/sig-storage/snapshot-validation-webhook
- cnabu-docker-local.artifactory.eng.vmware.com/pks/cadvisor
- docker.io/antrea/antrea-advanced-agent-debian
- docker.io/antrea/antrea-advanced-controller-debian
Each of these images triggers a different job to load an image. See KB 382757.
For example, if the snapshot-validation-webhook image is missing, the CSI job for image load is executed.
The image is not actually part of that job and is never pushed, but the verification still triggers the image load each time, which causes the high disk write utilisation.
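The log output above can be summarised to see how often the watcher reports missing images and which images recur. The following is a minimal sketch, assuming the log keeps the "Missing images detected:" header followed by "- image" lines as shown; the function name summarise_missing_images is hypothetical, and the log path is the one quoted in this article.

```shell
# Sketch only: summarise "Missing images detected" reports in the watcher log.
# LOG is the path quoted in this article; the function name is hypothetical.
LOG=/var/vcap/sys/log/disk-pressure-watch/disk-pressure-watch.stdout.log

summarise_missing_images() {
    # $1: path to the log file
    # Prints the number of reports, then each listed image with how often it appeared.
    awk '
        /Missing images detected:/ { reports++; next }
        $1 == "-"                  { seen[$2]++ }
        END {
            printf "reports: %d\n", reports
            for (img in seen) printf "%d %s\n", seen[img], img
        }
    ' "$1"
}

# Usage on a worker node (as root):
#   summarise_missing_images "$LOG"
```

A report count that keeps growing, with the same images listed each time, is consistent with the reload loop described here.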
Affected versions
TKGI 1.23.3
TKGI 1.24.2
An issue has been found in the disk-pressure-watch script, causing this problem.
This has been fixed in the following versions:
Mitigation
Tag Images on the worker nodes
The most common case involves snapshot-validation-webhook. However, if vROps is enabled, other jobs can also trigger an image reload. The example below uses this specific image, but the same approach can be applied to other images if needed. The change is persistent: the tag remains even if the VM is recreated.
The following command runs on all worker nodes and tags the pause image (which is pinned and will not be removed) with the name of the missing image:
bosh -d service-instance_ID ssh worker -c 'sudo /var/vcap/packages/containerd/bin/ctr -n k8s.io images tag projects.registry.vmware.com/tkg/pause:3.10 registry.k8s.io/sig-storage/snapshot-validation-webhook:latest'
As a result, the script no longer reports the image as missing:
Missing images detected:
- cnabu-docker-local.artifactory.eng.vmware.com/pks/cadvisor
- docker.io/antrea/antrea-advanced-agent-debian
- docker.io/antrea/antrea-advanced-controller-debian
If the disk-pressure logs still show repeated messages like the one below, engage with support for further assistance:
[Wed Apr 15 07:14:53 UTC 2026] Loading cached container: /var/vcap/packages/.......
To revert the change:
bosh -d service-instance_ID ssh worker -c 'sudo /var/vcap/packages/containerd/bin/ctr -n k8s.io images rm registry.k8s.io/sig-storage/snapshot-validation-webhook:latest'
The tag is removed and the reload of all images starts again.
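Whether the extra tag is currently present on a node can be checked by filtering the containerd image list. The following is a minimal sketch, assuming the first column of the `ctr images ls` output is the image reference; has_image_ref is a hypothetical helper name, not part of TKGI or containerd.

```shell
# Sketch only: check whether an image reference appears in `ctr images ls` output.
# has_image_ref is a hypothetical helper; it reads the image listing on stdin.
has_image_ref() {
    # $1: full image reference to look for
    # Exit status 0 if a row whose first column equals the reference exists, 1 otherwise.
    awk -v ref="$1" '$1 == ref { found = 1 } END { exit !found }'
}

# Usage on a worker node (assumes the ctr path used elsewhere in this article):
#   sudo /var/vcap/packages/containerd/bin/ctr -n k8s.io images ls \
#     | has_image_ref registry.k8s.io/sig-storage/snapshot-validation-webhook:latest \
#     && echo "tag present" || echo "tag absent"
```

Comparing the whole first column avoids a false match on a longer reference that merely contains the same name.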
Optional step (stop-the-bleed measure)
The service can be stopped to avoid the disk write utilisation:
bosh -d service-instance_ID ssh worker -c 'sudo monit stop disk-pressure-watch'
To start the service again:
bosh -d service-instance_ID ssh worker -c 'sudo monit start disk-pressure-watch'