Diego cells may enter a failing state when bosh-dns on the cell stops resolving. The bosh-dns logs show errors like the following:
[ForwardHandler] 2025-12-24T18:45:21.794079093Z ERROR - error recursing for example.dev.customer.int. to "10.###.##.##:53": dial udp 10.###.##.##:53: connect: resource temporarily unavailable
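To confirm that resolution is broken on the cell itself, a quick check (a sketch; the hostname is the one from the error above, and the bosh-dns log path may vary by release version):

# Resolution through the system resolver (which points at bosh-dns) should fail or hang on the affected cell
getent hosts example.dev.customer.int

# Recursion errors in the bosh-dns logs
grep "error recursing" /var/vcap/sys/log/bosh-dns/bosh_dns.stdout.log | tail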
All DNS resolution is failing on this single diego_cell, including the lookups needed for droplet downloads. Because this diego_cell had a lot of available resources, it kept winning auctions, only to receive the LRP and then fail to download the droplet because it could not resolve the download URL:
2025-12-24T13:51:10.63+0000 [API/2] OUT App instance exited with guid 428c8e2c-############### payload: {"instance"=>"740f238c-####-####-####-####", "index"=>0, "cell_id"=>"dc46da2b-####-####-####-05eb91f37f3e", "reason"=>"CRASHED", "exit_description"=>"failed to download cached artifacts", "crash_count"=>1, "crash_timestamp"=>1766584270617715107, "version"=>"d95ddd42-d586-4571-b55d-412eedcfae11"}
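These failures also surface as crash events via the cf CLI (the app name is a placeholder):

cf events <app-name>

The app.crash events should show the same exit description, "failed to download cached artifacts".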
Upon investigation, all of the available ephemeral UDP ports on the cell had been consumed by grootfs.
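To gauge how close the cell is to port exhaustion, the ephemeral port range can be compared to the number of UDP sockets in use (a sketch using standard Linux tooling):

cat /proc/sys/net/ipv4/ip_local_port_range   # size of the ephemeral port range
ss -anu | wc -l                              # total UDP sockets currently open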
grootfs clean is invoked continuously, but is unable to obtain the global lock file:
/var/vcap/data/grootfs/store/unprivileged/locks/global-groot-lock.lock
Every time grootfs clean is invoked, a UDP connection between grootfs and loggr-udp-forwarder is opened. Because the lock is never released while clean keeps being invoked, new UDP connections build up. Eventually this exhausts the available ephemeral port range, and no other process, such as bosh-dns, can open UDP sockets.
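The leaked sockets are visible as grootfs UDP connections to the metron endpoint (127.0.0.1:3457, the endpoint passed to grootfs clean in the process listing further below); a hedged way to list a sample of them:

netstat -anup | grep grootfs | grep ":3457" | head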
The command below shows the number of sockets held by grootfs:
diego_cell/00727a09-###-###-###-#########:/var/vcap/sys/log/garden# netstat -anutlpe | grep grootfs | wc -l
9123
Listing the grootfs processes sorted by start time:
diego_cell/00727a09-###-###-###-#########:/var/vcap/sys/log/garden# ps -eo pid,lstart,cmd --sort=lstart | grep grootfs | head
3043090 Fri Dec 19 08:08:11 2025 /var/vcap/packages/grootfs/bin/grootfs --config /var/vcap/jobs/garden/config/grootfs_config.yml create /var/vcap/packages/cflinuxfs4/rootfs.tar 5b9d6754-16e0-4252-4ec1-c116-envoy-startup-healthcheck-0
3043136 Fri Dec 19 08:08:11 2025 /var/vcap/packages/grootfs/bin/grootfs --log-file /var/vcap/sys/log/garden/groot.clean.log --log-level info --log-timestamp-format rfc3339 --store /var/vcap/data/grootfs/store/unprivileged --metron-endpoint 127.0.0.1:3457 --tardis-bin /var/vcap/packages/grootfs/bin/tardis --newuidmap-bin /var/vcap/packages/garden-idmapper/bin/newuidmap --newgidmap-bin /var/vcap/packages/garden-idmapper/bin/newgidmap clean --threshold-bytes 30913658880
The first clean process is blocked, waiting on the file lock:
diego_cell/00727a09-###-###-###-#########:/var/vcap/sys/log/garden# strace -ff -p 3043136
strace: Process 3043136 attached with 11 threads
[pid 3857609] epoll_pwait(4, <unfinished ...>
[pid 3842104] futex(0xc000101948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3842102] futex(0xc000101148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3842100] futex(0xc0002ed948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3043184] futex(0xc000100948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3043153] futex(0x10a5138, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3043149] futex(0xc000100148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3043147] futex(0xc000075948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3043145] futex(0xc000075148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3043143] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3043136] flock(11, LOCK_EX <unfinished ...>
FD 11 is the global lock file:
diego_cell/00727a09-###-###-###-#########:/var/vcap/sys/log/garden# ls -l /proc/3043136/fd/11
l-wx------ 1 root root 64 Dec 29 15:03 /proc/3043136/fd/11 -> /var/vcap/data/grootfs/store/unprivileged/locks/global-groot-lock.lock
lsof of the lock file:
diego_cell/00727a09-###-###-###-#########:/var/vcap/sys/log/garden# lsof /var/vcap/data/grootfs/store/unprivileged/locks/global-groot-lock.lock
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
grootfs 3043090 root 6wR REG 7,0 0 201326722 /var/vcap/data/grootfs/store/unprivileged/locks/global-groot-lock.lock
grootfs 3043136 root 11w REG 7,0 0 201326722 /var/vcap/data/grootfs/store/unprivileged/locks/global-groot-lock.lock
PID 3043090 holds the lock:
diego_cell/00727a09-###-###-###-#########:/var/vcap/sys/log/garden# ps aux | grep 3043090
root 3043090 0.0 0.0 1905272 15084 ? Sl Dec19 0:41 /var/vcap/packages/grootfs/bin/grootfs --config /var/vcap/jobs/garden/config/grootfs_config.yml create /var/vcap/packages/cflinuxfs4/rootfs.tar 5b9d6754-16e0-4252-4ec1-c116-envoy-startup-healthcheck-0
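To clear the backlog, the stuck lock holder can be terminated (a sketch; use the PID identified above, and fall back to SIGKILL only if it does not exit):

kill 3043090        # graceful termination first
kill -9 3043090     # only if the process ignores SIGTERM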
Once that process is killed and has exited, the UDP ports start to be released:
diego_cell/00727a09-###-###-###-#########:/var/vcap/sys/log/garden# netstat -anutlpe | grep grootfs | wc -l
7594
diego_cell/00727a09-###-###-###-#########:/var/vcap/sys/log/garden# netstat -anutlpe | grep grootfs | wc -l
7548
diego_cell/00727a09-###-###-###-#########:/var/vcap/sys/log/garden# netstat -anutlpe | grep grootfs | wc -l
7115
diego_cell/00727a09-###-###-###-#########:/var/vcap/sys/log/garden# netstat -anutlpe | grep grootfs | wc -l
5447
diego_cell/00727a09-###-###-###-#########:/var/vcap/sys/log/garden# netstat -anutlpe | grep grootfs | wc -l
2851
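Instead of re-running netstat by hand, a simple loop can be used to watch the sockets drain (a sketch):

while true; do date; netstat -anup | grep grootfs | wc -l; sleep 10; done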
A grootfs clean process is kicked off every few minutes. Each clean claims the global lock, so only one clean runs at a time. A clean should take less than a minute, but here the process holding the lock never exited for days, so every few minutes another clean process started and queued up waiting for the lock. These waiting clean processes can consume excessive CPU and, in this case, UDP ports as well.
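The size of that backlog can be checked directly (a sketch, reusing the process listing from above):

ps -eo pid,lstart,cmd --sort=lstart | grep "grootfs.*clean" | grep -v grep | wc -l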
Recreating the VM resolves the issue as a temporary workaround. However, the issue is fixed in garden v1.78.0+, so the recommendation is to upgrade to a version of garden-runc-release that contains garden v1.78.0 or later:
https://github.com/cloudfoundry/garden-runc-release/releases/tag/v1.78.0
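For the temporary workaround, the affected cell can be recreated with BOSH (deployment name and instance ID below are placeholders):

bosh -d <cf-deployment> recreate diego_cell/<instance-id>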