A critical alert (CPU Load >90% for 10+ minutes) triggered on a single Diego Cell.
The clamscan.log file recorded hundreds of filesystem errors similar to the following:
/var/vcap/data/grootfs/store/unprivileged/images/.../diff/home/vcap/app/BOOT-INF/lib/...: File path check failure: No such file or directory. ERROR
/var/vcap/data/grootfs/store/unprivileged/images/.../diff/home/vcap/app/BOOT-INF/lib/...: Can't open file or directory ERROR
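To gauge how much of the log noise comes from GrootFS paths, the error lines can be tallied with a short script. The following is a minimal sketch, not part of the product: the line shape ("<path>: <message> ERROR") is inferred only from the two samples above, and the function name summarize_errors is hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the two sample lines above:
#   "<path>: <message> ERROR"  (the message may end with a period)
ERROR_LINE = re.compile(r"^(?P<path>/\S.*?): (?P<msg>.+?)\.? ERROR$")
GROOTFS_PREFIX = "/var/vcap/data/grootfs/"


def summarize_errors(lines):
    """Count ERROR lines per (message, path-is-under-GrootFS) pair."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if not match:
            continue  # not an error line in the assumed format
        in_grootfs = match.group("path").startswith(GROOTFS_PREFIX)
        counts[(match.group("msg"), in_grootfs)] += 1
    return counts
```

Passing an open file handle for clamscan.log to summarize_errors returns the counts. If nearly all entries have the GrootFS flag set to True, the errors match the pattern described in this article.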
Product: Tanzu Application Service (TAS) / Operations Manager
Impacted Service: Container Filesystem (GrootFS)
Security Software: Anti-Virus for VMware Tanzu (ClamAV)
A race condition occurs between the statically scheduled ClamAV scan and the dynamic lifecycle of application containers managed by GrootFS.
When ClamAV runs a scheduled scan, it indexes and then attempts to inspect files inside the ephemeral container filesystem layers under /var/vcap/data/grootfs/store/unprivileged/images/. Because container instances on Diego Cells are frequently started, rescaled, or garbage-collected by the platform, these layers can be torn down while ClamAV is still reading them. ClamAV then repeatedly fails to open files that existed moments earlier. The result is a stream of non-fatal path errors in the log, added filesystem-crawling overhead, and a CPU spike that can last long enough to trigger the high-CPU alert.
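The enumerate-then-open pattern behind these errors can be reproduced in a few lines. This is a simplified model under stated assumptions, not a description of ClamAV internals: the function names and the temporary directory layout are hypothetical, and the teardown callback stands in for GrootFS removing an image layer mid-scan.

```python
import os
import shutil
import tempfile


def scan_with_teardown(root, teardown):
    """List every file under root, run teardown(), then try to open each file."""
    paths = [os.path.join(d, name)
             for d, _, names in os.walk(root) for name in names]
    teardown()  # stands in for the platform deleting a layer during the scan
    opened = failed = 0
    for path in paths:
        try:
            with open(path, "rb") as fh:
                fh.read(1)
            opened += 1
        except FileNotFoundError:
            # Comparable to the "No such file or directory" errors in the log
            failed += 1
    return opened, failed


def demo():
    """Build two fake layers, remove one between listing and opening."""
    with tempfile.TemporaryDirectory() as root:
        for layer in ("kept", "removed"):
            os.makedirs(os.path.join(root, layer))
            for i in range(3):
                with open(os.path.join(root, layer, f"lib{i}.jar"), "wb") as fh:
                    fh.write(b"x")
        gone = os.path.join(root, "removed")
        return scan_with_teardown(root, lambda: shutil.rmtree(gone))
```

In demo(), every file in the removed layer fails to open even though it was present when the listing was taken. A scanner hits the same window whenever a container is destroyed during a scan.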
To eliminate the CPU performance bottleneck, configure ClamAV to exclude the dynamic GrootFS store directory. Container images should instead be scanned upstream during the pipeline build phase or via secure base buildpacks.
Log in to the Ops Manager Installation Dashboard.
Click on the Anti-Virus for VMware Tanzu (Compliance Scanner) tile.
Navigate to the Scan Configuration section.
Locate the Exclusions / Paths to Exclude field (Ignore directories setting).
Add the following path to the exclusion list: /var/vcap/data/grootfs/
Click Save.
Return to the Ops Manager Dashboard, click Review Pending Changes, ensure the Anti-Virus tile is selected, and click Apply Changes to roll out the updated configuration across all Diego Cells.
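Before applying the change, it can help to confirm that the paths seen in clamscan.log actually fall under the excluded directory. The sketch below assumes plain directory-prefix semantics. How the tile itself matches exclusions (prefix, glob, or regular expression) is not stated here, and the helper name is_excluded is hypothetical.

```python
from pathlib import PurePosixPath

# The exclusion path from the steps above (trailing slash is normalized away)
EXCLUDED = PurePosixPath("/var/vcap/data/grootfs/")


def is_excluded(path, excluded=EXCLUDED):
    """True if path is the excluded directory itself or lies beneath it."""
    candidate = PurePosixPath(path)
    return candidate == excluded or excluded in candidate.parents
```

Comparing whole path components rather than raw string prefixes avoids accidentally matching a sibling directory such as /var/vcap/data/grootfs-backup.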
This issue manifests as short-lived or sustained CPU peaks that typically resolve on their own once the filesystem traversal completes or the scan finishes. A bosh recreate or manual process remediation is therefore rarely required unless the cell becomes completely unresponsive.