When using Healthwatch, checking the disk space for the TSDB VMs shows 100% (or extremely high) disk usage.
Inspecting per-directory disk usage on the TSDB VM shows that the directory /var/vcap/store/prometheus/wal/ is using the majority of the disk space.
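One way to see which directory is consuming the space is sketched here, assuming the standard df and du utilities are available on the TSDB VM. The helper name report_usage is hypothetical, and the default path is the Prometheus data directory named in this article.

```shell
# Sketch: report disk usage for the Prometheus data directory.
# report_usage is an illustrative helper, not part of Healthwatch.
report_usage() {
  local prom_dir="${1:-/var/vcap/store/prometheus}"
  # Overall usage of the file system that holds the directory.
  df -h "$prom_dir"
  # Size in KiB of each immediate subdirectory, sorted ascending by size.
  du -k -d 1 "$prom_dir" | sort -n
}
```

Running report_usage with no arguments on the TSDB VM (root access may be needed) prints the file system usage followed by the per-directory sizes. If the wal entry is close to the total, the WAL is the main consumer.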
Healthwatch
In Prometheus, the Write-Ahead Log (WAL) acts as temporary storage for incoming data. When Prometheus receives new metrics, it does not immediately write them to the permanent data blocks on disk. Instead, it follows this process:
It records the incoming data in the WAL files.
It stores the data in RAM for quick querying.
After some time, Prometheus "compacts" the data from the WAL files into a permanent, read-only block.
The fact that the WAL files continue to grow suggests that compaction is not happening.
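A rough way to check this is to compare the number of WAL segment files with the number of persisted blocks. The sketch below assumes the usual Prometheus on-disk layout, in which WAL segments are regular files directly under wal/ and each persisted block is a directory containing a meta.json file. The helper name wal_summary is hypothetical.

```shell
# Sketch: count WAL segments and persisted blocks in a Prometheus data directory.
# wal_summary is an illustrative helper, not part of Prometheus or Healthwatch.
wal_summary() {
  local data_dir="${1:-/var/vcap/store/prometheus}"
  local segments blocks
  # WAL segments: regular files directly under wal/ (checkpoint directories are skipped).
  segments="$(find "$data_dir/wal" -maxdepth 1 -type f | wc -l)"
  # Persisted blocks: directories one level down that contain a meta.json file.
  blocks="$(find "$data_dir" -mindepth 2 -maxdepth 2 -name meta.json | wc -l)"
  echo "wal_segments=$((segments)) blocks=$((blocks))"
}
```

Running wal_summary a few hours apart gives two readings to compare. A segment count that keeps climbing while the block count stays flat is consistent with compaction not running, although it does not by itself explain why.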
To resolve this, there are a few options:
monit restart prometheus -> This may trigger Prometheus to compact the WAL files. However, because the disk is full, the process may fail to start.
If possible, temporarily increase the disk to a larger size. This will get the process back to a running state. To increase the disk size, go to Ops Manager -> Healthwatch tile -> Resource Config.
As a last resort, if neither of the above is possible, delete the files (not the folders) in the WAL directory. This may lead to some data loss, because the WAL files hold data that has not yet been written to a permanent block.
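The last option can be sketched as follows, assuming a find implementation that supports -maxdepth and -delete (GNU findutils does). The helper name clean_wal_files and its --delete flag are hypothetical. By default it only lists what would be removed. Stopping the process first (for example with monit stop prometheus) is an assumption about safe practice rather than something this article prescribes.

```shell
# Sketch: remove regular files directly under the WAL directory, leaving folders intact.
# clean_wal_files and its --delete flag are illustrative, not part of Healthwatch.
clean_wal_files() {
  local wal_dir="${1:-/var/vcap/store/prometheus/wal}"
  local mode="${2:-}"
  if [ "$mode" = "--delete" ]; then
    # Delete only files at the top level; subdirectories and their contents are kept.
    find "$wal_dir" -maxdepth 1 -type f -delete
  else
    # Dry run: print the files that would be deleted.
    find "$wal_dir" -maxdepth 1 -type f -print
  fi
}
```

Calling clean_wal_files with no arguments prints the candidate files. Calling it with the WAL path followed by --delete removes them, which discards any samples that exist only in those files.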
To investigate why the WAL files are not being compacted, please raise a case with Broadcom Support.