Two major causes could result in high disk usage on the BOSH director VM:
1. BOSH director clients such as Prometheus and/or scripts can generate new tasks faster than the BOSH director removes old ones. As a result, the tasks table in the BOSH database (PostgreSQL) keeps growing. This in turn slows the director down further, so even more tasks accumulate in the table. Recent BOSH releases include improvements; the major fix is available in BOSH v271, which ships with Ops Manager v2.10.4.
2. When the BOSH director removes data from the Postgres database, Postgres does not reclaim the disk space. The autovacuum daemon removes dead row versions in tables and indexes and marks the space as available for future reuse, but it does not return that space to the operating system. As a result, the database files can keep growing beyond what is needed, due to additional index and TOAST storage. Please refer to the Postgres document Routine Vacuuming for further details.
On BOSH director VM, 3 major directories would consume high disk space in persistent disk:
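To see which directories actually dominate the persistent disk on a given director, a minimal shell sketch such as the following can be used. It assumes the persistent disk is mounted at /var/vcap/store (the parent of the paths named in this article). The function names `disk_used_pct` and `largest_children` are illustrative helpers, not part of BOSH.

```shell
# Print the used percentage (digits only) of the filesystem holding "$1".
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print the largest direct children of directory "$1" as "<size in KB> <path>",
# biggest first. "$2" is how many entries to show (default: 3).
largest_children() {
  find "$1" -mindepth 1 -maxdepth 1 -exec du -sk {} + 2>/dev/null \
    | sort -rn \
    | head -n "${2:-3}"
}

# Example calls (the mount point is an assumption, see above):
#   disk_used_pct /var/vcap/store
#   largest_children /var/vcap/store 3
```

Both helpers only read filesystem metadata, so they are safe to run while the director is serving requests.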
Important note: A single operational mistake can result in an unrecoverable situation. It is highly recommended that you engage Tanzu support before attempting this procedure.
To prevent BOSH director VM from running out of disk space or to resolve any performance or disk issues already occurring, we recommend:
1. Reduce the workload placed on the BOSH director. Prometheus and some scripts may send more API requests to the BOSH director than necessary. Please review these requests and reduce them to the necessary level, especially `bosh vms` requests against a foundation with many deployments and VMs.
2. Keep monitoring BOSH director VM resource usage (all of the commands below can be executed as the vcap user).
3. If an unusual number of tasks are confirmed in the database or under the directory /var/vcap/store/director/tasks:
4. The database usually consumes only a few GB of disk space. If the disk usage of /var/vcap/store/postgres-xx reaches 50 GB or more, we recommend reclaiming disk space with VACUUM FULL. The command requires an exclusive lock on the table it is working on, so run VACUUM FULL in a time window when the BOSH director is not serving a high load. VACUUM FULL also temporarily requires additional space for the new, shrunken copy of the table. If the persistent disk is already 100% full, you will first have to free some space (for example 10 GB) by moving some debug logs out of /var/vcap/store/director/tasks.
bosh=# select schemaname as table_schema, relname as table_name,
         pg_size_pretty(pg_total_relation_size(relid)) as total_size,
         pg_size_pretty(pg_table_size(relid)) as table_size,
         pg_size_pretty(pg_relation_size(relid)) as data_size,
         pg_size_pretty(pg_indexes_size(relid)) as index_size
       from pg_catalog.pg_statio_user_tables
       order by pg_total_relation_size(relid) desc limit 10;
 table_schema |      table_name       | total_size | table_size | data_size  | index_size
--------------+-----------------------+------------+------------+------------+------------
 public       | tasks                 | 1409 MB    | 1388 MB    | 13 MB      | 21 MB
 public       | templates             | 544 kB     | 416 kB     | 184 kB     | 128 kB
 public       | instances             | 480 kB     | 416 kB     | 8192 bytes | 64 kB
 public       | events                | 360 kB     | 248 kB     | 240 kB     | 112 kB
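Because VACUUM FULL needs temporary room for the rewritten table, it can help to check free space before starting. The sketch below is one hedged way to do that. The helper name `has_free_kb` is hypothetical, and the 10 GB threshold is only the example figure from the text, not a computed requirement for any particular table.

```shell
# Return 0 if the filesystem holding "$1" has at least "$2" KB available,
# non-zero otherwise.
has_free_kb() {
  avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge "$2" ]
}

# 10 GB expressed in KB (example headroom taken from the text above).
REQUIRED_KB=$((10 * 1024 * 1024))

# Example call (the mount point /var/vcap/store is an assumption):
#   if has_free_kb /var/vcap/store "$REQUIRED_KB"; then
#     echo "enough headroom for VACUUM FULL"
#   else
#     echo "free up space first"
#   fi
```

Once enough headroom is confirmed, the statement itself is standard PostgreSQL, for example `VACUUM FULL tasks;` for the table that dominates the size listing, run from the same psql session during a quiet window.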
5. Finally, verify that autovacuum is functioning correctly, using the SQL query below. If last_autovacuum is empty for all tables, or the timestamps are very old, we recommend running `monit restart postgres` to recover autovacuum.
bosh=# select relname, last_autovacuum from pg_stat_user_tables;
 relname | last_autovacuum
---------+-----------------
 tasks   |
 ...
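When there are many tables, scanning the output by eye is error-prone. As a rough sketch, the filter below reads `relname|last_autovacuum` lines, as produced by psql with the `-A` (unaligned) and `-t` (tuples only) options, and prints the tables that have no recorded autovacuum run. The function name is illustrative and the psql connection options are deliberately left out, since they depend on the director's configuration. It does not judge whether a timestamp is too old.

```shell
# Read "relname|last_autovacuum" lines on stdin and print the names of tables
# whose last_autovacuum field is empty.
tables_never_autovacuumed() {
  awk -F'|' '$1 != "" && $2 == "" { print $1 }'
}

# Example pipeline (connection options omitted on purpose):
#   psql <connection options> -At \
#     -c "select relname, last_autovacuum from pg_stat_user_tables" \
#     | tables_never_autovacuumed
```

If every table is listed, that matches the situation in which the text recommends `monit restart postgres`.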