Clean up BOSH director VM with high disk usage due to very large database files and tasks debug logs

Article ID: 293686

Products

Operations Manager

Issue/Introduction

Two major causes could result in high disk usage on the BOSH director VM:
 

1. BOSH director clients such as Prometheus or custom scripts can generate new tasks faster than the BOSH director removes old ones. As a result, the tasks table in the BOSH database (PostgreSQL) keeps growing. In turn, the larger table slows the director down, which lets even more tasks accumulate. Recent BOSH releases include improvements; the major fix is available in BOSH v271, which ships with Ops Manager v2.10.4.
 

2. When the BOSH director removes data from the Postgres database, Postgres does not return the disk space to the operating system. The autovacuum daemon removes dead row versions in tables and indexes and marks the space as available for future reuse, but it does not shrink the files. As a result, the database files can keep growing beyond what is needed because of additional index and TOAST storage. Please refer to the Postgres document Routine Vacuuming for further details.

On the BOSH director VM, three directories on the persistent disk typically consume the most space:
 

  • /var/vcap/store/blobstore: BOSH blobstore (in case BOSH director VM internal blobstore is used)
  • /var/vcap/store/director:  debug logs 
  • /var/vcap/store/postgres-xx: database files (in case BOSH director VM internal database is used)
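
To see which of these directories is the largest, a small helper along the following lines may be useful. It is a minimal sketch, not part of the product: the function name store_usage is hypothetical, and the postgres-* glob is only a stand-in for the versioned postgres-xx directory on your director.

```shell
# Print the size in KiB of the three directories under a store root
# (default: /var/vcap/store), largest first.
# The "postgres-*" glob stands in for the versioned postgres-xx directory.
store_usage() {
  root="${1:-/var/vcap/store}"
  for dir in "$root"/blobstore "$root"/director "$root"/postgres-*; do
    # Skip entries that do not exist (for example, when an external
    # blobstore or an external database is used).
    [ -d "$dir" ] || continue
    du -sk "$dir"
  done | sort -rn
}
```

Calling store_usage with no argument on the director VM reports on /var/vcap/store; passing another path is mainly useful for trying the function out elsewhere.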


Environment

Product Version: 2.7

Resolution

Important note: A single operational mistake can result in an unrecoverable situation, and it is highly recommended that you engage Tanzu support before attempting this procedure.

To prevent BOSH director VM from running out of disk space or to resolve any performance or disk issues already occurring, we recommend:
 

1. Reduce the workload placed on the BOSH director. Prometheus and some scripts may send more API requests to the BOSH director than necessary. Please review these clients and reduce the request rate to what is actually needed, especially for `bosh vms` requests against a foundation with many deployments and VMs.


2. Monitor the BOSH director VM resource usage (all of the commands below can be run as the vcap user).

  1. Check whether disk space is running out with: df -h
  2. Check the size of the database files, task debug logs, and director blobstore with: du -h -d1 /var/vcap/store
  3. Count the tasks of each type: /var/vcap/packages/postgres-10/bin/psql -h 127.0.0.1 -p 5432 bosh -c "SELECT type, COUNT(*) FROM tasks GROUP BY type;". There should not be more than 2,000 tasks of each type. However, due to the problems above, a considerable number of “vms” tasks could be left in the table.
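
If these checks are run regularly, the output can be filtered so that only problems are shown. The following is a sketch under stated assumptions: the function names are hypothetical, the first function expects the output of df -P on stdin, and the second expects psql output in unaligned, tuples-only form (psql options -A and -t), that is, one type|count pair per line.

```shell
# Read `df -P <path>` output on stdin and print the "Capacity" column
# of the first data row as a bare number (for example 95).
use_pct_from_df() {
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Read "type|count" lines on stdin and print every task type whose
# count exceeds the limit (default 2000, the figure given above).
tasks_over_limit() {
  limit="${1:-2000}"
  awk -F'|' -v limit="$limit" '$2 + 0 > limit { print $1 ": " $2 }'
}
```

On the director this could be used as: df -P /var/vcap/store | use_pct_from_df, and /var/vcap/packages/postgres-10/bin/psql -h 127.0.0.1 -p 5432 bosh -At -c "SELECT type, COUNT(*) FROM tasks GROUP BY type;" | tasks_over_limit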


3. If an unusual number of tasks is confirmed in the database or under the directory /var/vcap/store/director/tasks:

  1. If the task count is small and the BOSH director VM is not under high resource use, the KB How to clean up stale BOSH tasks history from director console includes a script that cleans up old tasks from both the database and the disk.
  2. If there is an enormous number of tasks (for example >1M), you will have to clean up unused data and files manually with the following steps.
  3. Tasks of type “vms” are not critical, so it is safe to delete all of them from the database with: /var/vcap/packages/postgres-10/bin/psql -h 127.0.0.1 -p 5432 bosh -c "DELETE FROM tasks WHERE type='vms';" Warning: This is a database deletion operation; we recommend executing the SQL statement very carefully.
  4. The /var/vcap/store/director/tasks/ directory stores task debug logs, which can be retrieved with: bosh task ID --debug. To reclaim disk space, you can remove them from the /var/vcap/store/director/tasks/ directory (optionally after copying them to a remote server with scp). Warning: This is a file deletion operation; we recommend removing files very carefully. Any database files that are removed incorrectly could result in an unrecoverable situation.
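
Before removing any debug logs, it can help to list the candidates first and review them. The sketch below only prints directories and deletes nothing. It assumes that each task keeps its logs in its own subdirectory of the tasks directory; the function name and the 30-day default cutoff are illustrative choices, not values from this article.

```shell
# List task log directories directly under a tasks root that were last
# modified more than N days ago. Nothing is deleted; the output is only
# a list of candidates to review.
old_task_dirs() {
  tasks_root="${1:-/var/vcap/store/director/tasks}"
  days="${2:-30}"
  find "$tasks_root" -mindepth 1 -maxdepth 1 -type d -mtime +"$days"
}
```

For example, old_task_dirs /var/vcap/store/director/tasks 90 prints the directories untouched for more than 90 days. The reviewed list can then be copied to a remote server or removed as described in the steps above.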


4. Usually the database consumes only a few GB of disk space. If the disk usage of /var/vcap/store/postgres-xx reaches 50GB or more, we recommend reclaiming disk space with VACUUM FULL. The command requires an exclusive lock on the table it is working on, so run VACUUM FULL in a time window when the BOSH director is not under high load. VACUUM FULL also temporarily needs additional space for the new, compacted copy of the table. If the persistent disk is already 100% full, first free some space (for example 10GB) by moving some debug logs out of /var/vcap/store/director/tasks.

  1. To verify whether the database files use far more space than needed, run the following SQL query, which lists the 10 largest tables:
    bosh=# select schemaname as table_schema,
        relname as table_name,
        pg_size_pretty(pg_total_relation_size(relid)) as total_size,
        pg_size_pretty(pg_table_size(relid)) as table_size,
        pg_size_pretty(pg_relation_size(relid)) as data_size,
        pg_size_pretty(pg_indexes_size(relid)) as index_size
    from pg_catalog.pg_statio_user_tables
    order by pg_total_relation_size(relid) desc limit 10;
     table_schema |      table_name       | total_size | table_size | data_size  | index_size
    --------------+-----------------------+------------+------------+------------+------------
     public       | tasks                 | 1409 MB    | 1388 MB    | 13 MB      | 21 MB
     public       | templates             | 544 kB     | 416 kB     | 184 kB     | 128 kB
     public       | instances             | 480 kB     | 416 kB     | 8192 bytes | 64 kB
     public       | events                | 360 kB     | 248 kB     | 240 kB     | 112 kB
  2. If index_size is far bigger than data_size, REINDEX TABLE <TABLE_NAME> can release the space used by the indexes.
  3. If total_size is far bigger than data_size while index_size is small, as in the above example, VACUUM FULL can reclaim the disk space used for TOAST storage.
  4. We recommend running VACUUM FULL on a regular basis when TOAST storage consumes a large amount of disk space.
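
The decision rule in the steps above can be written down as a small function. This is only a sketch: the function name is hypothetical, and "far bigger" is interpreted as at least 10 times data_size, which is an illustrative threshold rather than a value defined by this article or by Postgres. The three sizes must be given in the same unit.

```shell
# Suggest a maintenance action for one table from three sizes given in
# the same unit: total_size, data_size, index_size.
# "Far bigger" is taken to mean more than `factor` times data_size
# (default 10); this threshold is an assumption for illustration.
suggest_action() {
  total="$1"; data="$2"; index="$3"; factor="${4:-10}"
  if [ "$index" -gt $((data * factor)) ]; then
    echo "REINDEX TABLE"
  elif [ "$total" -gt $((data * factor)) ]; then
    echo "VACUUM FULL"
  else
    echo "none"
  fi
}
```

With the tasks row from the example output (sizes in MB), suggest_action 1409 13 21 prints VACUUM FULL. Very small tables can show a large ratio without wasting meaningful space, so the result is best applied only to tables that are large in absolute terms.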


5. Finally, verify that autovacuum is functioning correctly with the SQL query below. If last_autovacuum is empty for all tables or the timestamps are very old, we recommend running `monit restart postgres` to recover autovacuum.

bosh=# select relname, last_autovacuum from pg_stat_user_tables;
   relname | last_autovacuum
  ---------+-----------------
   tasks   |
  ...
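
To avoid scanning the output by eye, the check can be scripted. The following is a minimal sketch with a hypothetical function name. It assumes the query is run with the psql options -A and -t (unaligned, tuples-only output) and selects relname first, so that each line has the form relname|last_autovacuum.

```shell
# Read "relname|last_autovacuum" lines on stdin and print the names of
# tables whose last_autovacuum field is empty, meaning autovacuum has
# not run on them since statistics were last reset.
never_autovacuumed() {
  awk -F'|' '$1 != "" && $2 == "" { print $1 }'
}
```

For example: /var/vcap/packages/postgres-10/bin/psql -h 127.0.0.1 -p 5432 bosh -At -c "SELECT relname, last_autovacuum FROM pg_stat_user_tables;" | never_autovacuumed. A few quiet tables appearing in the list is not necessarily a problem; the condition described above is that every table appears, or that the timestamps are very old.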