BOSH Director keeps only the latest 2000 tasks for each task type (vms, ssh, snapshot_deployment, update_deployment, update_release, run_errand, and so on). However, due to race conditions and performance issues, older tasks are sometimes not cleaned up correctly.
As a result, the tasks table in the database and the disk usage under /var/vcap/store/director/tasks keep growing, which causes significant resource and performance issues on the BOSH Director VM.
BOSH Director release v271.7.0, which is included in Ops Manager v2.10.9, improved the way task cleanup happens. Since this release, scheduled task cleanups handle large numbers of tasks better. Previously, all tasks were loaded into memory before being destroyed, which could crash the job if there were too many tasks to clean up at once.
IMPORTANT: This solution is NOT applicable when it meets both conditions:
Bosh::Director::Models::Task.group_and_count(:type).all.map {|a| "type: #{a[:type]} count: #{a[:count]}"}
The above query will return task counts for each type:
[ "type: cck_apply count: 2", "type: cck_scan count: 3", "type: cck_scan_and_fix count: 22", "type: delete_artifacts count: 21", "type: delete_deployment count: 2001", "type: delete_stemcell count: 1", "type: fetch_logs count: 35", "type: run_errand count: 210", "type: scheduled_dns_blobs_cleanup count: 2001", "type: scheduled_events_cleanup count: 1475", "type: scheduled_orphaned_disk_cleanup count: 6", "type: scheduled_orphaned_vm_cleanup count: 9", "type: snapshot_deployment count: 1649", "type: snapshot_deployments count: 63", "type: snapshot_self count: 63", "type: ssh count: 1296", "type: update_deployment count: 2001", "type: update_release count: 580", "type: update_stemcell count: 4", "type: vms count: 25780" ]
In this sample output, the type `vms` has a count larger than the maximum limit of 2000. Those tasks were not cleaned up correctly by BOSH Director.
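To see at a glance which types exceed the limit, the counts can be compared against 2000. The following is a minimal sketch in plain Ruby that runs outside the Director console. The `counts` hash is a subset of the sample output above, copied by hand, and `LIMIT` and `excess_by_type` are illustrative names, not part of BOSH.

```ruby
# Illustrative only: plain Ruby, no BOSH classes required.
LIMIT = 2000 # retention limit per task type, as described in this article

# Subset of the sample output above, copied by hand.
counts = {
  'delete_deployment'   => 2001,
  'run_errand'          => 210,
  'snapshot_deployment' => 1649,
  'ssh'                 => 1296,
  'update_deployment'   => 2001,
  'vms'                 => 25780
}

# Returns { type => number of tasks above the limit }, largest excess first.
def excess_by_type(counts, limit)
  counts.select { |_type, count| count > limit }
        .map { |type, count| [type, count - limit] }
        .sort_by { |_type, excess| -excess }
        .to_h
end

excess = excess_by_type(counts, LIMIT)
excess.each { |type, n| puts "#{type}: #{n} over the limit" }
```

In this sample, `vms` is the clear outlier at 23780 tasks over the limit, while the types at 2001 are only one task over.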
For Ops Manager 2.10.0 or below (BOSH release v270.11 or below), to clean up these stale tasks manually, follow the directions below:
1. We suggest you stop any heavy loads against BOSH Director during the operation.
2. Older tasks can be removed from both the database and disk (debug logs) with the TaskRemover API. In this version, each call deletes a fixed number of 10 tasks.
3. Count how many times the deletion should be executed: (25780 - 2000)/10 = 2378.
4. Create a TaskRemover object - `tr=Bosh::Director::Api::TaskRemover.new(2000)`
5. Repeat the `remove` call 2378 times and keep the recent 2000 tasks only - `2378.times {tr.remove('vms')}`
6. Confirm the tasks are removed with:
- `Bosh::Director::Models::Task.group_and_count(:type).all.map {|a| "type: #{a[:type]} count: #{a[:count]}"}`
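The arithmetic in step 3 can be sketched as a small helper. This is plain Ruby that does not touch the Director. `KEEP`, `BATCH_SIZE` and `remove_calls_needed` are illustrative names, and the batch size of 10 is taken from the description above rather than read from BOSH.

```ruby
# Illustrative only: plain Ruby arithmetic, no BOSH classes required.
KEEP = 2000      # tasks to retain per type
BATCH_SIZE = 10  # tasks deleted per remove call on BOSH v270.11 or below (per this article)

# Number of remove calls needed to bring `total` down towards `keep`.
# Integer division rounds down, so up to BATCH_SIZE - 1 extra tasks may remain.
def remove_calls_needed(total, keep: KEEP, batch_size: BATCH_SIZE)
  excess = total - keep
  return 0 if excess <= 0

  excess / batch_size
end

puts remove_calls_needed(25780) # 2378
```

When the count is already at or below 2000 the helper returns 0, so no `remove` calls are made.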
For Ops Manager 2.10.4 ~ 2.10.8 (BOSH release v270.12 ~ v271.6), each `tr.remove()` call can delete more than 10 tasks, so the steps above can be simplified to:
1. Count how many tasks should be removed: 25780 - 2000 = 23780.
2. Create a TaskRemover object: `tr=Bosh::Director::Api::TaskRemover.new(2000)`
3. Remove 23780 tasks and keep the recent 2000 tasks only: `tr.remove('vms', 23780)`
4. Confirm the tasks are removed with:
- `Bosh::Director::Models::Task.group_and_count(:type).all.map {|a| "type: #{a[:type]} count: #{a[:count]}"}`
For Ops Manager 2.10.9 and above (BOSH release v271.7 and above), a single `tr.remove()` call deletes all tasks of a type except the most recent ones, where the number to keep is the value passed to `Bosh::Director::Api::TaskRemover.new()`. The count parameter has been removed from the method, so the steps can be simplified to:
1. Count how many tasks should be removed: 25780 - 2000 = 23780.
2. Create a TaskRemover object: `tr=Bosh::Director::Api::TaskRemover.new(2000)`
3. Remove 23780 tasks and keep the recent 2000 tasks only: `tr.remove('vms')`
4. Confirm the tasks are removed with:
- `Bosh::Director::Models::Task.group_and_count(:type).all.map {|a| "type: #{a[:type]} count: #{a[:count]}"}`
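The three procedures differ only in how `remove` is called, and the right one depends on the BOSH release. The following sketch maps a release version to the matching call style using the version ranges given in this article. `remove_style` and the returned symbols are hypothetical names for illustration. `Gem::Version` is the version comparison class that ships with Ruby.

```ruby
require 'rubygems' # provides Gem::Version; loaded by default in modern Ruby

# Illustrative helper (not part of BOSH): maps a BOSH release version to the
# form of the remove call described in this article.
def remove_style(bosh_release)
  v = Gem::Version.new(bosh_release.to_s.sub(/\Av/, ''))
  if v >= Gem::Version.new('271.7')
    :remove_all_excess     # tr.remove('vms')
  elsif v >= Gem::Version.new('270.12')
    :remove_with_count     # tr.remove('vms', count)
  else
    :remove_ten_per_call   # N.times { tr.remove('vms') }
  end
end

puts remove_style('v270.11')  # remove_ten_per_call
puts remove_style('v271.6')   # remove_with_count
puts remove_style('v271.7.0') # remove_all_excess
```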
Usually only `vms` tasks fill up the database, because Prometheus and/or scripts keep polling VM state with `bosh vms` requests. In this case, the cleanup can be automated by creating and executing a bash script like the ones below.
For Ops Manager 2.10.0 or below (BOSH release v270.11 or below):

#!/bin/bash
/var/vcap/jobs/director/bin/console <<EOF
count=(Bosh::Director::Models::Task.where(type: 'vms').count-2000)/10
if count > 0
  tr=Bosh::Director::Api::TaskRemover.new(2000)
  count.times {tr.remove('vms')}
end
EOF

For Ops Manager 2.10.4 ~ 2.10.8 (BOSH release v270.12 ~ v271.6):

#!/bin/bash
/var/vcap/jobs/director/bin/console <<EOF
count=Bosh::Director::Models::Task.where(type: 'vms').count-2000
if count > 0
  tr=Bosh::Director::Api::TaskRemover.new(2000)
  tr.remove('vms', count)
end
EOF

For Ops Manager 2.10.9 and above (BOSH release v271.7 and above):

#!/bin/bash
/var/vcap/jobs/director/bin/console <<EOF
count=Bosh::Director::Models::Task.where(type: 'vms').count-2000
if count > 0
  tr=Bosh::Director::Api::TaskRemover.new(2000)
  tr.remove('vms')
end
EOF