How to clean up stale BOSH tasks history from BOSH Director console

Article ID: 293448

Products

Operations Manager

Issue/Introduction

BOSH Director manages task history:
  • In the database for tasks metadata.
  • On disk, /var/vcap/store/director/tasks, for tasks debug logs.

BOSH Director keeps only the latest 2000 tasks for each task type (vms, ssh, snapshot_deployment, update_deployment, update_release, run_errand, and so on). However, due to race conditions and performance issues, older tasks may not be cleaned up correctly.

As a result, the tasks table in the database and the disk utilization under /var/vcap/store/director/tasks keep growing, which causes significant resource and performance issues on the BOSH Director VM.
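To gauge how much disk the task debug logs occupy, the directory size can be summed up. The following is a minimal Ruby sketch using only the standard library; the helper name `directory_size_bytes` is hypothetical and not part of BOSH, and the script is assumed to run as a user that can read the directory.

```ruby
require 'find'

# Hypothetical helper (not part of BOSH): sums the sizes of all regular
# files under the given directory, including subdirectories.
def directory_size_bytes(path)
  total = 0
  Find.find(path) do |entry|
    total += File.size(entry) if File.file?(entry)
  end
  total
end

# Path taken from the article; the guard keeps the sketch runnable elsewhere.
tasks_dir = '/var/vcap/store/director/tasks'
if Dir.exist?(tasks_dir)
  mib = directory_size_bytes(tasks_dir) / (1024.0 * 1024.0)
  puts format('%s uses %.1f MiB', tasks_dir, mib)
end
```

Running the sketch before and after a cleanup gives a rough indication of how much space the removed debug logs were taking.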

BOSH Director release v271.7.0, which is included in Ops Manager v2.10.9, improved the way task cleanup happens. Since this release, scheduled task cleanups handle large numbers of tasks better. Previously, all tasks were loaded into memory to be destroyed, which could crash the job if there were too many tasks to clean up at once.


Resolution

IMPORTANT: This solution is NOT applicable when the foundation meets both of the following conditions:

  • Ops Manager 2.10.3 or below (BOSH v270 or below)
  • Huge number of tasks, such as a million or more

With the BOSH Director delivered with Ops Manager 2.10.3 or below, only 10 tasks can be deleted at a time. In addition, before each deletion BOSH queries and sorts a large dataset, which generates very high load on the database and causes slowness. Please contact Tanzu Support if your foundation meets both conditions above.

To identify if a BOSH Director VM encountered the stale tasks problem, please follow the instructions below:

1. Run `ssh vcap@<director_IP>`. The vcap user password can be found in the Ops Manager UI.

2. Open the BOSH Director console: `/var/vcap/jobs/director/bin/console`. 

3. Execute the following query:
Bosh::Director::Models::Task.group_and_count(:type).all.map {|a| "type: #{a[:type]} count: #{a[:count]}"} 
 

The above query will return task counts for each type:

[
  "type: cck_apply count: 2",
  "type: cck_scan count: 3",
  "type: cck_scan_and_fix count: 22",
  "type: delete_artifacts count: 21",
  "type: delete_deployment count: 2001",
  "type: delete_stemcell count: 1",
  "type: fetch_logs count: 35",
  "type: run_errand count: 210",
  "type: scheduled_dns_blobs_cleanup count: 2001",
  "type: scheduled_events_cleanup count: 1475",
  "type: scheduled_orphaned_disk_cleanup count: 6",
  "type: scheduled_orphaned_vm_cleanup count: 9",
  "type: snapshot_deployment count: 1649",
  "type: snapshot_deployments count: 63",
  "type: snapshot_self count: 63",
  "type: ssh count: 1296",
  "type: update_deployment count: 2001",
  "type: update_release count: 580",
  "type: update_stemcell count: 4",
  "type: vms count: 25780"
]

 

In this sample output, the type `vms` has a count far larger than the maximum limit of 2000. Those tasks were not cleaned up by BOSH Director correctly.
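Reading the counts by eye can be replaced with a small filter. The sketch below is plain Ruby that works on hashes shaped like the rows of the query result; the helper name `stale_task_types` and the `tolerance` parameter are assumptions of this example (the tolerance of 1 reflects that the sample output shows counts of 2001 that the article does not treat as stale).

```ruby
# Maximum number of tasks the Director keeps per type, per the article.
MAX_TASKS_PER_TYPE = 2000

# Hypothetical helper: returns the task types whose count exceeds the limit.
# rows      - array of hashes such as { type: 'vms', count: 25780 }
# tolerance - assumed slack before a type is flagged (not a BOSH setting)
def stale_task_types(rows, limit: MAX_TASKS_PER_TYPE, tolerance: 1)
  rows
    .select { |row| row[:count] > limit + tolerance }
    .map { |row| { type: row[:type], excess: row[:count] - limit } }
end

# A few rows copied from the sample output.
sample = [
  { type: 'delete_deployment', count: 2001 },
  { type: 'ssh', count: 1296 },
  { type: 'vms', count: 25_780 }
]

stale_task_types(sample).each do |row|
  puts "#{row[:type]}: #{row[:excess]} tasks over the limit"
end
```

With the sample rows, only `vms` is reported, with 23780 tasks over the limit.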

For Ops Manager 2.10.3 or below (BOSH release v270.11 or below), to clean up these stale tasks manually, follow the directions below:

1. We suggest you stop any heavy loads against BOSH Director during the operation.

2. Older tasks can be removed from both the database and disk (debug logs) with the TaskRemover API. In this version, each call deletes a fixed number of 10 tasks.

3. Count how many times the deletion should be executed: (25780 - 2000)/10 = 2378.

4. Create a TaskRemover object: `tr=Bosh::Director::Api::TaskRemover.new(2000)`

5. Repeat the `remove` call 2378 times and keep the recent 2000 tasks only - `2378.times {tr.remove('vms')}`


6. Confirm the tasks are removed with:
- `Bosh::Director::Models::Task.group_and_count(:type).all.map {|a| "type: #{a[:type]} count: #{a[:count]}"}` 
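The arithmetic in step 3 can be sketched as a small Ruby helper. The name `remove_calls_needed` is hypothetical; the batch size of 10 and the limit of 2000 come from the article. The helper also reports how many tasks remain above the limit when the excess is not a multiple of 10, since the article's formula rounds down.

```ruby
# Batch size of each remove call on older Directors, as stated in the article.
BATCH_SIZE = 10

# Hypothetical helper: returns [calls, leftover], where calls is the number
# of full batches needed and leftover is the number of tasks still above
# the limit after those calls.
def remove_calls_needed(current_count, keep: 2000, batch: BATCH_SIZE)
  excess = current_count - keep
  return [0, 0] if excess <= 0

  excess.divmod(batch)
end

calls, leftover = remove_calls_needed(25_780)
puts "remove calls: #{calls}, tasks left over: #{leftover}"
```

For the sample count of 25780 this yields 2378 calls with nothing left over, matching step 3.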


For Ops Manager 2.10.4 ~ 2.10.8 (BOSH release v270.12 ~ v271.6), more than 10 tasks can be deleted per tr.remove() call, so the above steps can be simplified to:

1. Count how many tasks should be removed: 25780 - 2000 = 23780.

2. Create a TaskRemover object: `tr=Bosh::Director::Api::TaskRemover.new(2000)`

3. Remove 23780 tasks and keep the recent 2000 tasks only: `tr.remove('vms', 23780)`

4. Confirm the tasks are removed with:

- `Bosh::Director::Models::Task.group_and_count(:type).all.map {|a| "type: #{a[:type]} count: #{a[:count]}"}` 


For Ops Manager 2.10.9 and above (BOSH release v271.7 and above), a single tr.remove() call deletes all tasks except the most recent ones, whose number is specified in Bosh::Director::Api::TaskRemover.new(). The count parameter has been removed from the method, so the steps become:

1. Count how many tasks should be removed: 25780 - 2000 = 23780.

2. Create a TaskRemover object: `tr=Bosh::Director::Api::TaskRemover.new(2000)`

3. Remove 23780 tasks and keep the recent 2000 tasks only: `tr.remove('vms')`

4. Confirm the tasks are removed with:

- `Bosh::Director::Models::Task.group_and_count(:type).all.map {|a| "type: #{a[:type]} count: #{a[:count]}"}` 
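Which of the three procedures applies depends on the BOSH release version. The following Ruby sketch encodes the version ranges given in this article using `Gem::Version` from the Ruby standard distribution. The helper name `cleanup_strategy` and the returned symbols are hypothetical labels for this example, not BOSH identifiers.

```ruby
require 'rubygems'

# Hypothetical helper: maps a BOSH release version string to the cleanup
# procedure described in the article for that version range.
def cleanup_strategy(bosh_release_version)
  # Gem::Version rejects a leading "v", so strip it first.
  version = Gem::Version.new(bosh_release_version.delete_prefix('v'))

  if version >= Gem::Version.new('271.7')
    :single_call          # tr.remove('vms')
  elsif version >= Gem::Version.new('270.12')
    :remove_with_count    # tr.remove('vms', count)
  else
    :fixed_batches_of_10  # repeat tr.remove('vms')
  end
end

%w[v270.11.1 v270.12.0 v271.7.0].each do |v|
  puts "#{v}: #{cleanup_strategy(v)}"
end
```

Comparing with `Gem::Version` rather than plain strings avoids mistakes such as treating 270.9 as newer than 270.12.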


Usually only `vms` tasks fill up the database, because Prometheus and/or some scripts keep polling VM state with `bosh vms` requests. In this case, the cleanup can be automated by creating a bash script like the ones below and executing it.

  • For Ops Manager 2.10.3 or below (BOSH release v270 or below):
#!/bin/bash
/var/vcap/jobs/director/bin/console <<EOF
count=(Bosh::Director::Models::Task.where(type: 'vms').count-2000)/10
if count > 0
    tr=Bosh::Director::Api::TaskRemover.new(2000)
    count.times {tr.remove('vms')}
end
EOF
  • For Ops Manager 2.10.4 ~ 2.10.8 (BOSH release v270.12 ~ v271.6):
#!/bin/bash

/var/vcap/jobs/director/bin/console <<EOF
count=Bosh::Director::Models::Task.where(type: 'vms').count-2000
if count > 0
    tr=Bosh::Director::Api::TaskRemover.new(2000)
    tr.remove('vms', count)
end
EOF
  • For Ops Manager 2.10.9 and above (BOSH release v271.7 and above), the count parameter has been removed from the remove method; the method removes all older tasks except for the most recent 2000 (the number passed to new):
#!/bin/bash

/var/vcap/jobs/director/bin/console <<EOF
count=Bosh::Director::Models::Task.where(type: 'vms').count-2000
if count > 0
    tr=Bosh::Director::Api::TaskRemover.new(2000)
    tr.remove('vms')
end
EOF