Symptoms:
You notice on your Healthwatch dashboard that the BOSH Director is failing, but when you check Ops Manager and your IaaS, the Director VM is up and running.
Check whether any bosh-health-check tasks are failing:
$ bosh tasks --recent=100
Tue Oct 16 22:09:33 UTC 2018  Tue Oct 16 22:09:45 UTC 2018  p-healthwatch-52c4967dfad7c1e8d74c  bosh-health-check  delete deployment bosh-health-check  Failed to acquire lock for lock:deployment:bosh-health-check uid: 231473e8-###-###-####-7b3e53863bbc. Locking task id is... 401977  error
Tue Oct 16 22:09:13 UTC 2018  Tue Oct 16 22:09:26 UTC 2018  p-healthwatch-52c4967dfad7c1e8d74c  bosh-health-check  create deployment  Failed to acquire lock for lock:deployment:bosh-health-check uid: 1beff443-###-####-####-058f9a285009. Locking task id is... 4
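To narrow the output quickly, you can filter the recent tasks for failing bosh-health-check entries (a simple grep, assuming you are running the BOSH CLI from a UNIX shell):
$ bosh tasks --recent=100 | grep bosh-health-check | grep -i error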
Get the debug logs for one of the failed tasks to find the locking task ID. In this example:
$ bosh task 401977 --debug
Find out which task ID is holding the lock. You will see something like this:
D, [2018-10-29T17:07:37.900515 #9494] [task:549603] DEBUG -- DirectorJobRunner: (0.050747s) (conn: 46982301745260) INSERT INTO `events` (`parent_id`, `timestamp`, `user`, `action`, `object_type`, `object_name`, `error`, `task`, `deployment`, `instance`, `context_json`) VALUES (295330, '2018-10-29 17:07:37', 'p-healthwatch-52c4967dfad7c1e8d74c', 'delete', 'deployment', 'bosh-health-check', 'Failed to acquire lock for lock:deployment:bosh-health-check uid: 311eb368-####-####-####-db3ae359bf87. Locking task id is 549550', '549603', 'bosh-health-check', NULL, '{}')
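If the debug output is large, you can pull out just the locking task ID instead of reading the whole log. This is a minimal sketch; the pattern matches the message format shown in the log line above:
$ bosh task 401977 --debug | grep -o "Locking task id is [0-9]*"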
From the log above, we can see that the locking task ID is 549550. Get the debug logs for that task:
$ bosh task 549550 --debug
Search for "external-cpi" logs and notice how long it takes APIs to your IaaS.
D, [2018-10-29T17:01:34.799968 #7856] [] DEBUG -- DirectorJobRunner: [external-cpi] [cpi-585407] request: {"method":"delete_vm","arguments":["vm-4a0b664f-####-###-####-07007074b3c1"],"context":{
D, [2018-10-29T17:07:27.181877 #7856] [] DEBUG -- DirectorJobRunner: [external-cpi] [cpi-585407] response: {"result":null,"error":null,"log":""}, err: , exit_status: pid 7893 exit 0
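To see all CPI calls in one pass, filter the debug log for external-cpi lines. Matching request and response lines share the same cpi-NNNNNN tag (cpi-585407 above), so the elapsed time between their timestamps is the IaaS API latency:
$ bosh task 549550 --debug | grep "external-cpi"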
`bosh-health-check` runs every 10 minutes: it creates a new BOSH deployment and then deletes that deployment. Here the test is taking longer than 10 minutes to finish, but a new test always starts 10 minutes after the previous one, so the new task fails to acquire the deployment lock still held by the running task.
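You can confirm the overlap by listing recent tasks scoped to the bosh-health-check deployment and comparing start and finish times of consecutive runs (this assumes your BOSH CLI v2 honors the global -d flag for the tasks command):
$ bosh -d bosh-health-check tasks --recent=20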
In most cases this happens because the underlying IaaS (GCP in the example above) is taking too long to create and delete the VM in the `bosh-health-check` deployment. In the log above, the `delete_vm` call alone takes nearly 6 minutes (17:01:34 to 17:07:27), which pushes the full create-and-delete cycle past the 10-minute window.
If the external CPI calls are slow, you can increase the timeout on the bosh-health-check app.
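On Healthwatch versions where the timeout is exposed as an app environment variable, the change looks roughly like this. The org/space and the variable name BOSH_TASK_TIMEOUT below are placeholders, not confirmed names for your version; run `cf env bosh-health-check` first to see what your installation actually exposes:
$ cf target -o system -s healthwatch    # assumed org/space; adjust to where Healthwatch pushed its apps
$ cf env bosh-health-check              # inspect the current environment variables
$ cf set-env bosh-health-check BOSH_TASK_TIMEOUT 20m    # hypothetical variable name
$ cf restage bosh-health-check          # restage so the new value takes effect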
Please note that if Healthwatch's push-app errand is run, it will redeploy the bosh-health-check app and revert the timeout to its default setting of 10 minutes.