Symptoms:
You notice on your Healthwatch dashboard that the BOSH Director is failing, but when you check Ops Manager and your IaaS, the Director VM is up and running.
Check whether any bosh-health-check tasks are failing:
$ bosh tasks --recent=100
Tue Oct 16 22:09:33 UTC 2018  Tue Oct 16 22:09:45 UTC 2018  p-healthwatch-52c4967dfad7c1e8d74c  bosh-health-check  delete deployment bosh-health-check  Failed to acquire lock for lock:deployment:bosh-health-check uid: 231473e8-###-###-####-7b3e53863bbc. Locking task id is... 401977  error
Tue Oct 16 22:09:13 UTC 2018  Tue Oct 16 22:09:26 UTC 2018  p-healthwatch-52c4967dfad7c1e8d74c  bosh-health-check  create deployment  Failed to acquire lock for lock:deployment:bosh-health-check uid: 1beff443-###-####-####-058f9a285009. Locking task id is... 4
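To narrow the output quickly, you can filter the recent tasks for failing bosh-health-check entries (a simple grep, assuming you are running the BOSH CLI from a UNIX shell):
$ bosh tasks --recent=100 | grep bosh-health-check | grep -i error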
Get the debug logs for one of the failed tasks to find the locking task ID. In this example:
$ bosh task 401977 --debug
Find out which task ID is holding the lock. You will see something like this:
D, [2018-10-29T17:07:37.900515 #9494] [task:549603] DEBUG -- DirectorJobRunner: (0.050747s) (conn: 46982301745260) INSERT INTO `events` (`parent_id`, `timestamp`, `user`, `action`, `object_type`, `object_name`, `error`, `task`, `deployment`, `instance`, `context_json`) VALUES (295330, '2018-10-29 17:07:37', 'p-healthwatch-52c4967dfad7c1e8d74c', 'delete', 'deployment', 'bosh-health-check', 'Failed to acquire lock for lock:deployment:bosh-health-check uid: 311eb368-####-####-####-db3ae359bf87. Locking task id is 549550', '549603', 'bosh-health-check', NULL, '{}')
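If the debug output is large, you can pull out just the locking task ID instead of reading the whole log. This is a minimal sketch; the pattern matches the message format shown in the log line above:
$ bosh task 401977 --debug | grep -o "Locking task id is [0-9]*"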
From the log above, we can see that the locking task ID is 549550. Get the debug logs for that task:
$ bosh task 549550 --debug
Search for "external-cpi" logs and notice how long it takes APIs to your IaaS.
D, [2018-10-29T17:01:34.799968 #7856] [] DEBUG -- DirectorJobRunner: [external-cpi] [cpi-585407] request: {"method":"delete_vm","arguments":["vm-4a0b664f-####-###-####-07007074b3c1"],"context":{
D, [2018-10-29T17:07:27.181877 #7856] [] DEBUG -- DirectorJobRunner: [external-cpi] [cpi-585407] response: {"result":null,"error":null,"log":""}, err: , exit_status: pid 7893 exit 0
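To see all CPI calls in one pass, filter the debug log for external-cpi lines. Matching request and response lines share the same cpi-NNNNNN tag (cpi-585407 above), so the elapsed time between their timestamps is the IaaS API latency:
$ bosh task 549550 --debug | grep "external-cpi"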
`bosh-health-check` runs every 10 minutes: it creates a new BOSH deployment and then deletes that deployment. Here the test is taking longer than 10 minutes to finish, but a new test always starts 10 minutes after the previous one, so the new task fails to acquire the deployment lock still held by the running task.
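You can confirm the overlap by listing recent tasks scoped to the bosh-health-check deployment and comparing start and finish times of consecutive runs (this assumes your BOSH CLI v2 honors the global -d flag for the tasks command):
$ bosh -d bosh-health-check tasks --recent=20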
In most cases this happens because the underlying IaaS (GCP in the example above) is taking too long to create and delete the VM in the `bosh-health-check` deployment. In the log above, the `delete_vm` call alone takes nearly 6 minutes (17:01:34 to 17:07:27), which pushes the full create-and-delete cycle past the 10-minute window.
If the external CPI calls are slow, you can increase the timeout on the bosh-health-check app.
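On Healthwatch versions where the timeout is exposed as an app environment variable, the change looks roughly like this. The org/space and the variable name BOSH_TASK_TIMEOUT below are placeholders, not confirmed names for your version; run `cf env bosh-health-check` first to see what your installation actually exposes:
$ cf target -o system -s healthwatch    # assumed org/space; adjust to where Healthwatch pushed its apps
$ cf env bosh-health-check              # inspect the current environment variables
$ cf set-env bosh-health-check BOSH_TASK_TIMEOUT 20m    # hypothetical variable name
$ cf restage bosh-health-check          # restage so the new value takes effect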
Please note that if Healthwatch's push-app errand is run, it will redeploy the bosh-health-check app and revert the timeout to its default setting of 10 minutes.