Recreating worker instance of Concourse deployment takes over one hour to complete


Article ID: 293845

Products

Operations Manager

Issue/Introduction

When BOSH recreates a worker instance in a Concourse deployment, the operation can sometimes take over one hour to complete. The BOSH task debug log shows entries like the following:
I, [2022-05-19T23:27:35.958921 #22159] [canary_update(worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1))] INFO -- DirectorJobRunner: Updating instance worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1), changes: "recreate"
I, [2022-05-19T23:27:35.984125 #22159] [canary_update(worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1))] INFO -- DirectorJobRunner: Running pre-stop for worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1)
I, [2022-05-19T23:27:37.025683 #22159] [canary_update(worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1))] INFO -- DirectorJobRunner: Running drain for worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1)
I, [2022-05-20T00:27:39.618791 #22159] [canary_update(worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1))] INFO -- DirectorJobRunner: Stopping instance worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1)
I, [2022-05-20T00:27:44.639530 #22159] [canary_update(worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1))] INFO -- DirectorJobRunner: Running post-stop for worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1)
I, [2022-05-20T00:27:44.643924 #22159] [canary_update(worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1))] INFO -- DirectorJobRunner: Snapshots are disabled; skipping
I, [2022-05-20T00:27:44.651135 #22159] [canary_update(worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1))] INFO -- DirectorJobRunner: Deleting VM
I, [2022-05-20T00:28:24.854123 #22159] [create_missing_vm(worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1)/1)] INFO -- DirectorJobRunner: Creating missing VM
I, [2022-05-20T00:28:24.915468 #22159] [create_missing_vm(worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1)/1)] INFO -- DirectorJobRunner: Creating VM
I, [2022-05-20T00:29:38.495637 #22159] [create_missing_vm(worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1)/1)] INFO -- DirectorJobRunner: deleting arp entries for the following ip addresses: ["x.x.x.x"]
I, [2022-05-20T00:29:49.874301 #22159] [canary_update(worker/3c66f787-19bd-4fb2-8cf2-37ca03c9ce80 (1))] INFO -- DirectorJobRunner: Updating persistent disk
As the debug log shows, the drain step started at 23:27:37 and the instance was not stopped until 00:27:39, so the drain alone took about one hour and prolonged the whole recreate. The drain script on a worker can take a long time to move its workload to other workers, so the drain command specifies a timeout to keep it from running forever. By default, the timeout is set to 3600 seconds.
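To confirm that the drain step accounts for the delay, the gap between the "Running drain" and "Stopping instance" log lines can be computed directly. The following is a minimal sketch that assumes GNU `date` is available and that both timestamps are in the same time zone; the variable names are illustrative only.

```shell
# Timestamps copied from the "Running drain" and "Stopping instance" log lines
# (fractional seconds dropped).
drain_start="2022-05-19T23:27:37"
drain_end="2022-05-20T00:27:39"

# Assumption: GNU date. Convert both timestamps to epoch seconds and subtract.
start_s=$(date -u -d "$drain_start" +%s)
end_s=$(date -u -d "$drain_end" +%s)
drain_seconds=$((end_s - start_s))

echo "drain took ${drain_seconds} seconds"
```

A result close to 3600 seconds suggests that the drain ran until its timeout expired rather than finishing on its own.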

Environment

Product Version: 2.10

Resolution

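If you do not want the drain task to run for up to one hour, you can temporarily modify the drain script on the worker instance to set the timeout to a smaller value, e.g. 300 (5 minutes). The timeout is the number of seconds that follows ${DRAIN_SIGNAL} in the --retry schedule (500 in the excerpt below). This must be done on every worker.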

=> /var/vcap/jobs/worker/bin/drain
start-stop-daemon \
  --pidfile $RUN_DIR/worker.pid \
  --remove-pidfile \
  --stop \
  --oknodo \
  --retry ${DRAIN_SIGNAL}/500/TERM/15/QUIT/2/KILL
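The manual edit can be rehearsed on a scratch copy before touching a real worker. The sketch below assumes the drain script contains the `--retry` line exactly as in the excerpt above; the scratch file and the `new_timeout` variable are hypothetical names used only for illustration.

```shell
# Scratch copy holding the drain excerpt; on a real worker you would edit
# the drain script itself (after taking a backup).
scratch_drain=$(mktemp)
cat > "$scratch_drain" <<'EOF'
start-stop-daemon \
  --pidfile $RUN_DIR/worker.pid \
  --remove-pidfile \
  --stop \
  --oknodo \
  --retry ${DRAIN_SIGNAL}/500/TERM/15/QUIT/2/KILL
EOF

new_timeout=300  # seconds to wait after DRAIN_SIGNAL before sending TERM

# Replace only the number between "${DRAIN_SIGNAL}/" and "/TERM".
sed -E 's#([$][{]DRAIN_SIGNAL[}]/)[0-9]+(/TERM)#\1'"$new_timeout"'\2#' \
  "$scratch_drain" > "$scratch_drain.new"

# Show the rewritten --retry line for review.
grep -- '--retry' "$scratch_drain.new"
```

Writing the result to a separate file keeps the original intact, so the two versions can be compared with `diff` before the change is applied.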
However, this change is not persistent and is reverted the next time the worker instance is updated or recreated. To make the change persistent, add a drain_timeout property to the worker instance group in the Concourse deployment manifest, then deploy Concourse with the new manifest for the change to take effect.
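One possible way to add the property without editing the manifest by hand is a BOSH ops file. This is only a sketch under stated assumptions: the deployment name `concourse`, the instance group name `worker`, the job name `worker`, and the value `5m` are all placeholders, and the accepted value format (a duration string versus a number of seconds) should be checked against the worker job spec of the Concourse release in use.

```shell
# Assumptions for illustration only: instance group "worker", job "worker",
# and the value "5m". Adjust all of them to match your deployment.
ops_file="$(mktemp -d)/drain-timeout.yml"
cat > "$ops_file" <<'EOF'
- type: replace
  path: /instance_groups/name=worker/jobs/name=worker/properties/drain_timeout?
  value: 5m
EOF

# Review the ops file before using it.
cat "$ops_file"

# Then, with the bosh CLI targeted at the director (deployment name assumed):
#   bosh -d concourse manifest > manifest.yml
#   bosh -d concourse deploy manifest.yml -o "$ops_file"
```

The deploy commands are left as comments so that the generated ops file can be reviewed first.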