The initial symptom is high memory usage of clock-global. The “dips” in the graph below is when the job or the VM had become so unresponsive that BOSH restarted it.
Logs:
“/var/vcap/sys/clock_global/cloud_controller_clock_ctl.log” 2017-11-10 XX:XX:19+0000] E, [2017-11-10TXX:XX:19.445328 #32029] ERROR -- : can't create Thread: Resource temporarily unavailable (ThreadError) [2017-11-10 XX:XX:19+0000] /var/vcap/data/packages/cloud_controller_ng/109f95034faa3d1b74fb47b3c97b0d7c7a1b9688/cloud_controller_ng/lib/utils/workpool.rb:10:in `initialize' [2017-11-10 XX:XX:19+0000] /var/vcap/data/packages/cloud_controller_ng/109f95034faa3d1b74fb47b3c97b0d7c7a1b9688/cloud_controller_ng/lib/utils/workpool.rb:10:in `new' --------snip----------- [2017-11-10 XX:XX:19+0000] /var/vcap/packages/ruby-2.3/lib/ruby/2.3.0/timeout.rb:106:in `timeout' [2017-11-10 XX:XX:19+0000] /var/vcap/data/packages/cloud_controller_ng/109f95034faa3d1b74fb47b3c97b0d7c7a1b9688/cloud_controller_ng/app/jobs/timeout_job.rb:12:in
The number of open file handles are high. Nearly all of the open file handles are due to the clock_global process.
$ lsof | wc -l 406321
After restarting the file handles are back to normal. On tailing the logs Operators have found a new error message.
==> cloud_controller_clock_ctl.log <== [2017-11-10 XX:XX:16+0000] E, [2017-11-10TXX:XX:16.323780 #26820] ERROR -- : droplet presence (Sequel::ValidationFailed) [2017-11-10 XX:XX:16+0000] /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/sequel-4.49.0/lib/sequel/model/base.rb:1824:in `save' <snip> [2017-11-10 XX:XX:16+0000] /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/newrelic_rpm-3.17.1.326/lib/new_relic/agent.rb:443:in `disab
The number of file handles quickly grows at about a rate of 400 new handles per 30 seconds. Until the process starts giving the first error message.
The clock-global job tries to run a process called diego_sync.
In order to update the 'missing-diego-task' task’s state as failed in ccdb.
Steps on clock-global VM:
1. Find the 'missing-diego-task' in the logs of ‘cloud_controller_clock.log’
grep 'missing-diego-task' cloud_controller_clock.log | jq .data.task_guid | sort | uniq "415644dd-1133-49c3-b9d8-60ae87c78b28" "43ae4c03-94ee-40e4-8b71-71ba3a507be1" "65ee10cf-07c2-488e-bba9-8af365b5d866" "f3a423d6-7f76-49b9-a2b3-328b4b162810"
2. Update the ccdb for above tasks. Set the status as FAILED.
Example: script is
script.rb:
task_guids = ["415644dd-1133-49c3-b9d8-60ae87c78b28", "43ae4c03-94ee-40e4-8b71-71ba3a507be1", "65ee10cf-07c2-488e-bba9-8af365b5d866", "f3a423d6-7f76-49b9-a2b3-328b4b162810"] TaskModel.where(guid: task_guids).each do |task| droplet_guid = task.droplet_guid droplet = DropletModel.find(guid: droplet_guid) if droplet.nil? puts "cancelling task #{task.guid}" task.this.update(state: "FAILED", failure_reason: "manually cancelled") end end cat script.rb | /var/vcap/jobs/cloud_controller_ng/bin/console
Notes: