clock_global Job Causes High Memory Usage

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

The initial symptom is high memory usage of clock-global. The “dips” in the graph below is when the job or the VM had become so unresponsive that BOSH restarted it.

Logs:

 “/var/vcap/sys/clock_global/cloud_controller_clock_ctl.log”  

2017-11-10 XX:XX:19+0000] E, [2017-11-10TXX:XX:19.445328 #32029] 
ERROR -- : can't create Thread: Resource temporarily unavailable (ThreadError)
[2017-11-10 XX:XX:19+0000] 
/var/vcap/data/packages/cloud_controller_ng/109f95034faa3d1b74fb47b3c97b0d7c7a1b9688/cloud_controller_ng/lib/utils/workpool.rb:10:in `initialize'
[2017-11-10 XX:XX:19+0000] /var/vcap/data/packages/cloud_controller_ng/109f95034faa3d1b74fb47b3c97b0d7c7a1b9688/cloud_controller_ng/lib/utils/workpool.rb:10:in `new'

--------snip-----------

[2017-11-10 XX:XX:19+0000] /var/vcap/packages/ruby-2.3/lib/ruby/2.3.0/timeout.rb:106:in `timeout'
[2017-11-10 XX:XX:19+0000] 
/var/vcap/data/packages/cloud_controller_ng/109f95034faa3d1b74fb47b3c97b0d7c7a1b9688/cloud_controller_ng/app/jobs/timeout_job.rb:12:in

The number of open file handles are high. Nearly all of the open file handles are due to the clock_global process.

 $ lsof | wc -l 
406321

After restarting the file handles are back to normal. On tailing the logs Operators have found a new error message.

==> cloud_controller_clock_ctl.log <==
[2017-11-10 XX:XX:16+0000] 
E, [2017-11-10TXX:XX:16.323780 #26820] ERROR -- : droplet presence (Sequel::ValidationFailed)
[2017-11-10 XX:XX:16+0000]
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/sequel-4.49.0/lib/sequel/model/base.rb:1824:in `save'
<snip>
[2017-11-10 XX:XX:16+0000] 
/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.3.0/gems/newrelic_rpm-3.17.1.326/lib/new_relic/agent.rb:443:in `disab

The number of file handles quickly grows at about a rate of 400 new handles per 30 seconds. Until the process starts giving the first error message.

Environment

Cause

The clock-global job tries to run a process called diego_sync.

There may be some orphaned tasks that had been marked as failed, but the sync process could not mark them as failed as the corresponding droplet had been removed.
There is a bug in the sync routine that caused it to continuously spawn threads due to the condition above, however, the process would not clean up the old threads. This bug is introduced in ERT 1.12.3

Resolution

In order to update the 'missing-diego-task' task’s state as failed in ccdb.

Steps on clock-global VM:

1. Find the 'missing-diego-task' in the logs of ‘cloud_controller_clock.log’

grep 'missing-diego-task' cloud_controller_clock.log | jq .data.task_guid | sort | uniq
"415644dd-1133-49c3-b9d8-60ae87c78b28"
"43ae4c03-94ee-40e4-8b71-71ba3a507be1"
"65ee10cf-07c2-488e-bba9-8af365b5d866"
"f3a423d6-7f76-49b9-a2b3-328b4b162810"

2. Update the ccdb for above tasks. Set the status as FAILED.

Example: script is

script.rb:

task_guids = ["415644dd-1133-49c3-b9d8-60ae87c78b28", "43ae4c03-94ee-40e4-8b71-71ba3a507be1", 
"65ee10cf-07c2-488e-bba9-8af365b5d866", "f3a423d6-7f76-49b9-a2b3-328b4b162810"]

TaskModel.where(guid: task_guids).each do |task|
  droplet_guid = task.droplet_guid
  droplet = DropletModel.find(guid: droplet_guid)
  if droplet.nil?
    puts "cancelling task #{task.guid}"
    task.this.update(state: "FAILED", failure_reason: "manually cancelled")
   end
end
  
cat script.rb | /var/vcap/jobs/cloud_controller_ng/bin/console

Notes:

Problems with the “clock_global” job are not catastrophic, however, may lead to a degradation of the ability to push apps over time (Days to Weeks depending on how frequently you push apps)
The bug in the sync routine mentioned above may be triggered for different reasons, other than “ERROR -- : droplet presence (Sequel::ValidationFailed)”
It is important to capture the logs before the system runs out of file handles, that is directly after a restart.