Apps Continue to Serve Routes even after App Deletion due to "Mysql2::Error: This connection is in use by"

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

A failure condition has been observed where after an app is deleted, the app instances are not cleaned up from the Diego cells and continue to serve routes. Correspondingly, if KPI metrics are setup for the platform, bbs.Domain.cf-apps and bbs.Domain.cf-task metrics will not report any values.

Also, the following errors are reported in the clock_global and cloud_controller log snippets:

From cloud_controller_clock log :

018-03-08 13:19:23+0000] I, [2018-03-08T13:19:23.001275 #13390]  INFO -- : Triggering 'diego_sync.job'
[2018-03-08 13:19:23+0000] E, [2018-03-08T13:19:23.005365 #13390] ERROR -- : Mysql2::Error: 
This connection is in use by: #<Thread:0x007fda8585c548@/var/vcap/data/packages/cloud_controller_ng
/35057b24ca81c7d0353dcc98c12429ee0b1210df/cloud_controller_ng/lib/utils/workpool.rb:10 dead> 
(Sequel::DatabaseError)
[2018-03-08 13:19:23+0000] /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby
/2.3.0/gems/mysql2-0.4.8/lib/mysql2/client.rb:120:in `_query'
From cloud_controller_ng log :

{"timestamp":1520515163.0045354,"message":"Mysql2::Error: This connection is in use by: 
#<Thread:0x007fda8585c548@/var/vcap/data/packages/cloud_controller_ng/35057b24ca81c7d0353dcc98c12429ee0b1210df
/cloud_controller_ng/lib/utils/workpool.rb:10 dead>: SELECT * FROM `clock_jobs` WHERE (`name` = 'diego_sync') 
LIMIT 1","log_level":"error","source":"cc.background","data":{},"thread_id":70288275466120,"fiber_id":
70287992085180,"process_id":13390,"file":"/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/
bundle/ruby/2.3.0/gems/sequel-4.49.0/lib/sequel/database/logging.rb","lineno":88,"method":"block in log_each"}
{"timestamp":1520515163.009266,"message":"Mysql2::Error: This connection is in use by: 
#<Thread:0x007fda8585c548@/var/vcap/data/packages/cloud_controller_ng/35057b24ca81c7d0353dcc98c12429ee0b1210df
/cloud_controller_ng/lib/utils/workpool.rb:10 dead> (Sequel::DatabaseError)","log_level":"error","source":
"cc.clock","data":{},"thread_id":70288275466120,"fiber_id":70287992085180,"process_id":13390,"file":
"/var/vcap/data/packages/cloud_controller_ng/35057b24ca81c7d0353dcc98c12429ee0b1210df/cloud_controller_ng
/lib/cloud_controller/clock/scheduler.rb","lineno":37,"method":"block in start"}

Environment

Cause

In PCF 1.12 and above, the task of maintaining a consistent state of the apps between the cloud controller and the Diego is accomplished by the diego_sync task in the clock_global job. In certain situations, when worker threads in the "diego_sync task" get killed while still holding on to a database connection, subsequent threads will not be able to get hold of the database connection. This will cause the diego_sync task to not complete. The clock global and cloud controller will log errors related to this - see snippets above. This results in an inconsistent state where an app that is deleted via the cloud controller API call is not cleaned up from the Diego cells.

Resolution

Login to the clock_global VM and restart the cloud_controller_clock process via monit :

monit restart cloud_controller_clock

Fix

This failure condition will be handled in an appropriate manner by the platform in a future PCF release.

Note: Prior to PCF 1.12, the task of reconciling the state of apps between the cloud controller and the Diego was accomplished by the nsync_bulker as part of the Cloud Controller Bridge in the diego_brain VM. This functionality is now part of the cloud controller clock global job. See release notes here.