Application deployments fail when pushing apps via "cf push" or through CI/CD with the following error:
"Error staging application: Staging error: staging failed"
Corresponding errors are seen in the Diego bbs logs:
{"timestamp":"1508793344.179113150","source":"bbs","message":"bbs.locket-lock.lost-lock","log_level": 2,"data":{"error":"rpc error: code = 4 desc = context deadline exceeded","lock":{"key":"bbs","owner": "d9072dec-d61c-4948-a2a6-cdca664462c8","type":"lock"},"session":"2","ttl_in_seconds":15}} {"timestamp":"1508793344.179389238","source":"bbs","message":"bbs.locket-lock.completed","log_level": 1,"data":{"lock":{"key":"bbs","owner":"d9072dec-d61c-4948-a2a6-cdca664462c8","type":"lock"},"session": "2","ttl_in_seconds":15}}
Download the bbs and MySQL logs using the following instructions. The above issue can occur due to a couple of conditions, described below:
Check the timestamps of the bbs restarts in the /var/vcap/sys/log/monit/bbs_ctl.log file:
------------ STARTING bbs_ctl at Mon Oct 23 20:45:52 UTC 2017 --------------
Removing stale pidfile
------------ bbs_ctl start at Mon Oct 23 21:00:53 UTC 2017 --------------
------------ STARTING bbs_ctl at Mon Oct 23 21:00:53 UTC 2017 --------------
Removing stale pidfile
------------ bbs_ctl start at Mon Oct 23 21:15:55 UTC 2017 --------------
------------ STARTING bbs_ctl at Mon Oct 23 21:15:55 UTC 2017 --------------
Removing stale pidfile
------------ bbs_ctl start at Mon Oct 23 21:30:57 UTC 2017 --------------
------------ STARTING bbs_ctl at Mon Oct 23 21:30:57 UTC 2017 --------------
Removing stale pidfile
------------ bbs_ctl start at Mon Oct 23 21:45:58 UTC 2017 --------------
------------ STARTING bbs_ctl at Mon Oct 23 21:45:58 UTC 2017 --------------
Removing stale pidfile
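To see the restart cadence at a glance, the STARTING entries can be filtered out of the same log (a minimal sketch using the log path referenced above):

grep "STARTING bbs_ctl" /var/vcap/sys/log/monit/bbs_ctl.log | tail -20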
If the restarts occur every 15 minutes starting at the top of the hour, then this is related to the logrotate cron task that rotates the MySQL audit files and causes periodic bursts of high I/O writes. During this short window of high I/O, the locket service does not get a timely response back from the database, causing locket to lose the lock temporarily. Once the logrotate task completes, the locket service is able to re-acquire the lock successfully.
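To correlate the bbs restarts with the cron-driven logrotate runs, check the cron activity on the MySQL VM around those timestamps. This is a sketch only: it assumes a standard Ubuntu-based stemcell where cron activity is written to /var/log/syslog, and the crontab that owns the logrotate entry may vary by deployment:

# Recent cron activity; look for logrotate runs near the bbs restart times
grep CRON /var/log/syslog | tail -40
# List cron entries to find the logrotate schedule (owner may vary)
crontab -l -u root
crontab -l -u vcap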
To confirm the periodic high I/O writes, check the output of the pidstat command on the primary MySQL VM:
# pidstat -dl 60
08:29:56 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
08:30:56 PM     0       169      0.00      2.27      0.00  jbd2/vda1-8
08:30:56 PM     0       367      0.53      0.00      0.00  kworker/u8:0
-> 08:30:56 PM  0       758      5.20  27113.67     98.53  cron   <---
08:30:56 PM     0       908      0.00      0.47      0.00  svlogd -tt /var/vcap/bosh/log
08:30:56 PM     0      1733      0.00      3.87      0.00  jbd2/vda3-8
08:30:56 PM     0      1757      0.00      0.27      0.00  /sbin/auditd
08:30:56 PM     0      1809      0.00      0.40      0.00  /var/vcap/bosh/bin/monit -I -c /var/vcap/bosh/etc/monitrc
08:30:56 PM     0      9762      0.00      0.13      0.00  /var/vcap/bosh/bin/bosh-agent -P ubuntu -C /var/vcap/bosh/agent.json
08:30:56 PM   101     14960      0.00      1.07      0.00  /usr/sbin/rsyslogd
08:30:56 PM  1000     15772      0.00      0.07      0.00  tee -a /var/vcap/sys/log/consul_agent/consul_agent.stdout.log
08:30:56 PM  1000     15783      0.00      0.07      0.00  /var/vcap/packages/consul/bin/consul agent -config-dir=/var/vcap/jobs/consul_agent/config -recursor=10.81.101.21 -recursor=10.8
08:30:56 PM  1000     15813      0.07      0.00      0.00  /var/vcap/packages/metron_agent/metron --config /var/vcap/jobs/metron_agent/config/metron_agent.json
08:30:56 PM  1000     16256     26.67   2042.00     68.93  /var/vcap/packages/mariadb/bin/mysqld --basedir=/var/vcap/packages/mariadb --datadir=/var/vcap/store/mysql --plugin-dir=/var/vc
08:30:56 PM  1000     16745      0.00      0.20      0.00  /var/vcap/packages/cf-mysql-cluster-health-logger/bin/cf-mysql-cluster-health-logger -configPath=/var/vcap/jobs/mysql/config/cl
08:30:56 PM  1000     16869      0.00      0.33      0.00  /var/vcap/packages/mysql-metrics/bin/mysql-metrics -c=/var/vcap/jobs/mysql-metrics/config/mysql-metrics-config.yml -l=/var/vcap
08:30:56 PM     0     26051      0.00  21884.33      0.00  jbd2/vdc1-
By default, audit logging is enabled.
Disable MySQL audit logging to work around this issue. Follow the instructions in Step 10 of the documentation here.
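To confirm whether audit logging is currently active on the MariaDB node, the audit plugin's variables can be inspected. This is a sketch only: it assumes the MariaDB Audit Plugin used by cf-mysql and the bundled client at /var/vcap/packages/mariadb/bin/mysql; adjust the client path and credentials for your deployment:

# Client path and credentials are assumptions; adjust for your deployment
/var/vcap/packages/mariadb/bin/mysql -u root -p \
  -e "SHOW GLOBAL VARIABLES LIKE 'server_audit%';"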
This is a more severe case in which the locket service loses the BBS lock and another BBS instance grabs it. See the error snippet "Error 1205: Lock wait timeout exceeded; try restarting transaction" from the bbs logs:
"timestamp":"1507608220.414745331","source":"bbs", "message":"bbs.request.remove-actual-lrp.failed-removing-actual-lrp", "log_level":2,"data":{"error":"Error 1205: Lock wait timeout exceeded; try restarting transaction","index":0, "method":"POST","process_guid":"d15f873b-c295-43c8-a696-83f515cdde32-f166d12d-d97b-4d25-86e7-87ced6054e98", "request":"/v1/actual_lrps/remove", "session":"4194.1"}}
This error means the locket service's query to the MySQL database waited for a row lock for longer than a configurable number of seconds (MySQL's innodb_lock_wait_timeout, which defaults to 50 seconds).
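The current value can be verified on the MySQL node. As above, the client path below is an assumption based on a typical cf-mysql deployment:

# Client path and credentials are assumptions; adjust for your deployment
/var/vcap/packages/mariadb/bin/mysql -u root -p \
  -e "SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';"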
In this condition, each BBS job cycles through losing the lock and grabbing it again. While this is happening, app staging fails, which is the undesirable effect seen above.
Using the "iostat -mx 1" command, check for "%util" column. If there is a sustained value of over 90%, then this is likely attributed to non-performant storage backend used by the persistent store for the MySQL VM.
Check storage performance at the infrastructure level to identify other I/O bottlenecks and capacity sizing issues.
Important Note
In older versions of the Elastic Runtime, consul provided the distributed locking functionality used by Diego components such as the BBS. In PCF 1.11, that locking mechanism has been replaced with a new service called locket.
Additional information on the locket service can be found in the Diego design notes. Also check the PCF 1.11 release notes for details on the locket service.
To check which job currently holds the lock, use the "cfdot locks" command from any Diego BBS or Diego brain VM.
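A minimal sketch of running it, assuming cfdot is available on the VM's PATH (on some Diego versions the cfdot environment must be configured first, for example by sourcing a cfdot setup script if your deployment provides one):

# Run from a Diego BBS (diego_database) or Diego brain VM
cfdot locks

The owner field in the output identifies the instance currently holding the bbs lock.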