Application deployments fail when pushing apps via "cf push" or through CI/CD with the following error:
"Error staging application: Staging error: staging failed"
Corresponding errors are seen in the Diego bbs logs:
{"timestamp":"1508793344.179113150","source":"bbs","message":"bbs.locket-lock.lost-lock","log_level": 2,"data":{"error":"rpc error: code = 4 desc = context deadline exceeded","lock":{"key":"bbs","owner": "d9072dec-d61c-4948-a2a6-cdca664462c8","type":"lock"},"session":"2","ttl_in_seconds":15}} {"timestamp":"1508793344.179389238","source":"bbs","message":"bbs.locket-lock.completed","log_level": 1,"data":{"lock":{"key":"bbs","owner":"d9072dec-d61c-4948-a2a6-cdca664462c8","type":"lock"},"session": "2","ttl_in_seconds":15}}
Download the bbs and MySQL logs using the following instructions. The above issue can occur due to a couple of conditions, described below:
Check the timestamps of the bbs restarts in the /var/vcap/sys/log/monit/bbs_ctl.log file:
------------ STARTING bbs_ctl at Mon Oct 23 20:45:52 UTC 2017 --------------
Removing stale pidfile
------------ bbs_ctl start at Mon Oct 23 21:00:53 UTC 2017 --------------
------------ STARTING bbs_ctl at Mon Oct 23 21:00:53 UTC 2017 --------------
Removing stale pidfile
------------ bbs_ctl start at Mon Oct 23 21:15:55 UTC 2017 --------------
------------ STARTING bbs_ctl at Mon Oct 23 21:15:55 UTC 2017 --------------
Removing stale pidfile
------------ bbs_ctl start at Mon Oct 23 21:30:57 UTC 2017 --------------
------------ STARTING bbs_ctl at Mon Oct 23 21:30:57 UTC 2017 --------------
Removing stale pidfile
------------ bbs_ctl start at Mon Oct 23 21:45:58 UTC 2017 --------------
------------ STARTING bbs_ctl at Mon Oct 23 21:45:58 UTC 2017 --------------
Removing stale pidfile
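To see the restart cadence at a glance, the STARTING entries can be filtered out of the same log (a minimal sketch using the log path referenced above):

grep "STARTING bbs_ctl" /var/vcap/sys/log/monit/bbs_ctl.log | tail -20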
If the restarts occur every 15 minutes starting at the top of the hour, then this is related to the logrotate cron task that rotates the MySQL audit files and causes periodic bursts of high I/O writes. During this short window of high I/O, the locket service does not get a timely response back from the database, causing locket to lose the lock temporarily. Once the logrotate task completes, the locket service is able to re-acquire the lock successfully.
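To correlate the bbs restarts with the cron-driven logrotate runs, check the cron activity on the MySQL VM around those timestamps. This is a sketch only: it assumes a standard Ubuntu-based stemcell where cron activity is written to /var/log/syslog, and the crontab that owns the logrotate entry may vary by deployment:

# Recent cron activity; look for logrotate runs near the bbs restart times
grep CRON /var/log/syslog | tail -40
# List cron entries to find the logrotate schedule (owner may vary)
crontab -l -u root
crontab -l -u vcap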
To confirm the periodic high I/O writes, check the output of the pidstat command on the primary MySQL VM:
# pidstat -dl 60
08:29:56 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
08:30:56 PM     0       169      0.00      2.27      0.00  jbd2/vda1-8
08:30:56 PM     0       367      0.53      0.00      0.00  kworker/u8:0
-> 08:30:56 PM  0       758      5.20  27113.67     98.53  cron   <---
08:30:56 PM     0       908      0.00      0.47      0.00  svlogd -tt /var/vcap/bosh/log
08:30:56 PM     0      1733      0.00      3.87      0.00  jbd2/vda3-8
08:30:56 PM     0      1757      0.00      0.27      0.00  /sbin/auditd
08:30:56 PM     0      1809      0.00      0.40      0.00  /var/vcap/bosh/bin/monit -I -c /var/vcap/bosh/etc/monitrc
08:30:56 PM     0      9762      0.00      0.13      0.00  /var/vcap/bosh/bin/bosh-agent -P ubuntu -C /var/vcap/bosh/agent.json
08:30:56 PM   101     14960      0.00      1.07      0.00  /usr/sbin/rsyslogd
08:30:56 PM  1000     15772      0.00      0.07      0.00  tee -a /var/vcap/sys/log/consul_agent/consul_agent.stdout.log
08:30:56 PM  1000     15783      0.00      0.07      0.00  /var/vcap/packages/consul/bin/consul agent -config-dir=/var/vcap/jobs/consul_agent/config -recursor=10.81.101.21 -recursor=10.8
08:30:56 PM  1000     15813      0.07      0.00      0.00  /var/vcap/packages/metron_agent/metron --config /var/vcap/jobs/metron_agent/config/metron_agent.json
08:30:56 PM  1000     16256     26.67   2042.00     68.93  /var/vcap/packages/mariadb/bin/mysqld --basedir=/var/vcap/packages/mariadb --datadir=/var/vcap/store/mysql --plugin-dir=/var/vc
08:30:56 PM  1000     16745      0.00      0.20      0.00  /var/vcap/packages/cf-mysql-cluster-health-logger/bin/cf-mysql-cluster-health-logger -configPath=/var/vcap/jobs/mysql/config/cl
08:30:56 PM  1000     16869      0.00      0.33      0.00  /var/vcap/packages/mysql-metrics/bin/mysql-metrics -c=/var/vcap/jobs/mysql-metrics/config/mysql-metrics-config.yml -l=/var/vcap
08:30:56 PM     0     26051      0.00  21884.33      0.00  jbd2/vdc1-
By default, audit logging is enabled.
Disable MySQL audit logging to work around this issue. Follow the instructions in Step 10 of the documentation here.
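To confirm whether audit logging is currently active on the MariaDB node, the audit plugin's variables can be inspected. This is a sketch only: it assumes the MariaDB Audit Plugin used by cf-mysql and the bundled client at /var/vcap/packages/mariadb/bin/mysql; adjust the client path and credentials for your deployment:

# Client path and credentials are assumptions; adjust for your deployment
/var/vcap/packages/mariadb/bin/mysql -u root -p \
  -e "SHOW GLOBAL VARIABLES LIKE 'server_audit%';"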
This is a more severe case in which the locket service loses the BBS lock and another BBS instance grabs it. See the error snippet "Error 1205: Lock wait timeout exceeded; try restarting transaction" from the bbs logs:
"timestamp":"1507608220.414745331","source":"bbs", "message":"bbs.request.remove-actual-lrp.failed-removing-actual-lrp", "log_level":2,"data":{"error":"Error 1205: Lock wait timeout exceeded; try restarting transaction","index":0, "method":"POST","process_guid":"d15f873b-c295-43c8-a696-83f515cdde32-f166d12d-d97b-4d25-86e7-87ced6054e98", "request":"/v1/actual_lrps/remove", "session":"4194.1"}}
This error means the locket service's query to the MySQL database waited for a row lock for longer than a configurable number of seconds (MySQL's innodb_lock_wait_timeout, which defaults to 50 seconds).
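The current value can be verified on the MySQL node. As above, the client path below is an assumption based on a typical cf-mysql deployment:

# Client path and credentials are assumptions; adjust for your deployment
/var/vcap/packages/mariadb/bin/mysql -u root -p \
  -e "SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';"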
In this condition, each BBS job cycles through losing the lock and grabbing it again. While this is happening, app staging fails, which is the undesirable effect seen above.
Using the "iostat -mx 1" command, check for "%util" column. If there is a sustained value of over 90%, then this is likely attributed to non-performant storage backend used by the persistent store for the MySQL VM.
Check storage performance at the infrastructure level to identify other I/O bottlenecks and capacity sizing issues.
Important Note
In older versions of the Elastic Runtime, consul provided the distributed locking functionality used by Diego components such as the BBS. In PCF 1.11, that locking mechanism has been replaced with a new service called locket.
Additional information on the locket service can be found in the Diego design notes. Also check the PCF 1.11 release notes for details on the locket service.
To check which job currently holds the lock, use the "cfdot locks" command from any Diego BBS or Diego brain VM.
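A minimal sketch of running it, assuming cfdot is available on the VM's PATH (on some Diego versions the cfdot environment must be configured first, for example by sourcing a cfdot setup script if your deployment provides one):

# Run from a Diego BBS (diego_database) or Diego brain VM
cfdot locks

The owner field in the output identifies the instance currently holding the bbs lock.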