CF-MySQL-Broker Fails to Start in PCF

Article ID: 297817

Updated On:

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

The P-MySQL service brokers enable applications to bind to MySQL databases in the Pivotal MySQL database cluster. The service broker can fail to start if the queries it makes to its database take longer than 20 seconds. If the service broker(s) are unavailable, you will not be able to create/delete or bind/unbind P-MySQL services.

Environment


Cause

Retrieve the service broker logs by downloading the logs for the Broker from the Status and Logs tabs of the MySQL tile in the Operations Manager web interface. Also download the MySQL server logs for each node. See "Viewing Logs for Pivotal Application Service Components" for instructions on how to retrieve these.

In the CF-MySQL broker log bundle, check to see if the cf-mysql-broker.log contains the following error:

Failed to sync broker plans at Tue Feb 14 12:51:03 UTC 2017
rake aborted!
ActiveRecord::StatementInvalid: Mysql2::Error: Timeout waiting for a response from the last query. (waited 20 seconds): SELECT mysql.user.user
FROM service_instances
JOIN mysql.db ON service_instances.db_name=mysql.db.Db
JOIN mysql.user ON mysql.user.User=mysql.db.User
WHERE plan_guid='17d793e6-6da6-4f0e-b58d-364a407166a0' AND mysql.user.user NOT LIKE 'root'

In the logs you downloaded previously for this MySQL node, check to see if you are getting errors like:

[ERROR] Slave SQL: ... Internal MariaDB error code: 1050

You can also check to see how frequently these errors are occurring in the log:

cat XXX.log | grep "error code: 1050" | wc -l

These errors occur because Galera replication treats each of these failures as a failed replication transaction and writes out a transaction log file (a GRA_*.log file) for it. If there are a great many of these, it is possible that Galera has filled the MySQL node's /var/vcap/store/mysql directory with these log files, making any access that searches for files or directories under this directory extremely slow. Note that other classes of SQL failure can cause the same result, so if you see other failures, you should still check whether these log files are the source of the problem.

In one example of this, there were 9 million Galera log files, and the service broker queries took 6 minutes to run; the CF-MySQL-Broker only waits 20 seconds.
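
A quick way to gauge whether you are in this state is to count the GRA_*.log files that Galera leaves in the data directory. A minimal sketch, assuming the standard /var/vcap/store/mysql data directory referenced above; ls -f skips sorting, which matters when the directory holds millions of entries:

cd /var/vcap/store/mysql
ls -f | grep -c '^GRA_'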

The CF-MySQL-Broker query looks like this:

SELECT mysql.user.user FROM service_instances
JOIN mysql.db ON service_instances.db_name=mysql.db.Db
JOIN mysql.user ON mysql.user.User=mysql.db.User;

The problem exposes itself here because this query JOINs to tables in another database (mysql). MySQL has to locate that database on the filesystem, and with millions of extra files in the data directory that lookup takes an extremely long time.
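
If you want to confirm the slowdown directly, you can time the same query from the MySQL node itself. A rough sketch; <admin_password> and <broker_db> are placeholders for the MySQL admin credentials and the name of the broker's database in your deployment (the credentials are available from the MySQL tile in Operations Manager), and the mysql client may live under /var/vcap/packages rather than on the PATH:

time mysql -u root -p'<admin_password>' <broker_db> -e "SELECT mysql.user.user FROM service_instances JOIN mysql.db ON service_instances.db_name=mysql.db.Db JOIN mysql.user ON mysql.user.User=mysql.db.User;"

If this takes anywhere near 20 seconds or longer, the broker will hit its timeout.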

To resolve this issue, it is important to know whether the Galera logs are still being produced, that is, whether errors like the following are still appearing in the MySQL error log:

[ERROR] Slave SQL: Error ...., Internal MariaDB error code: 1050

If the above error continues to be produced, you will need to find the application that is producing it. Look at the query in the ERROR message; it shows the database name and the query that is failing.

For example:

2017-02-14 14:58:21 139892976204544 [ERROR] Slave SQL: Error 'Table 'mybadtable' already exists' on query. Default database: 'cf_17411c80_76f5_90d4_326d_919283012341'. Query: 'CREATE TABLE `mybadtable` (
 `id` BIGINT NOT NULL,
 PRIMARY KEY(`id`))', Internal MariaDB error code: 1050

In this example, the database is cf_17411c80_76f5_90d4_326d_919283012341, and the query is failing because an app is trying to create a table that already exists. In this case the app retried the statement in a loop indefinitely.
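
If there are many of these errors, a quick way to see which databases are generating them is to pull the "Default database" field out of the error lines. A rough sketch against the same MySQL error log (XXX.log above); the exact message format can vary by MariaDB version:

grep -o "Default database: '[^']*'" XXX.log | sort | uniq -c | sort -rn | head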

You will need to find the application that has this database in its environment and stop it:

cf target -o "my_org"  -s "my_space"
for i in `cf apps | awk 'NR > 4 { print $1 }'`; do echo $i; cf env $i | grep the_db_you_are_looking_for; done

This will iterate through all the apps you have access to in the org/space, skip the header lines in the cf apps output, and, for each app, print its name and show only the environment entries that reference the database you are looking for.
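
Once you have identified the offending application, stop it so that it no longer generates failed transactions (the app name below is a placeholder):

cf stop my-offending-app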

Resolution

To recover from this situation, the Galera logs need to be moved out of the MySQL data directory on the MySQL node. You can attempt to move them by using a combination of commands such as:

cd /var/vcap/store
mkdir mysql_backup
find mysql | grep GRA > mygrafiles
for i in `cat mygrafiles`; do mv $i mysql_backup; done

However, this will run in the foreground and could take a very long time, and there is a risk that the terminal session will time out and kill the task (one way to mitigate this is shown in the sketch after the command below). You would also have to open a second terminal session to the MySQL node and count the number of files in the backup directory to monitor progress:

find mysql_backup | wc -l
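
To reduce the risk of a dropped SSH session killing the move, you can instead run it in the background under nohup. A minimal sketch, building on the mygrafiles list created above; the log file name is arbitrary:

cd /var/vcap/store
nohup bash -c 'while read -r f; do mv "$f" mysql_backup/; done < mygrafiles' > move_gra.log 2>&1 &

You can then follow move_gra.log, or count the files in mysql_backup as above, to track progress.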

If that procedure is unworkable, there is a riskier alternative that accomplishes the same thing much faster.

Confirm the cluster state before doing anything. It is critical that the other two MySQL cluster nodes are up and in sync.  

Do not proceed any further if that is not the case.

1. Check the size of the MySQL directory - that could take several minutes if the directory has many files:

du -sh /var/vcap/store/mysql

2. Stop the MySQL database:

monit stop all

3. Confirm it is stopped with:

monit summary

4. Rename the MySQL directory to something like mysql_backup:

cd /var/vcap/store
mv mysql mysql_backup

5. From the Operations Manager, run the BOSH command to start the MySQL node - for example mysql/0:

bosh start mysql/0

6. Depending on which version of P-MySQL you are running, this BOSH task may time out before the MySQL node finishes getting its updates from a good node. If the task finishes with an error that the service is not running, just bosh ssh back to the MySQL node and monitor the services with:

monit summary

Wait for all of the jobs to report running. You can also watch /var/vcap/sys/log/mysql/mysql.err.log for messages showing that the node has synced up.
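
One way to monitor this from the node is to poll monit and follow the error log for Galera state-transfer messages. The exact log strings vary by MariaDB/Galera version, so treat the patterns below as a starting point:

# In one session, poll the job states every 10 seconds:
while true; do monit summary; sleep 10; done

# In another session, follow the Galera messages in the error log:
tail -f /var/vcap/sys/log/mysql/mysql.err.log | grep --line-buffered -iE 'state transfer|synced|ready for connections'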

Impact

If the node does not join the cluster, there may be more serious issues that need further investigation. Please contact Pivotal Support in that scenario.