This is for PAS versions 2.2 and greater.
PAS components - Cloud Controller (CC), Diego's Bulletin Board System (BBS), UAA, etc. - cannot open connections to the MySQL database. All MySQL-dependent operations, such as cf login, cf push, and cf apps, become unstable or partly unavailable.
On the PAS MySQL proxy VM(s), the proxy job logs messages to the proxy.combined.log indicating there are "too many open files":
{"timestamp":"1547776175.174618959","source":"/var/vcap/packages/proxy/bin/proxy","message":"/var/vcap/packages/proxy/bin/proxy.Error accepting client connection","log_level":2,"data":{"error":"accept tcp 0.0.0.0:3306: accept4: too many open files"}}
On the PAS MySQL server VM(s), messages are logged to the mysql.err.log indicating "Lock wait timeout exceeded; try restarting transaction". For example:
2019-01-18 1:50:27 139719575038720 [ERROR] mysqld: Lock wait timeout exceeded; try restarting transaction
On the PAS Diego database VM(s), locket logs "failed-to-fetch-lock" messages to locket.stdout.log indicating:
Error 1205: Lock wait timeout exceeded; try restarting transaction
Note: In the worst case, the PAS MySQL proxy ephemeral disk can be filled up with logs in a few minutes.
max_connections is also configured on the MySQL server and set to 1500. Components report that they "can't open connections" when they try to establish a new connection to PAS MySQL.
There are several PAS components that have been seen to adversely affect the PAS MySQL database through their use of the database. These issues have been addressed in various PAS versions, so upgrading to the latest version is highly recommended. If you encounter issues like those described here, where the PAS database appears to be overloaded, it can be useful to query the database to determine the number of connections held by each component.
One way to do this is to log in to the PAS MySQL server node that is taking all of the traffic from the proxies, capture the processlist, and then analyze it to determine usage by the PAS components. To capture the processlist, run the following command on the MySQL node taking all traffic:
mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf -e 'SHOW PROCESSLIST;' > proclist.txt
This file could then be analyzed using the command:
grep Sleep proclist.txt | awk -F $'\t' '{print $4}' | sort | uniq -c | sort -nr
For example:
$ grep Sleep proclist.txt | awk -F $'\t' '{print $4}' | sort | uniq -c | sort -nr
24 diego
13 autoscale
10 credhub
10 ccdb
8 NULL
7 locket
6 notifications
6 networkpolicyserver
2 silk
2 routing
2 app_usage_service
1 uaa
1 nfsvolume
1 monitor
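The pipeline can be checked end to end against a synthetic capture. In SHOW PROCESSLIST output the tab-separated columns are Id, User, Host, db, Command, Time, State, and Info, so the fourth field is the database name and "Sleep" in the Command column marks an idle connection. A self-contained demo (the hostnames and connection rows below are made up for illustration):

```shell
# Build a small fake processlist capture with the same tab-separated
# layout as `SHOW PROCESSLIST` output: Id, User, Host, db, Command, Time, State, Info.
proclist="$(mktemp)"
{
  printf '1\tdiego\t10.0.0.10:45678\tdiego\tSleep\t10\t\tNULL\n'
  printf '2\tdiego\t10.0.0.10:45680\tdiego\tSleep\t12\t\tNULL\n'
  printf '3\tccadmin\t10.0.0.11:45682\tccdb\tSleep\t3\t\tNULL\n'
  printf '4\tuaa\t10.0.0.12:45684\tuaa\tQuery\t0\texecuting\tSELECT 1\n'
} > "$proclist"

# Count idle (Sleep) connections per database, busiest first.
summary="$(grep Sleep "$proclist" | awk -F $'\t' '{print $4}' | sort | uniq -c | sort -nr)"
echo "$summary"
```

Here the active uaa query is excluded, and the summary lists diego first with 2 idle connections, then ccdb with 1, mirroring the real output above.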
Pivotal R&D is actively working on fixes to the issues that impacted PAS v2.2, v2.3, and v2.4. The improvements are planned to be available in PAS v2.2.13 and above, v2.3.8 and above, and v2.4.4 and above.
Until these patches are released, please apply one of the following resolutions:
Resolution 1: For an urgent case, such as the foundation being broken with the error message "too many open files":
1. bosh ssh into each MySQL proxy VM, then sudo -i to the root user.
2. Run: monit stop all
3. Clear the accumulated logs under /var/vcap/sys/logs/proxy
4. Run: monit start all
Note: The above procedure will not resolve the root problem. Please follow resolution 2 or 3 to prevent the "too many open files" error.
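The log cleanup step can be sketched as follows. This is a demo, not the exact KB procedure: LOG_DIR defaults to a scratch directory so the logic can be verified off-VM; on a real MySQL proxy VM it would be /var/vcap/sys/logs/proxy, run between monit stop all and monit start all.

```shell
# Scratch directory stands in for /var/vcap/sys/logs/proxy in this demo.
LOG_DIR="${LOG_DIR:-$(mktemp -d)}"
echo "accept4: too many open files" > "$LOG_DIR/proxy.combined.log"  # demo log file

# Truncate rather than delete: if a process still holds the file
# descriptor, deleting the file would not release the disk space.
find "$LOG_DIR" -name '*.log' -exec truncate -s 0 {} +
ls -l "$LOG_DIR"
```

Truncating to zero keeps the file in place (so monit-managed jobs can keep writing to it) while immediately freeing the ephemeral disk.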
Resolution 2: Increase the proxy's open file limit and the MySQL server's max_connections.
1. On each MySQL proxy VM, edit /var/vcap/jobs/proxy/bin/proxy-ctl, replacing ulimit -n 3000 with ulimit -n 20000, which supports around 10,000 concurrent connections.
2. Run: monit restart proxy
3. Verify the new limit with: grep "Max open files" /proc/$(cat /var/vcap/sys/run/proxy/proxy.pid)/limits
The output should show 20000 for "Max open files".
4. On the MySQL server VM, connect with: mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf
5. Run the following statements, then exit:
SHOW VARIABLES LIKE "max_connections";
SET GLOBAL max_connections = 5000;
SHOW VARIABLES LIKE "max_connections";
Note: This configuration will be overwritten when the VMs are recreated.
Resolution 3: Reduce max_idle_connections for silk-controller, policy-server, and policy-server-internal. By default, max_idle_connections is set to 200. Change max_idle_connections to 50 in each job's configuration file, then run monit restart for each job.
On all diego_database VMs, edit the following:
silk-controller - /var/vcap/jobs/silk-controller/config/silk-controller.json
policy-server - /var/vcap/jobs/policy-server/config/policy-server.json
policy-server-internal - /var/vcap/jobs/policy-server-internal/config/policy-server-internal.json
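The edit to each config file can be sketched as a simple substitution. This demo assumes the key is literally named max_idle_connections at its default value of 200 (an assumption about the JSON layout); CONFIG defaults to a seeded scratch file here, and on a diego_database VM it would point at one of the paths listed above instead.

```shell
# Scratch config file stands in for e.g. silk-controller.json in this demo.
CONFIG="${CONFIG:-$(mktemp)}"
[ -s "$CONFIG" ] || echo '{"max_idle_connections": 200}' > "$CONFIG"

# Lower the idle-connection pool from the default 200 to 50.
sed -i 's/"max_idle_connections": *200/"max_idle_connections": 50/' "$CONFIG"
cat "$CONFIG"
```

After editing the real files, run monit restart for each affected job so the new value takes effect.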