Pivotal Application Service internal MySQL proxy reports error "too many open files"
search cancel

Pivotal Application Service internal MySQL proxy reports error "too many open files"

book

Article ID: 293277

calendar_today

Updated On:

Products

VMware Tanzu SQL

Issue/Introduction

Symptoms:

This is for PAS versions 2.2 and greater.

PAS components - CC, Bulletin Board System (BBS), UAA, etc. can't open connections to the MySQL database. All MySQL dependent operations, such as cf login, cf push, cf apps, become unstable or partly unavailable.

On the PAS MySQL proxy VM(s), the proxy job logs messages to the proxy.combined.log indicating there are "too many open files":

{"timestamp":"1547776175.174618959","source":"/var/vcap/packages/proxy/bin/proxy","message":"/var/vcap/packages/proxy/bin/proxy.Error accepting client connection","log_level":2,"data":{"error":"accept tcp 0.0.0.0:3306: accept4: too many open files"}}

On the PAS MySQL server VM(s), messages are logged to the mysql.err.log indicating "Lock wait timeout exceeded; try restarting transaction". For example:

2019-01-18  1:50:27 139719575038720 [ERROR] mysqld: Lock wait timeout exceeded; try restarting transaction

On the PAS Diego Database, "failed-to-fetch-lock" messages are logged to locket's locket.stdout.log indicating:

Error 1205: Lock wait timeout exceeded; try restarting transaction

Note: In the worst case, the PAS MySQL proxy ephemeral disk can be filled up with logs in a few minutes.

Environment


Cause

By default, the PAS internal MySQL proxy is configured with up to 3000 file descriptors, which supports around 1500 connections. In addition, PAS MySQL server max_connections is also being configured and set to 1500.

Since Pivotal Cloud Foundry (PCF) v2.2 release, PAS components, such as Silk-Controller, Policy-Server, Policy-Server-Internal, and Locket, are allowed to keep up to a max 200 idle connections for each component.

On large foundations, 1500 connections on the PAS MySQL proxy or server can be used up. As result, those components will hit the error "can't open connections"when they try to establish a new connection to PAS MySQL.
 

There are several PAS components that have been seen to adversely affect the PAS MySQL database through their use of the database. These issues have been addressed in various PAS versions, so being on the latest version is highly recommended. If encountering issues like those described here where the PAS database appears to be overloaded, it might be useful to query the database to determine the number of connections for each component.

One way to do this would be to login to the PAS mysql server node taking all of the traffic from the proxies, capturing the processlist, then analyzing the processlist to determine usage by the PAS components. To capture the processlist, the following command can be used on the mysql node taking all traffic:

mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf -e 'SHOW PROCESSLIST;' > proclist.txt

This file could then be analyzed using the command:

grep Sleep proclist.txt | awk -F $'\t' '{print $4}' | sort | uniq -c | sort -nr

For example:

$ grep Sleep proclist.txt | awk -F $'\t' '{print $4}' | sort | uniq -c | sort -nr
     24 diego
     13 autoscale
     10 credhub
     10 ccdb
      8 NULL
      7 locket
      6 notifications
      6 networkpolicyserver
      2 silk
      2 routing
      2 app_usage_service
      1 uaa
      1 nfsvolume
      1 monitor

Resolution

Pivotal R&D is actively working on an fixes to the issues that impacted PAS v2.2, v2.3, and v2.4. The improvements are planned to be available in PAS v2.2.13 or above, v2.3.8 or above, and v2.4.4 or above. 

Until these patches are released, please apply one of the following resolutions:

Resolution 1

For an urgent case, such as the the foundation being broken with the error message, “too many open files.

1. bosh ssh into each MySQL proxy VM, sudo -i to root user.

2. Run the command, monit stop all

3. (Optional) If ephemeral disk utilization stays at a high level, delete log files under /var/vcap/sys/logs/proxy

4. Run the command, monit start all

Note: The above procedure will not resolve the root problem. Please follow resolution 2 or 3 to prevent the "too many open files" error.


Resolution 2

For PAS 2.2.x releases

1. On all PAS MySQL proxy VMs:
a. Edit /var/vcap/jobs/proxy/bin/proxy-ctl.
b. Replace ulimit -n 3000 with ulimit -n 20000, which supports around 10000 concurrent connections.
c. Restart proxy with command: monit restart proxy.
d. Confirm the max open files limit is increased. grep "Max open files" /proc/$(cat /var/vcap/sys/run/proxy/proxy.pid)/limits,  output should include "Max open files 20000 files"

2. On one of MySQL server VMs:
a. Login as root: mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf
b. Confirm the configuration before update: SHOW VARIABLES LIKE "max_connections";
c. SET GLOBAL max_connections = 5000; 
d. Confirm the update with SHOW VARIABLES LIKE "max_connections";
e. exit Note: The configuration will be overwritten when recreating the VMs. 

For PAS 2.3.0~2.3.5 releases

1. On Ops Manager > PAS tile, scale MySQL proxy to 7 instances as single instance supports limited connections.

2. On one of MySQL server VMs:
a. Login as root: mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf
b. Confirm the configuration before update: SHOW VARIABLES LIKE "max_connections";
c. SET GLOBAL max_connections = 5000; 
d. Confirm the update with SHOW VARIABLES LIKE "max_connections";
e. Run: exit

Note: The configuration will be overwritten when recreating the VMs. 


For PAS 2.3.6 release

On Ops Manager > PAS tile, scale MySQL proxy to 7 instances. In this release, MySQL server max_connections has been increased to 3500.  

Note: 7 proxy instances is not required in future fixed releases, it can be turned back to 2 instances. 


Resolution 3 (Optional)

Reduce max_idle_connections for Silk-Controller, Policy-Server, Policy-Server-Internal. By default, max_idle_connections is set to 200. 

Change max_idle_connections to 50. Run the command monit restart.

On all diego_database VMs, edit the following: 

  • silk-controller - /var/vcap/jobs/silk-controller/config/silk-controller.json

  • policy-server - /var/vcap/jobs/policy-server/config/policy-server.json

  • policy-server-internal - /var/vcap/jobs/policy-server-internal/config/policy-server-internal.json