Symptoms:
Customers encounter application crashes and may want to verify the maximum process limit in their Diego cells and containers. Applications that run many threads or processes are more susceptible to problems related to these limits.
The application logs the error ERR runtime: may need to increase max user processes.
Error Message:
Application crash message during cf restart (or cf push):
2018-03-06T10:51:22.76-0500 [CELL/1] OUT Container became unhealthy
2018-03-06T10:51:22.76-0500 [CELL/SSHD/1] ERR runtime: failed to create new OS thread (have 8 already; errno=11)
2018-03-06T10:51:22.76-0500 [CELL/SSHD/1] ERR runtime: may need to increase max user processes (ulimit -u)
2018-03-06T10:51:22.76-0500 [CELL/SSHD/1] ERR fatal error: newosproc
2018-03-06T10:51:22.76-0500 [CELL/SSHD/1] OUT Exit status 0
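To check whether a given application is hitting this limit, you can search its recent logs for the runtime errors shown above. This is a quick sketch using the cf CLI; it only works while the crash output is still in the recent log buffer:

cf logs <app-name> --recent | grep -iE "max user processes|failed to create new OS thread"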
Tanzu Application Service versions 1.12 through 2.0.7 set a process limit of 1024 on application containers. This limit was meant to protect the Diego cell from malicious applications. Most buildpack-based applications will not reach the 1024 limit, but applications that run many threads or processes may. This problem has been fixed in Tanzu Application Service version 2.0.8 and higher, where the limit has been set back to 0 (unlimited).
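As a quick check from inside a running application container, you can print the max user processes value referenced by the error. This is a sketch; it assumes application SSH is enabled for the app, and note that ulimit -u reports the per-user process limit visible inside the container, which is not necessarily the same value as the Garden pid_limit verified with runc further below:

cf ssh <app-name> -c "ulimit -u"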
The following steps can be used to manually adjust pid_limit in the TAS (cf) manifest. Use them with care; this approach is not recommended for production environments.
1. On the Operations Manager VM, run the following command to dump the cf manifest:
ubuntu@bosh-stemcell:~$ bosh2 -e lab12 -d cf-#################### manifest > cf.yml
2. Edit the manifest and insert the field pid_limit: 0 under - name: cloud_controller_ng > properties: > cc: > diego:

ubuntu@bosh-stemcell:~$ vim cf.yml

- name: cloud_controller_ng
  properties:
    cc:
      diego:
+       pid_limit: 0
3. Run a manual deploy of the edited manifest:
ubuntu@bosh-stemcell:~$ bosh2 -e lab12 deploy -d cf-#################### cf.yml
Using environment '10.193.77.11' as user 'director' (bosh.*.read, openid, bosh.*.admin, bosh.read, bosh.admin)
Using deployment 'cf-####################'

instance_groups:
- name: cloud_controller
  jobs:
  - name: cloud_controller_ng
    properties:
      cc:
        diego:
+         pid_limit: "<redacted>"

Continue? [yN]:
The command will run and update the Cloud Controller VMs with this setting.
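If you want to confirm that the property landed in the deployed manifest, you can re-run the manifest dump from step 1 and search for it. A minimal sketch (adjust the grep context lines to taste):

ubuntu@bosh-stemcell:~$ bosh2 -e lab12 -d cf-#################### manifest | grep -B 3 pid_limit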
WARNING: Changes made with a manual BOSH deploy will be overwritten the next time Operations Manager applies changes, for example during an update or upgrade.
4. Restart the problematic applications so that the new pid_limit takes effect:

cf restart <app-name>
Verification of process limit or process count
You can run the following commands to validate that the changes are reflected:
1. Connect to a Diego cell VM:
bosh -e <environment> -d cf-#################### ssh diego_cell/0
or:
ssh vcap@<Diego cell IP> (credentials from PAS > Credentials > Diego > VM Credentials)
2. List the garden LRPs running on the Diego cell:
$ /var/vcap/packages/runc/bin/runc list
ID                             PID    STATUS   BUNDLE                                                     CREATED                          OWNER
########-####-####-####-####   32024  running  /var/vcap/data/garden/depot/########-####-####-####-####   2018-03-07T23:27:31.256036794Z   root
########-####-####-####-####   33103  running  /var/vcap/data/garden/depot/########-####-####-####-####   2018-03-07T23:28:51.026466237Z   root
3. Now you can use runc events --stats to see the pid usage, using jq to pick out those values:
# /var/vcap/packages/runc/bin/runc events --stats ########-####-####-####-#### | jq .data.pids
{
  "current": 18,
  "limit": 1024
}
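If you want to watch pid usage over time (for example while reproducing the crash), a rough sketch using watch is shown below; the 5-second interval is an arbitrary example and <container-id> is a placeholder for one of the IDs from runc list:

# watch -n 5 '/var/vcap/packages/runc/bin/runc events --stats <container-id> | jq .data.pids'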
After restarting applications with pid_limit set to 0, you should see the limit field disappear from the output, which means the change was applied successfully:
{ "id": "########-####-####-####-####", "pids": { "current": 18 } }
4. Here is a one-liner that scans every container and reports the pid usage for each container handle:
# for c in $(/var/vcap/packages/runc/bin/runc list | tail -n+2 | cut -f1 -d' '); do /var/vcap/packages/runc/bin/runc events --stats $c | jq '{id, pids: .data.pids}'; done
{
  "id": "########-####-####-####-####",
  "pids": {
    "current": 18,
    "limit": 1024
  }
}
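If you only want to see containers that are approaching the limit, a hedged variant of the same loop is shown below; the 900 threshold is an arbitrary example value:

# for c in $(/var/vcap/packages/runc/bin/runc list | tail -n+2 | cut -f1 -d' '); do /var/vcap/packages/runc/bin/runc events --stats $c | jq 'select(.data.pids.current > 900) | {id, pids: .data.pids}'; done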
5. If you need to track down an instance of a specific app (for example, to validate that the change works after you cf restart a particular app), you can use cfdot to find it as follows:
# cf app <app-name> --guid        (gets the app guid)
# cf curl /v2/apps/<guid>/stats | grep -i host        (gets the Diego cell IP address)
# app_guid=########-####-####-####-############
# cfdot actual-lrp-groups | grep $app_guid | jq '.instance | {instance_guid, cell_id}'
{
  "instance_guid": "########-####-####-####-####",
  "cell_id": "########-####-####-####-############"
}