App staging fails with Connection refused - connect(2) for "bbs.service.cf.internal" port 8889
Article ID: 297959


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

This article applies to all VMware Tanzu Application Service (TAS) for VMs (Regular and Small Footprint) deployments. 

Versions Impacted: TAS 2.3.x, TAS 2.4.x, TAS 2.5.0 - 2.5.12, TAS 2.6.0 - 2.6.7, TAS 2.7.0 - 2.7.1

App staging fails with a 503 error:
Server error, status code: 503, error code: 170015, message: Runner is unavailable: Process Guid: <App-Instance-GUID>: Connection refused - Connection refused - connect(2) for "bbs.service.cf.internal" port 8889 (bbs.service.cf.internal:8889)

In bbs.stdout.log, we see the following error:
{"timestamp":"2019-10-01T15:56:36.524833203Z","level":"error","source":"bbs","message":"bbs.locket-lock.lost-lock","data":{"duration":15000212109,"error":"rpc error: code = 4 desc = context deadline exceeded","lock":{"key":"bbs","owner":"bb26bd04-0ae6-45bf-ad97-42751433fee4","type":"lock","type_code":1},"request-uuid":"c3cf1ac7-cb96-4c6c-465e-ba332ff71398","session":"4","ttl_in_seconds":15}}

After checking the MySQL for VMware Tanzu slow query logs, we found a large number of aborted connections for the locket and silk databases.

In the MySQL logs, we see the following error:
InnoDB: Difficult to find free blocks in the buffer pool (1587 search iterations)! 0 failed attempts to flush a page! Consider increasing the buffer pool size. It is also possible that in your Unix version fsync is very slow, or completely frozen inside the OS kernel. Then upgrading to a newer version of your operating system may help. Look at the number of fsyncs in diagnostic info below. Pending flushes (fsync) log: 1; buffer pool: 0. 1154586110 OS file reads, 328304199 OS file writes, 30369677 OS fsyncs. Starting InnoDB Monitor to print further diagnostics to the standard output.
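As a reference point, the pxc-mysql job on the TAS MySQL VMs writes its logs under /var/vcap/sys/log/pxc-mysql/ following the usual BOSH convention (exact file names vary by version), so the aborted connections can be counted there. A sketch, using the example deployment name from the steps below:

# SSH to a MySQL instance of the TAS deployment (the instance group is named
# "mysql" for TAS and "database" for Small Footprint TAS)
bosh -d cf-5023942340 ssh mysql/0

# Count "Aborted connection" messages in the MySQL error log; the file name
# below is typical for the pxc-mysql job but may differ in your version
sudo grep -c "Aborted connection" /var/vcap/sys/log/pxc-mysql/mysql.err.log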

 

Cause

The TAS tile hard-codes the InnoDB buffer pool size for its internal MySQL hosts at 2 GB, which is too small, especially for a larger platform.
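To confirm the current value on an affected node, you can query the running server. This is a sketch that assumes you have working mysql client credentials on the TAS MySQL VM:

# Show the active buffer pool size; 2147483648 bytes = 2 GB, the hard-coded
# value described above
mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"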

Environment

Product Version: 2.4

Resolution

For the impacted versions of TAS listed below, follow the workaround described to remove the hard-coded innodb_buffer_pool_size value of 2 GB.

Versions Impacted: TAS 2.3.x, TAS 2.4.x, TAS 2.5.0 - 2.5.12, TAS 2.6.0 - 2.6.7, TAS 2.7.0 - 2.7.1


Workaround

1. Create a YAML ops file to remove the hard-coded limit:
a. For TAS:
cat >> remove_hardcoded_buffer_pool_limit.yml << EOF
---
- type: remove
  path: /instance_groups/name=mysql/jobs/name=pxc-mysql/properties/engine_config/innodb_buffer_pool_size
EOF

b. For Small Footprint TAS:
cat >> remove_hardcoded_buffer_pool_limit.yml << EOF
---
- type: remove
  path: /instance_groups/name=database/jobs/name=pxc-mysql/properties/engine_config/innodb_buffer_pool_size
EOF

2. Log in to the BOSH Director for the affected environment using the BOSH CLI.
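If the CLI is not already targeting the Director, a typical login from the Ops Manager VM looks like the sketch below; the alias name and Director IP are examples, so substitute the values for your environment (credentials are typically found under the BOSH Director tile's Credentials tab in Ops Manager):

# Alias the BOSH Director and authenticate against it
bosh alias-env my-env -e 10.0.0.5 --ca-cert /var/tempest/workspaces/default/root_ca_certificate
bosh -e my-env log-in

# Export the alias so the commands in the following steps can omit -e
export BOSH_ENVIRONMENT=my-env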

3. Run bosh deployments to get the exact name of the affected Cloud Foundry deployment. The name should look something like cf-5023942340.

4. Run bosh -d NAME-FROM-STEP-3 manifest > cf-manifest.yml to save the current deployment manifest locally.
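Optionally, you can preview the effect of the ops file against the downloaded manifest before deploying; if the removal applies cleanly, the command below prints nothing:

# Interpolate the manifest with the ops file applied and confirm the
# innodb_buffer_pool_size property is gone
bosh int cf-manifest.yml -o remove_hardcoded_buffer_pool_limit.yml | grep innodb_buffer_pool_size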

5. Run bosh -d NAME-FROM-STEP-3 deploy cf-manifest.yml -o remove_hardcoded_buffer_pool_limit.yml. You should see output similar to the following (the instance group is named mysql for TAS and database for Small Footprint TAS):
Using deployment 'NAME-FROM-STEP-3'

  instance_groups:
  - name: database
    jobs:
    - name: pxc-mysql
      properties:
        engine_config:
-         innodb_buffer_pool_size: "<redacted>"

Continue? [yN]:

6. Ensure there are no changes other than the removal of the innodb_buffer_pool_size property shown above, then type y and press Enter. The deploy will update only the TAS MySQL instances.
 
Note: This change is not persistent and will be reverted by the next Apply Changes.


Permanent Fix

The hard-coded 2 GB buffer pool size is removed in the following versions, where innodb_buffer_pool_size instead defaults to 50% of the RAM of the TAS MySQL node: TAS 2.5.13, TAS 2.6.8, and TAS 2.7.2.
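After upgrading to one of these versions, the new default can be checked the same way as in the Cause section above; for example, on a MySQL node with 16 GB of RAM you should see a value of roughly 8589934592 bytes (8 GB):

# Assumes working mysql client credentials on the TAS MySQL VM
mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"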