Application deployments are failing intermittently with response code 500 and Internal Server error
search cancel

Application deployments are failing intermittently with response code 500 and Internal Server error

book

Article ID: 416458

calendar_today

Updated On:

Products

VMware Tanzu Application Service

Issue/Introduction

Running 'cf push' and/or other cf cli commands (e.g., cf curl, cf apps, etc.) are failing intermittently with response code 500 and Internal Server error.  

Example of the error output after running the cf command:

Unexpected Response
Response Code: 500
Request ID:    xxx-yyy-zz
Code: 0, Title: , Detail: Internal server error
FAILED

 

The cloud_controller_ng job log file (cloud_controller_ng.log) shows these failures, which indicate that there is a problem with MySQL database connectivity.

{"timestamp":"2025-10-22T07:39:31.789616283Z","message":"Request failed: 500: {\"errors\"=\u003e[{\"title\"=\u003e\"UnknownError\", \"detail\"=\u003e\"An unknown error occurred.\", \"code\"=\u003e10001, \"test_mode_info\"=\u003e{\"detail\"=\u003e\"Mysql2::Error::ConnectionError: Lost connection to server during query\", \"title\"=\u003e\"CF-DatabaseError\", \"backtrace\"=\u003e[\"/var/vcap/data/packages/cloud_controller_ng/5ea87a46f6ed7830ec8fee5b50e44e977101639c/gem_home/ruby/3.2.0/gems/mysql2-0.5.6/lib/mysql2/client.rb:151:in `_query'\",...

{"timestamp":"2025-10-22T07:39:32.898709623Z","message":"Request failed: 500: {\"error_code\"=\u003e\"UnknownError\", \"description\"=\u003e\"An unknown error occurred.\", \"code\"=\u003e10001, \"test_mode_info\"=\u003e{\"description\"=\u003e\"MySQL client is not connected\", \"error_code\"=\u003e\"CF-Error\", \"backtrace\"=\u003e[\"/var/vcap/data/packages/cloud_controller_ng/5ea87a46f6ed7830ec8fee5b50e44e977101639c/gem_home/ruby/3.2.0/gems/sequel-5.95.1/lib/sequel/adapters/mysql2.rb:303:in `escape'\",...

{"timestamp":"2025-10-22T07:39:33.489456235Z","message":"Request failed: 500: {\"description\"=\u003e\"Database error\", \"error_code\"=\u003e\"CF-DatabaseError\", \"code\"=\u003e10011, \"test_mode_info\"=\u003e{\"description\"=\u003e\"Database error\", \"error_code\"=\u003e\"CF-DatabaseError\", \"backtrace\"=\u003e[\"/var/vcap/data/packages/cloud_controller_ng/5ea87a46f6ed7830ec8fee5b50e44e977101639c/cloud_controller_ng/app/controllers/base/base_controller.rb:88:in `rescue in dispatch'\",...

 

The mysql instances are all up and running.  Running the mysql-diag from the mysql_proxy instances also show that the mysql servers are fine.  However, there are a lot of slow queries being logged in the mysql_slow_query log file (found in the log directory of the active mysql instance).  Also, the CPU utilization of the active mysql instance is very high.

Environment

VMware Tanzu Application Service

Cause

The MySQL server was getting overwhelmed with requests from various components and its CPU utilization was very high.  The vm type of the MySQL instance could be too small for the current workload.

It is possible that there is a SQL query that is being run repetitively and could be due to an unintended unusual activity from a rogue process. 

Resolution

Review the mysql_slow_query logs.  Use a tool such as "pt-query-digest" to see if there is anything that is hammering the DB.  

Example usage of "pt-query-digest" inside the mysql VM:

$ apt update
$ apt install percona-toolkit
$ cd /var/vcap/sys/log/pxc-mysql
$ pt-query-digest mysql_slow_query*

# 90.3s user time, 1.9s system time, 84.33M rss, 89.70M vsz
# Current date: Thu Oct 30 17:14:36 2025
# Hostname: xxx
# Files: mysql_slow_query, mysql_slow_query.1, mysql_slow_query.2, mysql_slow_query.3, mysql_slow_query.4, mysql_slow_query.5
# Overall: 336.21k total, 30 unique, 2.11 QPS, 30.43x concurrency ________
# Time range: 2025-10-22T20:13:45 to 2025-10-24T16:31:45
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time        4853122s     10s     73s     14s     20s      3s     13s
# Lock time        4823238s       0     42s     14s     20s      4s     13s
# Rows sent         91.93k       0   7.55k    0.28       0   42.09       0
# Rows examine       6.30M       0 324.55k   19.66    0.99   1.75k    0.99
# Rows affecte     338.83k       0    1000    1.03    0.99    6.44    0.99
# Bytes sent       232.02M      11  22.68M  723.64   51.63 111.08k   51.63
# Merge passes           7       0       1    0.00       0    0.00       0
# Tmp tables           210       0      11    0.00       0    0.06       0
# Tmp disk tbl           0       0       0       0       0       0       0
# Tmp tbl size      63.75M       0  14.38M  198.83       0  31.58k       0
# Query size        61.16M       6  47.23k  190.75  183.58  279.53  183.58
# InnoDB:
# IO r bytes       536.47M       0 115.20M   1.63k       0 325.76k       0
# IO r ops          33.53k       0   7.20k    0.10       0   20.38       0
# IO r wait           166s       0     39s   492us       0    94ms       0
# pages distin       2.98M       1   7.99k    9.28    8.91   73.19    7.70
# queue wait             0       0       0       0       0       0       0
# rec lock wai     4823932s       0     42s     14s     20s      4s     13s
# Boolean:
# Filesort       0% yes,  99% no
# Filesort on    0% yes,  99% no
# Full scan      0% yes,  99% no
# Tmp table      0% yes,  99% no

# Profile
# Rank Query ID                    Response time      Calls  R/Call  V/M
# ==== =========================== ================== ====== ======= =====
#    1 0xDDEFDF2EE4728370EFB49F... 4847184.8663 99.9% 335792 14.4351  0.84 UPDATE users
#    2 0xFE718F42D2CE92BF3EA004...    2766.2305  0.1%    191 14.4829  0.93 SELECT apps
#    3 0xEE68700C08A7915B8A40E6...     899.1293  0.0%     74 12.1504  0.25 SELECT oauth_client_details
#    4 0x874081B88B3A0513F07965...     456.2969  0.0%     17 26.8410  6.91 SELECT events
#    5 0xBAD335EE7339CA94F87194...     329.4343  0.0%     27 12.2013  0.36 SELECT desired_lrps actual_lrps
#    6 0x4469751F48DF90DDDA20CE...     203.1477  0.0%     14 14.5105  0.86 SELECT UNION routes spaces route_shares spaces spaces_developers spaces_managers spaces_auditors spaces_supporters organizations_managers organizations_auditors spaces spaces_developers spaces_managers spaces_auditors spaces_supporters organizations_managers organizations_auditors
#    7 0x78EF56B6A1A3675F1FCE74...     188.8896  0.0%     15 12.5926  0.28 DELETE events
#    8 0x809B122EA77055CE4049A2...     166.4541  0.0%     14 11.8896  1.22 SELECT processes
# MISC 0xMISC                          927.4485  0.0%     66 14.0522   0.0 <22 ITEMS>

# Query 1: 3.01 QPS, 43.50x concurrency, ID 0xDDEFDF2EE4728370EFB49FD1957A3DB4 at byte 9346767
# Scores: V/M = 0.84
# Time range: 2025-10-23T09:34:28 to 2025-10-24T16:31:45
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         99  335792
# Exec time     99 4847185s     10s     42s     14s     20s      3s     13s
# Lock time     99 4820439s      4s     42s     14s     20s      3s     13s
# Rows sent      0       0       0       0       0       0       0       0
# Rows examine   5 327.92k       1       1       1       1       0       1
# Rows affecte  95 324.18k       0       1    0.99    0.99    0.11    0.99
# Bytes sent     7  16.75M      52      79   52.31   51.63    2.62   51.63
# Merge passes   0       0       0       0       0       0       0       0
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size    98  60.20M     188     188     188     188       0     188
# InnoDB:
# IO r bytes     0       0       0       0       0       0       0       0
# IO r ops       0       0       0       0       0       0       0       0
# IO r wait      0       0       0       0       0       0       0       0
# pages distin  86   2.59M       5      14    8.08    8.91    0.48    7.70
# queue wait     0       0       0       0       0       0       0       0
# rec lock wai  99 4821133s      4s     42s     14s     20s      3s     13s
# String:
# Databases    uaa
# Hosts        x.x.x.x (124402/37%)... 2 more
# InnoDB trxID 1A5407CF3 (1/0%), 1A5407DDE (1/0%)... 335790 more
# Last errno   0 (331964/98%), 1213 (3828/1%)
# Users        xxxxx
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s
#  10s+  ################################################################
# Tables
#    SHOW TABLE STATUS FROM `uaa` LIKE 'users'\G
#    SHOW CREATE TABLE `uaa`.`users`\G
update users set previous_logon_success_time = last_logon_success_time, last_logon_success_time = 1761231622043 where id = '9999xxxx' and identity_zone_id='uaa'\G
# Converted for EXPLAIN
# EXPLAIN /*!50100 PARTITIONS*/
select  previous_logon_success_time = last_logon_success_time, last_logon_success_time = 1761231622043 from users where  id = '9999xxxx' and identity_zone_id='uaa'\G

# Query 2: 0.00 QPS, 0.03x concurrency, ID 0xFE718F42D2CE92BF3EA004B39E621992 at byte 19690211
# Scores: V/M = 0.93
# Time range: 2025-10-23T09:35:14 to 2025-10-24T15:35:18
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0     191
# Exec time      0   2766s     10s     27s     14s     21s      4s     13s
# Lock time      0   2758s     10s     27s     14s     21s      4s     13s
# Rows sent      0     191       1       1       1       1       0       1
# Rows examine   0     191       1       1       1       1       0       1
# Rows affecte   0       0       0       0       0       0       0       0
# Bytes sent     0 404.38k   2.12k   2.12k   2.12k   2.12k       0   2.12k
# Merge passes   0       0       0       0       0       0       0       0
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size     0  11.38k      61      61      61      61       0      61
# InnoDB:
# IO r bytes     0       0       0       0       0       0       0       0
# IO r ops       0       0       0       0       0       0       0       0
# IO r wait      0       0       0       0       0       0       0       0
# pages distin   0     198       1       2    1.04    0.99    0.18    0.99
# queue wait     0       0       0       0       0       0       0       0
# rec lock wai   0   2758s     10s     27s     14s     21s      4s     13s
# String:
# Databases    ccdb
# Hosts        x.x.x.x (72/37%), y.y.y.y (61/31%)... 1 more
# InnoDB trxID 1A540D736 (1/0%), 1A540D771 (1/0%)... 189 more
# Last errno   0
# Users        zzzzz
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s
#  10s+  ################################################################
# Tables
#    SHOW TABLE STATUS FROM `ccdb` LIKE 'apps'\G
#    SHOW CREATE TABLE `ccdb`.`apps`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT * FROM `apps` WHERE (`id` = 167779) LIMIT 1 FOR UPDATE\G

# Query 3: 0.00 QPS, 0.01x concurrency, ID 0xEE68700C08A7915B8A40E674828759FD at byte 10860169
.....

 

The above example report shows that there is 1 query that is standing out compared to the other queries.  In the report, it shows that there have 335,792 calls of that query (Query 1).  The report contains the actual query as well:

update users set previous_logon_success_time = last_logon_success_time, last_logon_success_time = 1761231622043 where id = '9999xxxx' and identity_zone_id='uaa'\G

 

From the above example, it can be inferred that the query is coming from the UAA job, and the id `9999xxxx` can then be searched from the UAA logs.  In this example, the id 9999xxxx is traced to a user 'TESTUSER" and the logins are coming from the IP address shown as the value of 'remoteAddress'.  

$ grep 9999xxxx /var/vcap/sys/log/uaa/uaa_events*
[2025-10-30T18:07:31.839264Z] uaa - 12 [https-jsse-nio-8443-exec-1978] - [xxx] ....  INFO --- Audit: IdentityProviderAuthenticationSuccess ('TESTUSER'): principal=9999xxxx, origin=[remoteAddress=10.x.y.z, clientId=cf], identityZoneId=[uaa], authenticationType=[ldap]
[2025-10-30T18:07:31.839434Z] uaa - 12 [https-jsse-nio-8443-exec-1978] - [xxx] ....  INFO --- Audit: UserAuthenticationSuccess ('TESTUSER'): principal=9999xxxx, origin=[remoteAddress=10.x.y.z, clientId=cf], identityZoneId=[uaa]
[2025-10-30T18:07:31.910545Z] uaa - 12 [https-jsse-nio-8443-exec-1978] - [xxx] ....  INFO --- Audit: TokenIssuedEvent ('["cloud_controller.write","openid","scim.read","uaa.user","cloud_controller.read","password.write","scim.write"]'): principal=9999xxxx, origin=[client=cf, user=TESTUSER], identityZoneId=[uaa]
...

 

The logins are seen to be happening multiple times per minute, which explains the volume of the SQL queries.  The first command below shows the username and how many times it was logged to have successfully authenticated.  The second command shows the unique remoteAddress values and the corresponding count.

$ grep 9999xxxx /var/vcap/sys/log/uaa/uaa_events* | grep UserAuthenticationSuccess | awk '{print $13}' | uniq -c
 414068 ('TESTUSER'):
$ grep 9999xxxx /var/vcap/sys/log/uaa/uaa_events* | grep UserAuthenticationSuccess | awk -F= '{print $4}' | awk -F, '{print $1}' | sort | uniq -c
 137977 10.x.y.z
 130807 10.x.y.a
 145284 10.x.y.b
$

 

Now, that the user account and source has been identified, then mitigation should be carried out accordingly.  Once the unusual repetitive logins are stopped, then the MySQL performance (CPU usage and slow queries) can be re-assessed.

In case that it is deemed that the MySQL instance is underpowered, then the VM type can be changed to a larger one with more CPUs and more RAM.