TAS MySQL CPU spikes after upgrading from TAS v2.11 to v2.13

Article ID: 298404


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Some large environments are observing Tanzu Application Service (TAS) MySQL VM CPU spikes after upgrading from TAS v2.11 to v2.13. 

When TAS MySQL experiences CPU spikes, it may result in performance degradation such as Apps Manager slowness or cf CLI failures.

Three causes of these TAS MySQL CPU spikes have been observed, and this KB briefly covers each of them. It's important to note that the CPU spikes discussed in this KB are driven by the ccdb client, which is Cloud Controller.

Environment

Product Version: 2.13

Resolution

The Cloud Controller v3 API has been available in TAS for an extended period and has been exercised by the v7 cf CLI since it was released in TAS 2.10. Scale tests have demonstrated that the v3 API can be up to three times faster than equivalent calls on the v2 API used in TAS 2.11.

Reason 1 for TAS MySQL CPU spike
The top-level roles resource (/v3/roles) is newer to the v3 API, and thus did not initially have the same real-world validation as endpoints with direct v2 API equivalents. Although this endpoint went through several query optimizations first appearing in TAS 2.13, it was still found to impact foundations with large database tables. The affected ccdb tables are:
  • spaces_auditors
  • spaces_developers
  • spaces_managers
  • spaces_supporters
  • organizations_auditors
  • organizations_billing_managers
  • organizations_managers
  • organizations_users
This has since been patched. The fix is Generally Available starting in CAPI release v1.127.11, which is packaged with TAS v2.13.16+.
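
To gauge whether a foundation has large role tables, one option is to check their row counts in the ccdb. The sketch below only builds the COUNT statements for the tables listed above; how they are run against the database (client, credentials, schema selection) is left out and depends on the environment. The function name and the `row_count` alias are illustrative, not part of any TAS tooling.

```python
# Table names are taken from the list above; everything else is illustrative.
ROLE_TABLES = [
    "spaces_auditors",
    "spaces_developers",
    "spaces_managers",
    "spaces_supporters",
    "organizations_auditors",
    "organizations_billing_managers",
    "organizations_managers",
    "organizations_users",
]


def role_count_queries(tables=ROLE_TABLES):
    """Return one read-only COUNT(*) statement per role table."""
    return [f"SELECT COUNT(*) AS row_count FROM {table};" for table in tables]


if __name__ == "__main__":
    # Print the statements so they can be pasted into a MySQL session.
    for query in role_count_queries():
        print(query)
```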


Reason 2 for TAS MySQL CPU spike
Apps Manager did not upgrade to fully use the v3 API until TAS 2.13. To provide a comprehensive view of a user's roles, Apps Manager fetches the roles that the user has in a given TAS foundation and, to keep this data up to date, re-fetches them every thirty seconds. When viewing user roles in an organization or space, Apps Manager fetches all roles for all users in that organization or space. These unfiltered queries can produce large response payloads in environments where users have many roles.

The polling logic in Apps Manager did not account for cases where the time to fetch roles exceeded the polling interval. In those cases a second poll was initiated before the first had completed, compounding the API load. With large numbers of user roles, each individual poll returned hundreds of pages of results, and paging through them took longer than the polling interval. A single instance of Apps Manager therefore issued multiple simultaneous fetches, creating a positive feedback loop: the queries saturated the CPU of the Cloud Controller database, which slowed queries further, which in turn increased the number of simultaneous requests from Apps Manager.

This was patched via:
  1. Applying filters to reduce the total number of roles fetched by Apps Manager, thereby speeding up requests and reducing API load.
  2. Preventing Apps Manager from issuing multiple concurrent requests to the API for the same data. This circuit breaker helps avoid positive feedback loops with negative outcomes.

This patch is Generally Available starting in push-apps-manager-release v676.0.7, which is packaged with TAS v2.13.13+.
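
The second part of the fix, refusing to start a new poll while an earlier one is still running, can be illustrated with a small guard. This is a minimal sketch of the general technique, not the Apps Manager implementation (which is not shown in this KB); the class name `SinglePollGuard`, the `skipped` counter, and the simulated fetch are all hypothetical.

```python
import threading
import time


class SinglePollGuard:
    """Runs a fetch function unless a previous run is still in flight."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._in_flight = threading.Lock()
        self.skipped = 0  # number of polls dropped because one was already running

    def poll(self):
        # Non-blocking acquire: if an earlier poll still holds the lock,
        # skip this tick instead of stacking another request on the API.
        if not self._in_flight.acquire(blocking=False):
            self.skipped += 1
            return None
        try:
            return self._fetch()
        finally:
            self._in_flight.release()


if __name__ == "__main__":
    def slow_fetch():
        time.sleep(0.2)  # stands in for a roles fetch that outlasts the interval
        return "roles"

    guard = SinglePollGuard(slow_fetch)
    first = threading.Thread(target=guard.poll)
    first.start()
    time.sleep(0.05)       # the next tick arrives while the first fetch is running
    print(guard.poll())    # expected to be skipped, so this prints None
    first.join()
    print(guard.poll())    # nothing in flight any more, so the fetch runs
```

Skipping the overlapping tick keeps at most one fetch per data set outstanding, so a slow database delays the refresh rather than multiplying the load.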

Reason 3 for TAS MySQL CPU spike
As previously mentioned, Apps Manager did not upgrade to fully use the v3 API until TAS 2.13. This includes other applications within the push-apps-manager-release, such as search-server. The search-server application fetches data from TAS MySQL on behalf of Apps Manager. When a user clicks the search bar in Apps Manager, the following takes place:
  1. Apps Manager requests search-server to fetch all organizations, spaces, apps, and service instances.
  2. Search-server begins a series of CAPI API calls to fetch this data (the following log snippets are taken from a TAS v2.13 environment):
    GET /v3/organizations?page=1
    GET /v3/organizations?page=2
    <continued organizations calls until final page>
    GET /v3/spaces?page=1
    GET /v3/spaces?page=2
    <continued spaces calls until final page>
    GET /v3/apps?page=1
    GET /v3/apps?page=2
    <continued apps calls until final page>
    GET /v3/service_instances?page=1
    GET /v3/service_instances?page=2
    <continued service_instances calls until final page>
    
  3. The data is loaded into search-server and made available to Apps Manager's search functionality to improve the user experience.
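
The request pattern in step 2 can be sketched as a generator that enumerates the paths for each resource type. The resource order and path format follow the log snippets above; the function name and the page totals passed in are hypothetical, and the sketch only lists the requests without saying anything about whether search-server issues them sequentially or in parallel.

```python
from typing import Dict, Iterator

# Resource types in the order they appear in the log snippets.
RESOURCES = ["organizations", "spaces", "apps", "service_instances"]


def request_paths(total_pages: Dict[str, int]) -> Iterator[str]:
    """Yield every paged request path, given the page count per resource."""
    for resource in RESOURCES:
        for page in range(1, total_pages.get(resource, 0) + 1):
            yield f"/v3/{resource}?page={page}"


if __name__ == "__main__":
    # Hypothetical foundation: page totals are made up for illustration.
    example = {"organizations": 2, "spaces": 3, "apps": 4, "service_instances": 5}
    for path in request_paths(example):
        print("GET", path)
```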

Prior to TAS v2.13, search-server used the v3 CAPI API endpoints for all objects except service instances. For service instances it used the v2 CAPI API:
GET /v2/service_instances?page=1
GET /v2/service_instances?page=2

Starting in TAS v2.13, search-server began using the v3 CAPI API for service instance objects. The /v3/service_instances CAPI API endpoint by itself is typically very fast. However, it has been observed that many /v3/service_instances calls in rapid succession can lead to performance degradation and slow queries. This is exactly what happens when a user clicks the search bar in Apps Manager. The search-server application tries to fetch all of the service instances from CAPI, and because it requests only 50 items per page, this leads to several simultaneous /v3/service_instances?page=X calls, which may spike TAS MySQL CPU load on foundations with a large number of service instances (multiple thousands). This GitHub issue may be related.

This has since been patched by allowing search-server to request more than 50 items per page, which greatly reduces the number of concurrent API calls to CAPI. The patch is Generally Available starting in push-apps-manager-release v676.0.11, which is packaged with TAS v2.13.20+. The setting is the API_PER_PAGE environment variable on the search-server and apps-manager applications. It still defaults to 50 but is now configurable up to 5000. At this time, the property is not exposed as configurable in the tile or through platform automation; the plan is to raise the default value in future TAS releases rather than expose it as a configurable property. Until then, the recommendation is to update the environment variable on the search-server and apps-manager applications manually after each push-apps-manager errand run.
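
The effect of raising API_PER_PAGE is simple arithmetic: the number of paged requests is the item count divided by the page size, rounded up. The sketch below shows this for a hypothetical foundation with 10,000 service instances; that figure is an assumption for illustration, while the default of 50 and the maximum of 5000 come from the text above.

```python
DEFAULT_PER_PAGE = 50   # API_PER_PAGE default, per this KB
MAX_PER_PAGE = 5000     # upper limit for API_PER_PAGE, per this KB


def pages_needed(total_items: int, per_page: int = DEFAULT_PER_PAGE) -> int:
    """Return how many paged requests are needed to fetch total_items."""
    if not 1 <= per_page <= MAX_PER_PAGE:
        raise ValueError(f"per_page must be between 1 and {MAX_PER_PAGE}")
    # Integer ceiling division: a partially filled last page still costs a request.
    return (total_items + per_page - 1) // per_page


if __name__ == "__main__":
    service_instances = 10_000  # hypothetical foundation size
    print(pages_needed(service_instances, 50))    # 200 requests
    print(pages_needed(service_instances, 5000))  # 2 requests
```

Under these assumptions, the same search goes from 200 service instance requests to 2.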

Conclusion
TAS v2.13.20+ contains vital Apps Manager and CAPI performance patches for foundations with many users, roles, and service instances.

The following metrics for the TAS MySQL VMs are helpful in tracking performance:
#mysql
Origin: mysql - Name: /mysql/performance/slow_queries

#system
Origin: system_metrics_agent - Name: system_cpu_user
Origin: system_metrics_agent - Name: system_mem_percent
Origin: system_metrics_agent - Name: system_load_1m
Origin: system_metrics_agent - Name: system_load_5m
Origin: system_metrics_agent - Name: system_load_15m