VMware does
not recommend scaling based on HTTP throughput.
Throughput scaling has some drawbacks, so VMware recommends HTTP latency scaling instead. Latency scaling is generally more accurate because latency increases as load and work increase, and you can define latency in terms of business goals that are decoupled from the resources available on the platform.
For example, suppose you have a business goal of application response time under 300ms. To help meet this goal, you can add an autoscaling rule that scales up additional application instances when latency reaches 250ms. The additional instances handle more of the incoming requests, which generally lowers latency before your business goal is impacted.
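The rule above amounts to a threshold check that fires below the business goal. A minimal sketch of that decision logic (the function name, thresholds, and instance limit here are illustrative assumptions, not the Autoscaler's API):

```python
# Sketch of a latency-threshold scale-up decision. The 250ms trigger
# sits below the 300ms business goal so new instances come online
# before the goal is breached.
LATENCY_GOAL_MS = 300   # business goal: responses under 300ms
SCALE_UP_AT_MS = 250    # autoscaling rule threshold

def desired_instances(current_instances: int, p95_latency_ms: float,
                      max_instances: int = 10) -> int:
    """Return the instance count after applying the scale-up rule."""
    if p95_latency_ms >= SCALE_UP_AT_MS and current_instances < max_instances:
        return current_instances + 1
    return current_instances

print(desired_instances(3, 260.0))  # latency past threshold -> 4
print(desired_instances(3, 180.0))  # under threshold -> stays 3
```

Triggering at 250ms rather than 300ms leaves headroom for the new instances to start before the goal itself is violated.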
HTTP latency based scaling is not perfect, however. In some cases, such as with microservices, latency can be more a factor of downstream dependencies (i.e. other microservices) than of the application itself. A slow downstream dependency increases the latency of your application as well, yet scaling up your application will not help, because the downstream dependency is what really needs to be scaled up or improved.
Latency added by other external factors, such as network congestion or database performance, can also cause problems for HTTP latency based scaling, because latency from these sources will not be reduced by adding application instances.
VMware Autoscaler also provides functionality to support scaling based on custom metrics. If HTTP latency based scaling does not fit your use case, your application can expose custom metrics for the purpose of triggering autoscaling rules. These rules can be more specific to your business use case and provide better signals on which to autoscale your application. They can also be combined with HTTP latency based scaling rules to improve the accuracy of your scaling rules.
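As a rough illustration of what a custom metric might look like, an application could expose a business-level signal such as work-queue backlog per worker. This is a hypothetical sketch; the metric names and the JSON shape are assumptions, not a format the Autoscaler requires:

```python
import json

def custom_metrics(queue_depth: int, active_workers: int) -> str:
    """Serialize an application-specific metric that an autoscaler
    could poll as a scaling signal. Backlog per worker often tracks
    load more directly than HTTP latency does."""
    backlog_per_worker = queue_depth / max(active_workers, 1)
    return json.dumps({
        "queue_depth": queue_depth,
        "backlog_per_worker": round(backlog_per_worker, 2),
    })

print(custom_metrics(120, 4))
```

A metric like this reflects pending work directly, so it can rise before latency does and is unaffected by slow downstream dependencies.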
Throughput scaling is simple in concept: you count the number of requests hitting the app and scale up if that count exceeds a threshold during a defined period of time. In practice, it is not that simple.
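The counting itself can be sketched as a sliding window over request timestamps (a minimal illustration, not the Autoscaler's implementation; the class and parameter names are made up):

```python
from collections import deque

class ThroughputRule:
    """Count requests in a sliding time window and signal a scale-up
    when the count exceeds the configured rate. Illustrative only."""

    def __init__(self, max_rps: float, window_s: float = 60.0):
        self.max_requests = max_rps * window_s  # threshold per window
        self.window_s = window_s
        self.timestamps = deque()

    def record(self, now: float) -> bool:
        """Record one request at time `now` (seconds); return True if
        throughput has exceeded the threshold, i.e. scale up."""
        self.timestamps.append(now)
        # Drop requests that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window_s:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_requests
```

Even in this toy form, the hard part is not the counting but choosing `max_rps`, which is where the points below come in.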
Here are some points you may wish to consider before using throughput-based scaling:
- Not all requests are created equal. If a request to one endpoint takes longer to process or requires more work than a request to a second endpoint, then you won't be able to sustain the same number of requests per second to the slower or more expensive of the two endpoints. This means that you have to estimate how much of your throughput is destined for one endpoint versus another, which gets more complicated and fragile as the number of endpoints increases. Additionally, if your request workload shifts, you need to update those estimates or your app may not scale up before it starts to have problems.
- Understanding when to scale your application up based on throughput requires that you have done some load testing, so that you know how many requests per second your app can handle before you need to scale it up. As your application changes, you need to continue load testing to make sure that your scaling limits have not changed due to code changes in the application.
- Load testing is difficult on Cloud Foundry because applications do not have dedicated CPU resources. Because CPU resources can vary, the number of requests per second that your application can handle will also likely vary. In short, the traffic load you observe during your load tests may not be the same as what you can handle in production.
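The first point above can be made concrete with a little arithmetic: the request rate one instance can sustain depends on the traffic mix across endpoints, so a fixed requests-per-second threshold drifts as the mix shifts. The endpoint names and per-endpoint capacities below are hypothetical load-test numbers:

```python
def sustainable_rps(capacity_rps: dict, traffic_mix: dict) -> float:
    """Max total RPS one instance sustains for a given traffic mix.

    capacity_rps: per-endpoint RPS an instance handles at saturation
    traffic_mix:  fraction of total traffic going to each endpoint
    """
    # Each endpoint consumes (share / capacity) of instance time
    # per unit of total request rate; capacity is the reciprocal.
    load_per_rps = sum(share / capacity_rps[ep]
                       for ep, share in traffic_mix.items())
    return 1.0 / load_per_rps

capacities = {"/search": 50, "/health": 500}  # hypothetical numbers
print(round(sustainable_rps(capacities, {"/search": 0.2, "/health": 0.8}), 1))
print(round(sustainable_rps(capacities, {"/search": 0.8, "/health": 0.2}), 1))
```

With the same two endpoints, shifting traffic from the cheap endpoint to the expensive one cuts the sustainable rate by roughly a factor of three, so a threshold tuned for the first mix would scale far too late under the second.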