HTTP throughput based Autoscaling rules do not fire

Article ID: 297739


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:
When there is a high volume of metric envelopes that Autoscaler must process to determine whether your application should be scaled, the LogCache client may not fetch all of those envelopes. This manifests as autoscaling rules not firing even when, based on external sources, the criteria for those rules appear to have been met.

For example, if you are running a load test on your application and sending a known quantity of requests per second to it, you may observe that autoscaling rules do not fire and that Autoscaler reports the requests per second as significantly lower than the known or actual value.

Environment


Cause

There are three reasons why the LogCache client may not fetch all of these envelopes.

1. The primary cause is a timeout configured on the LogCache client used by Autoscaler. It limits the total time Autoscaler has to walk the results through the LogCache client, and it is currently 5 seconds.

During a load test, or for a high-volume app, the number of envelopes that need to be walked is large and the walk cannot finish in time. For example, if you are sending 150 requests per second, you generate 18,000 timer envelopes over a 2-minute period. Autoscaler can only pull batches of 1,000 envelopes at a time, which means roughly 18 round trips between Autoscaler and LogCache, and all of that fetching and processing must complete within the 5-second timeout. If it does not, only the envelopes processed during that 5-second period are included in the throughput calculation. In some scenarios this leaves large chunks of envelopes out of the calculation, skewing the results (see the sketch after this item).

This issue has been resolved in 2.7.15, 2.8.9 and 2.9.3. If you are running those versions or newer, then this issue does not apply.
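To make the arithmetic above concrete, here is a minimal Go sketch of how a fixed 5-second walk budget limits how many of the expected envelopes can be fetched. The per-batch latency and all other figures are assumptions for illustration only; the real values depend on your application's traffic and your LogCache deployment.

package main

import (
	"fmt"
	"time"
)

func main() {
	// All of these figures are assumptions for illustration only.
	const (
		requestsPerSecond = 150                    // load-test rate from the example above
		windowSeconds     = 120                    // Autoscaler's 2-minute metric window
		batchSize         = 1000                   // envelopes returned per LogCache read
		perBatchLatency   = 400 * time.Millisecond // assumed round trip plus processing time
		walkTimeout       = 5 * time.Second        // LogCache client walk timeout before the fix
	)

	totalEnvelopes := requestsPerSecond * windowSeconds // 18,000 timer envelopes
	roundTrips := (totalEnvelopes + batchSize - 1) / batchSize

	// How many full batches fit inside the walk timeout?
	batchesWithinTimeout := int(walkTimeout / perBatchLatency)
	if batchesWithinTimeout > roundTrips {
		batchesWithinTimeout = roundTrips
	}
	envelopesSeen := batchesWithinTimeout * batchSize

	fmt.Printf("expected envelopes: %d (%d round trips of %d)\n",
		totalEnvelopes, roundTrips, batchSize)
	fmt.Printf("envelopes walked before the %s timeout: %d (%.0f%% of the window)\n",
		walkTimeout, envelopesSeen, 100*float64(envelopesSeen)/float64(totalEnvelopes))
}

With the assumed 400 ms per batch, only 12 of the 18 batches fit inside the 5-second budget, so roughly a third of the window's envelopes are missing from the throughput calculation.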

2. The second issue is ingestion time. If you ask LogCache for everything from the last two minutes up to the current moment, there is no guarantee that you will actually get everything up to the current second. Messages take time to transit Loggregator and to be ingested into LogCache. This creates a delay, and any messages still in transit during that delay period are not included in the calculations performed by Autoscaler (see the sketch after this item).

This issue has been resolved in 2.7.15, 2.8.9 and 2.9.3. If you are running those versions or newer, then this issue does not apply.
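As an illustration of the ingestion-delay problem, the following Go sketch contrasts a query window that ends at "now" with one shifted back by an assumed ingestion allowance, so that everything inside the window has had time to reach LogCache. The 15-second allowance is hypothetical and only shows the shape of the issue; the actual fix shipped in the versions listed above.

package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		window = 2 * time.Minute
		// Assumed allowance for Loggregator transit and LogCache ingestion;
		// the value is illustrative only.
		ingestionDelay = 15 * time.Second
	)

	// Naive window: ends at "now", so envelopes still in transit are silently missing.
	naiveEnd := time.Now()
	naiveStart := naiveEnd.Add(-window)

	// Delay-aware window: shift the whole window back so everything inside it
	// has had time to be ingested into LogCache before it is queried.
	end := time.Now().Add(-ingestionDelay)
	start := end.Add(-window)

	fmt.Println("naive window:      ", naiveStart.Format(time.RFC3339), "to", naiveEnd.Format(time.RFC3339))
	fmt.Println("delay-aware window:", start.Format(time.RFC3339), "to", end.Format(time.RFC3339))
}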

3. Finally, Loggregator is designed to drop envelopes in order to maintain performance under heavy load. When Loggregator is saturated with logs, it continues to deliver them, but some envelopes are dropped along the way. Autoscaler may walk LogCache within the timeout window, find that all available messages are up to date, and still not have a complete set of metric envelopes for that scaling window.
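One way to tell whether this third cause is in play during a load test is to compare the number of timer envelopes you expect for the scaling window against the number LogCache actually returns. The Go sketch below does that comparison with placeholder numbers; how you count the observed envelopes is up to you, and the 1% threshold is only an assumption.

package main

import "fmt"

func main() {
	// Both figures are placeholders: the request rate comes from your load test,
	// and the observed count is however many timer envelopes LogCache returned
	// for the scaling window.
	const (
		knownRequestRate = 150 // requests per second driven at the app
		windowSeconds    = 120 // Autoscaler's 2-minute scaling window
	)
	expected := knownRequestRate * windowSeconds
	observed := 11200 // placeholder for the envelope count actually returned

	lossPct := 100 * float64(expected-observed) / float64(expected)
	fmt.Printf("expected %d timer envelopes, observed %d (%.1f%% missing)\n",
		expected, observed, lossPct)
	if lossPct > 1 {
		// Assumed threshold; sustained loss suggests Loggregator is saturated
		// and should be scaled up, per the Resolution section below.
		fmt.Println("significant envelope loss detected: consider scaling up Loggregator")
	}
}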

Resolution

The first and second issues can be resolved by upgrading to 2.7.15+, 2.8.9+ or 2.9.3+.

The third issue can be remedied by scaling up Loggregator in your environment.