Traffic is not being evenly balanced across application instances

Article ID: 297659

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:
An application deployed to Cloud Foundry with multiple application instances is receiving traffic, but the traffic is not evenly balanced across those instances.

Environment


Cause

Traffic is balanced across application instances using round-robin. This generally results in an even distribution of requests, but several factors can prevent it. The common causes are listed below, with the most common first.


Sticky Sessions

By default, Cloud Foundry supports sticky sessions if your application uses a session cookie named JSESSIONID (this is the default for Java applications).

When this session cookie is in use, the Gorouter sends all traffic for a given JSESSIONID to the same backend application instance. If a user's first request, the one that created the session, went to application instance #3, that user will keep going to application instance #3 for the duration of the session. Because sessions can be long-lived and user sessions do not all last the same amount of time, traffic can become unbalanced over time.
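The drift described above can be illustrated with a small simulation. This is a toy model, not Gorouter's implementation: the function name `simulate`, the session-length distribution, and the instance and session counts are all made up for illustration. It assigns each new session round-robin and then pins every later request in that session to the same instance.

```python
import random
from collections import Counter


def simulate(num_instances=4, num_sessions=40, seed=0):
    """Toy model of round-robin routing combined with sticky sessions.

    Returns (sessions_per_instance, requests_per_instance).
    """
    rng = random.Random(seed)
    sessions_per_instance = Counter()
    requests_per_instance = Counter()

    for session_number in range(num_sessions):
        # The first request of a session is placed round-robin.
        instance = session_number % num_instances
        sessions_per_instance[instance] += 1

        # Hypothetical session lengths: most are short, a few are very long.
        session_requests = rng.choice([1, 5, 50, 500])

        # Every request in the session sticks to the same instance.
        requests_per_instance[instance] += session_requests

    return sessions_per_instance, requests_per_instance


if __name__ == "__main__":
    sessions, requests = simulate()
    for instance in sorted(sessions):
        print(f"instance #{instance}: "
              f"{sessions[instance]} sessions, {requests[instance]} requests")
```

In this model the sessions are spread evenly, but the request counts usually are not, because a handful of long-lived sessions dominate the instances they happen to land on.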


Unbalanced workloads

Not all requests are the same. A request to serve a static resource will be very fast, but a request to do a complicated task, like generating a large report, may take a long time. If you have mixed-duration workloads in the same application, this can cause traffic to become unbalanced over time.


Application instance failures

If there is a problem on an individual application instance and that instance does not respond to a request, Gorouter will, under certain circumstances, resend this request to a different application instance. This can cause more traffic to build up on the application instances that are functioning properly.


Diego Cell failures

In some cases, problems can occur when an app is running on a Cell but unable to receive traffic. Because it's running successfully, the app is considered started and stable in Cloud Controller. However, Gorouter cannot deliver traffic to the app instance on the problem Cell. This causes the Gorouter to retry and deliver the traffic to another application instance. This can cause more traffic to build up on the application instances that are running on functioning Cells.


Gorouter Instance Selection Header

While unusual, it is technically possible for a client to request a specific backend app instance by using the X-CF-APP-INSTANCE header. If present, this header tells Gorouter to direct the request to a specific app instance. This is normally useful for testing and monitoring purposes, but if used incorrectly it can skew requests toward one particular application instance.

Resolution

Note: The resolutions below are in the same order as the causes listed above.


Sticky Sessions

You can disable sticky sessions by simply changing the name of your session cookie. If it's not JSESSIONID, then sessions will not be sticky. Please be aware that disabling sticky sessions can cause sessions to be lost if you are not replicating session data across all of your application instances. Session data replication is typically done by using a service like Memcached, Redis, or Apache Geode to host your session data.

You can change the session name in a Spring Boot 2+ app by setting
`server.servlet.session.cookie.name = NEWSESSIONID`.
You can also use Spring Session, which uses a different session cookie name by default and also supports storing session data in a service.

Lastly, if you deploy a standard WAR file, you can use the Java Buildpack's Tomcat Session Replication support. This will both store session data in a service and properly configure your session name.


Unbalanced workloads

Unbalanced workloads are not easy to remedy. The fix typically has to be made at the application layer, or by splitting the work into two separate applications, one for slow workloads and one for fast workloads. You can then use context path routes to send traffic for each endpoint to the appropriate app.


Application instance failures

To confirm that each application instance responds, send a request to each instance individually by including the X-CF-APP-INSTANCE header. If you find a failed instance, you can restart that individual application instance with the command `cf restart-app-instance`. Before restarting, you may want to debug (capture logs, thread dumps, and application state) to try to understand why the instance is in this state.
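A per-instance check of this kind could be scripted roughly as follows. This is a sketch under assumptions: the header value is taken to be `APP_GUID:INSTANCE_INDEX` (the GUID can be looked up with `cf app APP_NAME --guid`), and the function names, URL, and timeout are hypothetical placeholders to adapt to your environment.

```python
import urllib.error
import urllib.request


def build_probe(url, app_guid, index):
    """Build a GET request pinned to one instance via X-CF-APP-INSTANCE.

    Assumes the header value format APP_GUID:INSTANCE_INDEX.
    """
    request = urllib.request.Request(url, method="GET")
    request.add_header("X-CF-APP-INSTANCE", f"{app_guid}:{index}")
    return request


def probe_instances(url, app_guid, instance_count, timeout=5.0):
    """Send one request to each instance index and record the outcome."""
    results = {}
    for index in range(instance_count):
        request = build_probe(url, app_guid, index)
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                results[index] = response.status
        except urllib.error.HTTPError as err:
            # The instance (or the router) answered with an error status.
            results[index] = err.code
        except OSError as err:
            # Connection failures and timeouts land here.
            results[index] = f"error: {err}"
    return results
```

For example, `probe_instances("https://my-app.example.com/health", "<app-guid>", 4)` would return a mapping from instance index to HTTP status or error text. The URL and path here are placeholders; any instance that reports an error or times out is a candidate for closer inspection before a restart.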


Diego Cell failures

If you see multiple applications or application instances running on the same Cell having issues, then it's possible that the issue is at the Cell level and not the application instance. In this case, Pivotal recommends gathering logs from the Diego Brain, Diego BBS, and the impacted Diego Cell. Please provide these to Pivotal Support in a Support Ticket and the team can help diagnose the issue.

In most cases, doing a `bosh restart` or `bosh recreate` on the Cell will resolve the issue. However, Pivotal Support recommends refraining from this action until directed by Pivotal Support, as it can often make a root cause analysis impossible.


Gorouter Instance Selection Header

To resolve, you need to locate the client that is setting the X-CF-APP-INSTANCE header and update the client to not set this header. To locate the client, you can look at the output of `cf logs` for the imbalanced application. The `[RTR]` log entries from the Gorouter include the remote client's IP (it is the leftmost IP in the `x_forwarded_for` group) and the application instance (labeled as `app_index`) that handled the request. If you notice a single client that is only sending requests to one application instance, that client is likely the culprit.
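When there are many log lines, a short script can do this tally. The sketch below assumes the router entries contain `x_forwarded_for:"..."` and `app_index:` fields as described above; the regular expressions, function names, and the `min_requests` threshold are illustrative choices, and the exact log layout may differ between platform versions.

```python
import re
from collections import Counter, defaultdict

# Assumed field layout; quotes around the app_index value are treated as optional.
XFF_RE = re.compile(r'x_forwarded_for:"([^"]*)"')
APP_INDEX_RE = re.compile(r'app_index:"?(\d+)"?')


def tally_by_client(log_lines):
    """Count requests per (client IP, app_index) from Gorouter log lines."""
    per_client = defaultdict(Counter)
    for line in log_lines:
        if "[RTR" not in line:
            continue  # skip application and platform log lines
        xff = XFF_RE.search(line)
        app_index = APP_INDEX_RE.search(line)
        if not xff or not app_index:
            continue
        # The leftmost entry in x_forwarded_for is the remote client.
        client_ip = xff.group(1).split(",")[0].strip()
        per_client[client_ip][int(app_index.group(1))] += 1
    return dict(per_client)


def single_instance_clients(per_client, min_requests=20):
    """Return client IPs whose requests all went to one instance."""
    return [
        ip
        for ip, counts in per_client.items()
        if len(counts) == 1 and sum(counts.values()) >= min_requests
    ]
```

The input can be the lines of saved `cf logs` output read from a file. Note that a client with a sticky session will also appear to hit a single instance, so treat the result as a list of candidates to investigate rather than proof that the header is being set.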