In the Networking settings of the Elastic Runtime tile, you may be unable to pass the Spring Cloud Services (SCS) smoke tests after changing Router Timeout to Backends (in seconds) to XXX seconds, a value under the 4-minute default idle timeout of Azure networking.
Background
The HAProxy and Gorouter use the same top-level manifest property:
properties:
  request_timeout_in_seconds: 900
The Elastic Runtime tile lets you configure this property, and the value is applied to both HAProxy and Gorouter. For Azure deployments, keep this value under 4 minutes (240 seconds). Our testing uses a value of 160 seconds.
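The following is a minimal shell sketch for checking this value before deploying. It assumes the Elastic Runtime manifest has already been saved to a local file (for example with "bosh -d <deployment> manifest > manifest.yml" on the BOSH CLI v2). The function names get_request_timeout and below_azure_idle_timeout are hypothetical, and the awk extraction assumes the simple "key: value" layout shown above rather than fully parsing YAML.

```shell
#!/bin/bash
# Sketch: compare the shared request_timeout_in_seconds value in a saved
# manifest with Azure's 4-minute idle timeout.
# Assumption: the manifest was saved to a local YAML file beforehand.

AZURE_IDLE_TIMEOUT_SECONDS=240   # Azure default idle timeout (4 minutes)

# Print the first request_timeout_in_seconds value found in the file.
get_request_timeout() {
  awk -F': *' '/request_timeout_in_seconds:/ { print $2; exit }' "$1"
}

# Succeed only if the given number of seconds is strictly below the Azure idle timeout.
below_azure_idle_timeout() {
  [ "$1" -lt "$AZURE_IDLE_TIMEOUT_SECONDS" ]
}

# Run the check only when a manifest path is passed on the command line.
if [ "$#" -ge 1 ]; then
  timeout=$(get_request_timeout "$1")
  if [ -z "$timeout" ]; then
    echo "request_timeout_in_seconds not found in $1"
    exit 1
  fi
  if below_azure_idle_timeout "$timeout"; then
    echo "OK: request_timeout_in_seconds is $timeout"
  else
    echo "WARNING: request_timeout_in_seconds is $timeout; keep it under $AZURE_IDLE_TIMEOUT_SECONDS on Azure"
  fi
fi
```

With the example value of 900 the script prints a warning, while the tested value of 160 passes.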
The original problem that led to the investigation involved intermittent failures of Spring Cloud Services (SCS) connections during smoke test runs. The problem is even more likely when real users create and update service instances, because the SCS broker can easily sit idle for more than 4 minutes between user requests. It was first identified when the SCS service broker logs showed the message “Timed out waiting for a connection”.
After an investigation involving SCS, Ecosystem, Diego, BOSH, Garden, and Microsoft Azure team members, the underlying issue was found. Any resource with a Public IP endpoint on Azure, such as an Azure Load Balancer (ALB), has a default idle connection timeout of 4 minutes. When Azure detects that a connection has been idle for more than 4 minutes, it closes the connection without sending a TCP reset (RST) to inform the client that the connection has been closed.
The setting involved is the timeout for connections from the Gorouter or HAProxy to applications and system components. You can increase it to accommodate larger uploads over high-latency connections, but on Azure it must still stay below the 4-minute idle timeout.
There are two solutions to this problem, described below.
Increase the idle timeout of all Azure Public IPs from 4 minutes to 30 minutes.
You can do this in the Azure Portal under Load Balancers -> Load Balancing Rules. Open each load balancer, then each of its load balancing rules, and raise "Idle Timeout (minutes)" from 4 to 30.
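Clicking through every load balancer and rule in the portal is tedious, so here is a minimal sketch of the same steps using the Azure CLI 2.0 (az) instead of the portal. It assumes you have already run "az login" and that the az network lb list, az network lb rule list, and az network lb rule update commands accept the flags shown; the function name set_lb_rule_idle_timeouts is hypothetical. Treat it as a starting point, not a tested procedure.

```shell
#!/bin/bash
# Sketch: scripted version of the portal steps using Azure CLI 2.0 (az).
# Raises "Idle Timeout (minutes)" on every load balancing rule of every
# load balancer in a resource group. Assumes "az login" has been run.

# set_lb_rule_idle_timeouts <resource group> [minutes, default 30]
set_lb_rule_idle_timeouts() {
  local rg="$1"
  local minutes="${2:-30}"
  local lb rule
  # "-o tsv" prints one name per line; Azure resource names contain no spaces.
  for lb in $(az network lb list -g "$rg" --query "[].name" -o tsv); do
    for rule in $(az network lb rule list -g "$rg" --lb-name "$lb" --query "[].name" -o tsv); do
      az network lb rule update -g "$rg" --lb-name "$lb" -n "$rule" --idle-timeout "$minutes"
    done
  done
}

# Run only when a resource group is passed on the command line.
if [ "$#" -ge 1 ]; then
  set_lb_rule_idle_timeouts "$@"
fi
```

For example, saving it as set-lb-timeouts.sh (a hypothetical name) and running "./set-lb-timeouts.sh pcfresourcegroup" would update every rule in that resource group to 30 minutes.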
Alternatively, you can work around this problem with a script that runs the following classic Azure CLI (azure) commands:
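#!/bin/bash
# Usage: ./<script name>.sh <resource group name>
# Sets the idle timeout of every public IP in the resource group to 30 minutes
# using the classic Azure CLI ("azure").
rg="$1"
for pip in $(azure network public-ip list -g "$rg" | grep 'data' | tail -n +3 | cut -f 5 -d ' '); do
  azure network public-ip set -g "$rg" -n "$pip" -i 30
done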
To run the script, pass the resource group name as its only argument:
./<arbitrary script name>.sh <resource group name>
For example, if the script is saved as set-timeouts.sh and the resource group is named pcfresourcegroup:
./set-timeouts.sh pcfresourcegroup
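If only the newer Azure CLI 2.0 (az) is installed, the script above can be sketched as follows. This is an assumed translation, not a tested replacement: it relies on az network public-ip list and az network public-ip update --idle-timeout, and the function name set_pip_idle_timeouts is hypothetical. It also replaces the text parsing of the classic CLI's table output with a JMESPath query.

```shell
#!/bin/bash
# Sketch: Azure CLI 2.0 (az) version of the classic "azure" script.
# Raises the idle timeout of every public IP in a resource group.
# Assumption: you are already logged in with "az login".

# set_pip_idle_timeouts <resource group> [minutes, default 30]
set_pip_idle_timeouts() {
  local rg="$1"
  local minutes="${2:-30}"   # 30 minutes matches the value used in this article
  local pip
  # --query "[].name" with -o tsv prints one public IP name per line
  for pip in $(az network public-ip list -g "$rg" --query "[].name" -o tsv); do
    az network public-ip update -g "$rg" -n "$pip" --idle-timeout "$minutes"
  done
}

# Run only when a resource group is passed on the command line.
if [ "$#" -ge 1 ]; then
  set_pip_idle_timeouts "$@"
fi
```

Querying names directly avoids depending on column positions in human-readable output, which is the fragile part of the grep/tail/cut pipeline.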
As an alternative to raising the idle timeout of all Azure Public IPs to 30 minutes, you can resolve this issue by setting "Frontend Idle Timeout for Gorouter and HAProxy" to less than 4 minutes (240 seconds).
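A quick way to sanity-check a proposed value is to compare it, in seconds, with the Azure idle timeout, which is expressed in minutes. The sketch below assumes the intent is for Gorouter and HAProxy to close idle connections themselves before Azure silently drops them; the function name frontend_timeout_ok is hypothetical, and the Azure value defaults to the 4-minute Azure default.

```shell
#!/bin/bash
# Sketch: check that a proposed "Frontend Idle Timeout for Gorouter and
# HAProxy" (seconds) is shorter than the Azure idle timeout (minutes).
# Assumption: the goal is for Gorouter/HAProxy to close idle connections
# before Azure drops them without a TCP reset.

# frontend_timeout_ok <frontend seconds> [azure idle minutes, default 4]
frontend_timeout_ok() {
  local frontend_seconds="$1"
  local azure_minutes="${2:-4}"
  local azure_seconds=$(( azure_minutes * 60 ))   # 4 minutes -> 240 seconds
  [ "$frontend_seconds" -lt "$azure_seconds" ]
}

# Run only when a value is passed on the command line.
if [ "$#" -ge 1 ]; then
  if frontend_timeout_ok "$@"; then
    echo "OK: $1 s is below the Azure idle timeout"
  else
    echo "Too long: $1 s is not below the Azure idle timeout"
  fi
fi
```

With the default of 4 minutes, 160 seconds passes and 240 seconds does not; passing 30 as the second argument models the raised Public IP timeout from the first solution.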
To do this, follow the instructions below: