vCenter workloads may fail, or the customer may be unable to log in to vCenter.
All services are up and running, no core dumps are created, and a reboot resolves the issue, but the cause is not apparent.
Symptoms
<date && time> INFO websso[71:tomcat-http--33] [CorId=487fd2f5-e5c1-4592-b292-12345677890] [com.vmware.identity.samlservice.impl.ExternalIdpProvider] Got exception (sleeping before retry)
com.vmware.vapi.client.exception.TransportProtocolException: HTTP response with status code 503 (enable debug logging for details): envoy overloaded
vSphere 8.X
vSphere 9.0
VCF 5.X
VCF 9.0
The envoy-sidecar service is limited to 1 GB of heap memory. The limit is set in /etc/vmware-envoy-sidecar/config.yaml:
# grep -C2 1073741824 /etc/vmware-envoy-sidecar/config.yaml
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 1073741824 # 1GB
  actions:
    - name: "envoy.overload_actions.disable_http_keepalive"
When it reaches 98% of this memory, it starts sending overload responses, which may cause failures in the vCenter internal workloads.
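The byte value that corresponds to the 98% threshold can be worked out with shell arithmetic. The sketch below is only an illustration: heap_threshold_bytes is a hypothetical helper name, not a vCenter tool, and it assumes the limit is expressed in whole gibibytes (1 GB here means 1073741824 bytes, as in config.yaml).

```shell
#!/bin/sh
# Hypothetical helper (not part of vCenter): print PERCENT% of a heap
# limit of GIB gibibytes, in bytes, using integer arithmetic.
heap_threshold_bytes() {
    gib="$1"
    percent="$2"
    echo $(( gib * 1024 * 1024 * 1024 * percent / 100 ))
}

heap_threshold_bytes 1 98   # prints 1052266987
heap_threshold_bytes 2 98   # prints 2104533975
heap_threshold_bytes 4 98   # prints 4209067950
```

The first value, 1052266987, is the number used in the 8.0U3 check further down. If the heap limit is later raised, the comparison value for that check would presumably need to be raised to match.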
You can identify the problem using the following commands:
zgrep "503 overload" /var/log/vmware/envoy-sidecar/envoy-access-* | wc -l
If the result is greater than 0, run the command that matches your version:
On vCenter 8.0U3 and VCF 5.x:
zgrep 'envoy_server_memory_heap_size{}' /var/cache/vmware-rhttpproxy/envoy-sidecar-stats/* | cut -d ' ' -f2 | sort -n | uniq | tail -1 | awk '{print $1 >= 1052266987}'
On vCenter 9.0, VCF 9.x:
zgrep envoy_overload_envoy_resource_monitors_fixed_heap_pressure /var/log/vmware/vstats/metrics/ENVOY_SIDECAR* | grep -v "# TYPE" | cut -d ' ' -f2| sort -n | uniq | tail -1 | awk '{print $1 >= 98}'
If the command prints 1, envoy-sidecar has reached its memory limit.
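Both checks follow the same pattern: extract the metric values, take the largest, and compare it with a threshold. A minimal sketch of that last step as a reusable function is shown here. The name max_value_reached is hypothetical, and the function assumes it receives one numeric value per line on stdin, which is what the cut stage of the commands above produces.

```shell
#!/bin/sh
# Hypothetical helper: read one numeric value per line on stdin and print 1
# if the largest value is greater than or equal to the threshold in $1,
# otherwise print 0. Non-numeric lines are ignored.
max_value_reached() {
    awk -v t="$1" '
        $1 ~ /^[0-9][0-9.eE+-]*$/ {
            v = $1 + 0
            if (!seen || v > max) { max = v; seen = 1 }
        }
        END {
            r = (seen && max >= t + 0) ? 1 : 0
            print r
        }
    '
}

# Possible usage on vCenter 8.0U3 (paths taken from the commands above):
# zgrep 'envoy_server_memory_heap_size{}' /var/cache/vmware-rhttpproxy/envoy-sidecar-stats/* \
#     | cut -d ' ' -f2 | max_value_reached 1052266987
```

Unlike the one-line pipeline, this variant prints 0 instead of nothing when no samples are found, which may be easier to use in a script.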
This issue will be addressed in a future release of 8.0 and 9.0.
We recommend patching to the latest vCenter release. If the issue still reproduces, apply the following workaround:
Workaround:
Back up the configuration file:
# cp /etc/vmware-envoy-sidecar/config.yaml /etc/vmware-envoy-sidecar/config.yaml.back
Increase the heap limit from 1 GB to 2 GB and restart the service:
# sed -i 's/max_heap_size_bytes: 1073741824/max_heap_size_bytes: 2147483648/g' /etc/vmware-envoy-sidecar/config.yaml
# service-control --restart envoy-sidecar
If the issue persists, increase the heap limit from 2 GB to 4 GB and restart the service again:
# sed -i 's/max_heap_size_bytes: 2147483648/max_heap_size_bytes: 4294967296/g' /etc/vmware-envoy-sidecar/config.yaml
# service-control --restart envoy-sidecar
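The sed commands above only match when the file still contains the exact previous value. As an illustration, the same change could be wrapped in a small function that replaces whatever value is currently set. This is a sketch under assumptions: set_heap_limit is a hypothetical name, GNU sed (for -i) is assumed, and the function takes the file path as an argument so it can be tried on a copy first.

```shell
#!/bin/sh
# Hypothetical helper: set max_heap_size_bytes in FILE to NEW_BYTES.
# Keeps a one-time backup as FILE.back and does not restart any service.
set_heap_limit() {
    file="$1"
    new_bytes="$2"
    # Reject anything that is not a plain non-negative integer.
    case "$new_bytes" in
        ''|*[!0-9]*) echo "invalid byte value: $new_bytes" >&2; return 2 ;;
    esac
    # Do not overwrite an existing backup on repeated runs.
    [ -e "$file.back" ] || cp -p "$file" "$file.back" || return 1
    sed -i "s/max_heap_size_bytes:[[:space:]]*[0-9][0-9]*/max_heap_size_bytes: ${new_bytes}/" "$file"
}

# Possible usage, followed by the restart shown above:
# set_heap_limit /etc/vmware-envoy-sidecar/config.yaml 2147483648
```

Any trailing comment on the line, such as "# 1GB", is left unchanged and would need to be updated by hand.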
In some corner cases even 4 GB is not enough. In that case, we recommend completely removing these two actions:
    - name: "envoy.overload_actions.stop_accepting_requests"
      triggers:
        - name: "envoy.resource_monitors.global_downstream_max_connections"
          threshold:
            value: 0.99
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.98
    - name: "envoy.overload_actions.reject_incoming_connections"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 1.00
Using vim:
# vim /etc/vmware-envoy-sidecar/config.yaml
After the two actions are deleted, the entire overload_manager section of the YAML file should look like this:
overload_manager:
  refresh_interval: 1s
  resource_monitors:
    - name: "envoy.resource_monitors.global_downstream_max_connections"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.downstream_connections.v3.DownstreamConnectionsConfig
        max_active_downstream_connections: 8000
    - name: "envoy.resource_monitors.fixed_heap"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 4294967296 # 4GB
  actions:
    - name: "envoy.overload_actions.shrink_heap"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.75
    - name: "envoy.overload_actions.disable_http_keepalive"
      triggers:
        - name: "envoy.resource_monitors.global_downstream_max_connections"
          threshold:
            value: 0.8
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.95
    - name: "envoy.overload_actions.reduce_timeouts"
      triggers:
        - name: "envoy.resource_monitors.global_downstream_max_connections"
          scaled:
            scaling_threshold: 0.25
            saturation_threshold: 0.97
        - name: "envoy.resource_monitors.fixed_heap"
          scaled:
            scaling_threshold: 0.85
            saturation_threshold: 0.97
      typed_config:
        "@type": type.googleapis.com/envoy.config.overload.v3.ScaleTimersOverloadActionConfig
        timer_scale_factors:
          - timer: HTTP_DOWNSTREAM_CONNECTION_IDLE
            min_timeout: 2s
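A quick way to confirm that both actions are gone after the manual edit is to search the file for their names. The function below is a minimal sketch. actions_removed is a hypothetical name, and the check only looks for the two action names anywhere in the file; it does not validate the YAML structure.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) only if FILE contains neither the
# stop_accepting_requests nor the reject_incoming_connections action.
actions_removed() {
    file="$1"
    if grep -Eq 'envoy\.overload_actions\.(stop_accepting_requests|reject_incoming_connections)' "$file"; then
        return 1
    fi
    return 0
}

# Possible usage:
# actions_removed /etc/vmware-envoy-sidecar/config.yaml && echo "both actions removed"
```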
Save the file and restart the sidecar service:
# service-control --restart envoy-sidecar