vCenter is unresponsive, vAPI endpoint service degraded, and logins fail until a reboot when envoy-sidecar hits memory limit.

Article ID: 384498

Products

VMware vCenter Server
VMware vCenter Server 8.0

Issue/Introduction

vCenter workloads may fail, or users may be unable to log in to vCenter.
Services are up and running and no core dumps are created. A reboot resolves the issue temporarily, but the reason for the failure is not apparent.


Symptoms

  • Users are unable to log in to vCenter until it is rebooted. The issue recurs some time afterwards.
  • Some services report healthy with warnings, for example:
    • vAPI Endpoint
      • Failed to retrieve SSO settings
      • Failed to login in SSO
      • Failed to retrieve VIM service URI from Lookup Service
  • The License, vAPI Endpoint, and VMware vSphere Profile-Driven Storage services go into a degraded state (healthy with warnings).
  • websso.log contains envoy overloaded messages:
    <date && time> INFO websso[71:tomcat-http--33] [CorId=487fd2f5-e5c1-4592-b292-12345677890] [com.vmware.identity.samlservice.impl.ExternalIdpProvider] Got exception (sleeping before retry)
       com.vmware.vapi.client.exception.TransportProtocolException: HTTP response with status code 503 (enable debug logging for details): envoy overloaded
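
A quick way to confirm this symptom is to count the overload messages in websso.log. The following is a minimal sketch: `count_envoy_overloaded` is a hypothetical helper name, and the log file path is passed as an argument because its location is not stated in this article.

```shell
# Hypothetical helper: count lines containing "envoy overloaded"
# across one or more log files given as arguments.
count_envoy_overloaded() {
    # Concatenate first so grep -c prints a single total
    # instead of one count per file.
    cat -- "$@" | grep -c 'envoy overloaded'
}

# Example (the path is a placeholder, adjust to your environment):
# count_envoy_overloaded /path/to/websso.log
```

A non-zero count only confirms the symptom; the heap checks in the Cause section are still needed to confirm the memory limit as the reason.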

Environment

vSphere 8.X 
vSphere 9.0
VCF 5.X 
VCF 9.0

Cause

Envoy-sidecar is limited to 1 GB of heap memory. The limit is set in /etc/vmware-envoy-sidecar/config.yaml:

# grep -C2 1073741824 /etc/vmware-envoy-sidecar/config.yaml
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 1073741824 # 1GB
  actions:
    - name: "envoy.overload_actions.disable_http_keepalive"


When heap usage reaches 98% of this limit, envoy-sidecar starts sending overload responses, which can cause failures in vCenter's internal workloads.
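
The 98% trigger translates into a byte count that depends on the configured limit. As a small worked example, the shell arithmetic below derives that value; `heap_threshold_bytes` is a hypothetical helper name, not part of vCenter.

```shell
# Hypothetical helper: bytes at which a given percentage of the
# heap limit is reached (integer arithmetic, rounded down).
heap_threshold_bytes() {
    limit_bytes=$1
    percent=${2:-98}   # defaults to the 98% trigger described above
    echo $(( limit_bytes * percent / 100 ))
}

heap_threshold_bytes 1073741824   # 1 GB limit -> 1052266987
heap_threshold_bytes 2147483648   # 2 GB limit -> 2104533975
```

For the default 1 GB limit this gives 1052266987, the constant used in the heap-size check for 8.0U3. If the limit has been raised, that constant presumably needs to be recalculated in the same way.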

You can identify the problem with the following commands:

zgrep "503 overload" /var/log/vmware/envoy-sidecar/envoy-access-* | wc -l

If the result is not 0, run the command that matches your version:

On vCenter 8.0U3 and VCF 5.x:

zgrep 'envoy_server_memory_heap_size{}' /var/cache/vmware-rhttpproxy/envoy-sidecar-stats/* | cut -d ' ' -f2 | sort -n | uniq | tail -1 | awk '{print ($1 >= 1052266987)}'

On vCenter 9.0 and VCF 9.x:

zgrep envoy_overload_envoy_resource_monitors_fixed_heap_pressure /var/log/vmware/vstats/metrics/ENVOY_SIDECAR* | grep -v "# TYPE" | cut -d ' ' -f2 | sort -n | uniq | tail -1 | awk '{print ($1 >= 98)}'

If the command prints 1, envoy-sidecar has hit its memory limit.
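
The two version-specific checks share the same logic: take the largest sampled value and compare it with a threshold. A minimal sketch of that logic as a reusable function follows; `heap_limit_hit` is a hypothetical name, and it assumes the value is the second whitespace-separated field, as the `cut -f2` in the commands above implies.

```shell
# Hypothetical helper: reads metric lines ("<name> <value> ...") on
# stdin and prints 1 if the largest value is >= the threshold given
# as the first argument, otherwise 0.
heap_limit_hit() {
    awk -v threshold="$1" '
        # Non-numeric second fields (for example "# TYPE" lines)
        # evaluate to 0 and never raise the maximum.
        $2 + 0 > max { max = $2 + 0 }
        END { hit = (max + 0 >= threshold + 0); print hit }
    '
}

# Example for 8.0U3 / VCF 5.x, same files and threshold as above:
# zgrep 'envoy_server_memory_heap_size{}' /var/cache/vmware-rhttpproxy/envoy-sidecar-stats/* | heap_limit_hit 1052266987
```

Passing the threshold as an argument keeps the same function usable for the percentage-based metric (threshold 98) and for byte counts after the limit has been changed.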

Resolution

This issue will be addressed in a future release of 8.0 and 9.0.
The recommendation is to patch to the latest vCenter release. If the issue is still reproducible, apply the workaround below:

Workaround:

  1. Take snapshots of the vCenter or ELM vCenter group. See "VMware vCenter in Enhanced Linked Mode pre-changes snapshot (online or offline) best practice" for guidance on taking snapshots of vCenters in ELM.

  2. Log in to the vCenter via SSH.

  3. Create a backup of the envoy sidecar config file:
    # cp /etc/vmware-envoy-sidecar/config.yaml /etc/vmware-envoy-sidecar/config.yaml.back

  4. Using sed, update the Envoy memory limit from 1073741824 (1 GB) to 2147483648 (2 GB):
    # sed -i 's/max_heap_size_bytes: 1073741824/max_heap_size_bytes: 2147483648/g' /etc/vmware-envoy-sidecar/config.yaml

  5. Restart envoy-sidecar:
    # service-control --restart envoy-sidecar

  6. Some cases have shown that 2 GB is not enough. In such cases, raise the limit from 2 GB to 4 GB:
    # sed -i 's/max_heap_size_bytes: 2147483648/max_heap_size_bytes: 4294967296/g' /etc/vmware-envoy-sidecar/config.yaml
    # service-control --restart envoy-sidecar

  7. In some corner cases even 4 GB is not enough. In that case, completely remove these two actions:

        - name: "envoy.overload_actions.stop_accepting_requests"
          triggers:
            - name: "envoy.resource_monitors.global_downstream_max_connections"
              threshold:
                value: 0.99
            - name: "envoy.resource_monitors.fixed_heap"
              threshold:
                value: 0.98
        - name: "envoy.overload_actions.reject_incoming_connections"
          triggers:
            - name: "envoy.resource_monitors.fixed_heap"
              threshold:
                value: 1.00

    Using vim:
    # vim /etc/vmware-envoy-sidecar/config.yaml

    After the two actions are deleted, the entire section for overload manager in the yaml file should look like this:

    overload_manager:
      refresh_interval: 1s
      resource_monitors:
        - name: "envoy.resource_monitors.global_downstream_max_connections"
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.resource_monitors.downstream_connections.v3.DownstreamConnectionsConfig
            max_active_downstream_connections: 8000
        - name: "envoy.resource_monitors.fixed_heap"
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
            max_heap_size_bytes: 4294967296 # 4GB
      actions:
        - name: "envoy.overload_actions.shrink_heap"
          triggers:
            - name: "envoy.resource_monitors.fixed_heap"
              threshold:
                value: 0.75
        - name: "envoy.overload_actions.disable_http_keepalive"
          triggers:
            - name: "envoy.resource_monitors.global_downstream_max_connections"
              threshold:
                value: 0.8
            - name: "envoy.resource_monitors.fixed_heap"
              threshold:
                value: 0.95
        - name: "envoy.overload_actions.reduce_timeouts"
          triggers:
            - name: "envoy.resource_monitors.global_downstream_max_connections"
              scaled:
                scaling_threshold: 0.25
                saturation_threshold: 0.97
            - name: "envoy.resource_monitors.fixed_heap"
              scaled:
                scaling_threshold: 0.85
                saturation_threshold: 0.97
          typed_config:
            "@type": type.googleapis.com/envoy.config.overload.v3.ScaleTimersOverloadActionConfig
            timer_scale_factors:
              - timer: HTTP_DOWNSTREAM_CONNECTION_IDLE
                min_timeout: 2s

    Save the file and restart the envoy-sidecar service:
    # service-control --restart envoy-sidecar

  8. Developers are working on a fix, which is planned for a future release.
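
Steps 3 to 6 can be combined into one repeatable helper, and the edit in step 7 can be verified before restarting. The sketch below is one possible way to do that, under these assumptions: the function names are hypothetical, GNU sed is available (for `-i` and `-E`), and the config file contains a single `max_heap_size_bytes` entry as shown above. The file path is a parameter so the helper can be tried on a copy first.

```shell
# Hypothetical helper: set max_heap_size_bytes in an envoy-sidecar
# config file to the given number of bytes.
set_envoy_heap_limit() {
    config=$1
    new_bytes=$2

    # Accept only digits.
    case $new_bytes in
        ''|*[!0-9]*) echo "invalid byte count: $new_bytes" >&2; return 1 ;;
    esac

    # Keep the first backup so the pristine file survives repeated runs.
    [ -e "$config.back" ] || cp -p "$config" "$config.back" || return 1

    # Replace the value whatever it currently is; a trailing comment
    # such as "# 1GB" is dropped so it cannot become stale.
    sed -i -E "s/(max_heap_size_bytes:)[[:space:]]*[0-9]+.*/\1 ${new_bytes}/" "$config" || return 1

    # Show the resulting line for a visual check.
    grep 'max_heap_size_bytes' "$config"
}

# Hypothetical check for step 7: succeeds only when neither of the
# two removed overload actions is still present in the file.
overload_actions_removed() {
    ! grep -Eq 'overload_actions\.(stop_accepting_requests|reject_incoming_connections)' "$1"
}

# Example:
# set_envoy_heap_limit /etc/vmware-envoy-sidecar/config.yaml 4294967296
# overload_actions_removed /etc/vmware-envoy-sidecar/config.yaml && service-control --restart envoy-sidecar
```

Matching on the key name rather than on a specific old value means the same call works whether the file currently holds the 1 GB or the 2 GB limit. The snapshot from step 1 is still required before any change.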