vCenter is unresponsive and the services are up, but envoy-sidecar hits its memory limit.

Article ID: 384498

Updated On:

Products

VMware vCenter Server

Issue/Introduction

vCenter workloads may fail, or users may be unable to log in to vCenter.
All services are up and running and no core dumps are created. A reboot resolves the issue, but the root cause is not apparent.

Symptoms
-> Services show as running in VAMI [:5480]
-> Some services report healthy with warnings, for example:
- The vAPI Endpoint service reports SSO errors: Failed to retrieve SSO settings / Failed to login in SSO / Failed to retrieve VIM service URI from Lookup Service
- The License, vAPI Endpoint and VMware vSphere Profile-Driven Storage services go into a degraded state [healthy with warnings]

Environment

vSphere 8.X 
VCF 5.X 

Cause

Envoy-sidecar is limited to 1 GB of heap memory. The limit is set in its configuration file:

# grep -C2 1073741824 /etc/vmware-envoy-sidecar/config.yaml
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 1073741824 # 1GB
  actions:
    - name: "envoy.overload_actions.disable_http_keepalive"


When heap usage reaches 98% of this limit, envoy-sidecar starts returning overload responses (HTTP 503), which may cause failures in vCenter internal workloads.
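
The 98% figure can be checked with plain shell arithmetic. The following is a minimal sketch; the function name `heap_threshold` is illustrative and not part of vCenter. For the default 1 GB limit it yields 1052266987, the same value the detection command in this article compares against.

```shell
# Hypothetical helper: print 98% of a heap limit given in bytes (integer math).
heap_threshold() {
    echo $(( $1 * 98 / 100 ))
}

heap_threshold 1073741824   # 1 GB limit -> 1052266987
heap_threshold 2147483648   # 2 GB limit -> 2104533975
heap_threshold 4294967296   # 4 GB limit -> 4209067950
```

Assuming the 98% trigger stays the same after the limit is raised, the comparison value used for detection would presumably need to be recalculated for the new limit.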
To identify the problem, run the following commands:

# zgrep "503 overload" /var/log/vmware/envoy-sidecar/envoy-access-* | wc -l
If the result is not 0, run:

# zgrep 'envoy_server_memory_heap_size{}' /var/cache/vmware-rhttpproxy/envoy-sidecar-stats/* | cut -d ' ' -f2 | sort -n | uniq | tail -1 | awk '{print ($1 >= 1052266987)}'
If the command returns 1, envoy-sidecar has hit its memory limit (1052266987 bytes is 98% of the 1 GB limit).
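
The two checks above can be combined into one verdict. This is a sketch under stated assumptions, not a supported tool: the function names are hypothetical, the paths and patterns are the ones used in the commands above, and the heap samples are assumed to be whole numbers of bytes.

```shell
#!/bin/bash
# Sketch only: all function names here are hypothetical.

# Count "503 overload" lines in the given (optionally gzipped) access logs.
count_overloads() {
    zgrep "503 overload" "$@" 2>/dev/null | wc -l
}

# Print the largest envoy_server_memory_heap_size{} sample in the given stats files.
peak_heap_bytes() {
    zgrep 'envoy_server_memory_heap_size{}' "$@" 2>/dev/null \
        | cut -d ' ' -f2 | sort -n | tail -1
}

# Succeed if the peak (arg 1) is at or above 98% of the limit (arg 2), both in bytes.
heap_limit_hit() {
    [ "$1" -ge $(( $2 * 98 / 100 )) ]
}

# Run both checks against the default 1 GB limit and report the result.
check_envoy_overload() {
    local overloads peak
    overloads=$(count_overloads /var/log/vmware/envoy-sidecar/envoy-access-*)
    peak=$(peak_heap_bytes /var/cache/vmware-rhttpproxy/envoy-sidecar-stats/*)
    if [ "$overloads" -gt 0 ] && heap_limit_hit "${peak:-0}" 1073741824; then
        echo "envoy-sidecar memory limit was hit (peak heap: $peak bytes)"
    else
        echo "no evidence that the envoy-sidecar memory limit was hit"
    fi
}
```

Nothing runs until `check_envoy_overload` is called from a shell on the vCenter appliance.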


The websso.log file also contains entries like the following:
<date && time> INFO websso[71:tomcat-http--33] [CorId=487fd2f5-e5c1-4592-b292-f987e3bda94e] [com.vmware.identity.samlservice.impl.ExternalIdpProvider] Got exception (sleeping before retry)
com.vmware.vapi.client.exception.TransportProtocolException: HTTP response with status code 503 (enable debug logging for details): envoy overloaded
   at com.vmware.vapi.internal.protocol.client.rpc.http.ApacheHttpUtil.validateHttpResponse(ApacheHttpUtil.java:101) ~[vapi-runtime-2.100.0.jar:?]
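
To check for these entries, the log can be searched for the "envoy overloaded" text. A minimal sketch follows; the helper name is hypothetical, and the log path in the comment is an assumption that may differ on your appliance.

```shell
# Hypothetical helper: count lines mentioning "envoy overloaded" in a log file.
count_envoy_overloaded() {
    grep -c "envoy overloaded" "$1"
}

# Assumed location of websso.log (not stated in this article); adjust as needed:
# count_envoy_overloaded /var/log/vmware/sso/websso.log
```

A non-zero count is consistent with the 503 overload responses described above, but it does not by itself confirm that the memory limit was reached.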

 

Resolution

The issue is being mitigated in vSphere 9 and in future 8.X releases.
The recommendation is to patch to the latest vCenter release. If the issue still occurs on that build, apply the workaround below:

Workaround:

  1. In ELM or VCF environments, follow the best practices for changes to vCenter in Enhanced Linked Mode - KB https://knowledge.broadcom.com/external/article/313886/vmware-vcenter-in-enhanced-linked-mode-p.html

  2. Log in to the vCenter Server via SSH

  3. Create a backup of the envoy sidecar config file:
    # cp /etc/vmware-envoy-sidecar/config.yaml /etc/vmware-envoy-sidecar/config.yaml.back

  4. Using sed, update the Envoy memory limit from 1073741824 (1 GB) to 2147483648 (2 GB):
    # sed -i 's/max_heap_size_bytes: 1073741824/max_heap_size_bytes: 2147483648/g' /etc/vmware-envoy-sidecar/config.yaml

  5. Restart envoy-sidecar:
    # service-control --restart envoy-sidecar

  6. [Addendum] In some cases the 2 GB limit is reached as well.
    The developer recommendation in such cases is to raise the limit from 2 GB to 4 GB. The command below assumes the limit is currently set to 2 GB:
    # sed -i 's/max_heap_size_bytes: 2147483648/max_heap_size_bytes: 4294967296/g' /etc/vmware-envoy-sidecar/config.yaml
    # service-control --restart envoy-sidecar
     
  7. Developers are working on a fix, which is planned for a future release.
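
The sed commands in steps 4 and 6 only match one specific current value. As an alternative, the sketch below sets the limit to any target regardless of the current value. `set_envoy_heap_limit` is a hypothetical helper, not a VMware tool, and it assumes GNU sed and a single `max_heap_size_bytes` entry in the file.

```shell
#!/bin/bash
# Sketch only: set_envoy_heap_limit is a hypothetical helper.
# Usage: set_envoy_heap_limit CONFIG_FILE NEW_LIMIT_BYTES
set_envoy_heap_limit() {
    local cfg=$1 new=$2
    # Refuse anything that is not a plain byte count.
    case $new in
        ''|*[!0-9]*) echo "limit must be an integer number of bytes" >&2; return 1 ;;
    esac
    # Same backup as step 3, taken only if no backup exists yet.
    [ -e "$cfg.back" ] || cp -p "$cfg" "$cfg.back" || return 1
    # Replace the current value, whatever it is, and drop any stale trailing comment.
    sed -i -E "s/(max_heap_size_bytes:[[:space:]]*)[0-9]+.*/\1${new}/" "$cfg"
}

# On the appliance, followed by the restart from step 5:
# set_envoy_heap_limit /etc/vmware-envoy-sidecar/config.yaml 2147483648
# service-control --restart envoy-sidecar
```

Because the backup is created only once, `config.yaml.back` keeps the original 1 GB configuration even if the limit is raised a second time.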