ESXi Hosts Show as "Not Responding" Due to Envoy Session Limits Exceeded by Replication Services

Article ID: 383231

Updated On:

Products

VMware vCenter Server

Issue/Introduction

  • Scenario 1:
    • In vCenter environments that use replication, ESXi hosts intermittently appear as "Not Responding" in vCenter Server while VMs remain operational. The vCenter web interface may become unresponsive or display HTTP 401 errors, and users accessing VMs through vCenter lose connectivity during these periods.
  • Scenario 2:
    • ESXi hosts repeatedly switch between the "Not Responding" and "Responding" states in vCenter Server. The hosts remain reachable via SSH. Restarting the management services on an affected host stops the behavior for a short while, after which the issue resumes.

Impact/Symptoms:

  • ESXi hosts show as "Not Responding" in vCenter Server.
  • The host management interface becomes unreachable, although the host still responds to ping.
  • VMs continue running without interruption
  • Issue may affect multiple hosts in sequence
  • Condition resolves temporarily after host reboot

Environment

  • VMware ESXi 7.x and newer
  • VMware vCenter Server 7.x and newer
  • VMware ESXi 8.x
  • VMware vCenter Server 8.x
  • SDDC Manager 5.x and newer
  • Environment using replication services (such as Veeam Replication, vSphere Replication or Nutanix CVM) 

Cause

  • Scenario 1:
    • Replication services can open more HTTPS sessions than they close, eventually exceeding what the ESXi host's envoy service can handle. The envoy service has a limit of 128 concurrent HTTPS sessions. When this limit is exceeded, connections between vCenter and the host fail.
      Evidence in host envoy logs:
      warning envoy[#######] [Originator@#### sub=filter] [Tags: "ConnectionId":"########"] remote https connections exceed max allowed: 128
  • Scenario 2:
    • SDDC Manager runs multiple operations, such as password updates/rotation and LCM tasks, that use a shared method to connect to ESXi via vCenter. This floods the ESXi host with too many connection requests.
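
To confirm the Scenario 1 evidence on a host, the warning quoted above can be counted in the envoy log. The following is a minimal sketch, not an official tool: the function name count_envoy_limit_hits is hypothetical, and the default log path /var/run/log/envoy.log is an assumption that may differ on your host (override it with the ENVOY_LOG variable).

```shell
#!/bin/sh
# Count "session limit exceeded" warnings in an envoy log.
# ASSUMPTION: the default path below may not match your host; set ENVOY_LOG to override.
ENVOY_LOG="${ENVOY_LOG:-/var/run/log/envoy.log}"

# count_envoy_limit_hits is a hypothetical helper name used for illustration.
count_envoy_limit_hits() {
    # grep -c prints the number of matching lines (0 when nothing matches).
    # "|| true" keeps the function from returning a failure status on zero matches.
    grep -c 'remote https connections exceed max allowed' "$1" || true
}

# Only report when the log file is actually readable on this system.
if [ -r "$ENVOY_LOG" ]; then
    echo "limit warnings in $ENVOY_LOG: $(count_envoy_limit_hits "$ENVOY_LOG")"
fi
```

A non-zero count suggests the host has hit the 128-session limit described in Scenario 1. A count of zero on a host that still flaps points toward Scenario 2 or another cause.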

Resolution

  • Scenario 1:
    • Immediate Workaround:
      1. Identify the replication service creating excessive connections by checking envoy.log for the source IP
      2. Temporarily disable the identified replication service
      3. Restart the envoy service on the affected host:
           /etc/init.d/envoy restart
    • Long-term Solution:
      1. Update replication software to latest version
      2. Configure replication jobs to limit concurrent sessions
      3. If issues persist:
        1. Implement firewall rules to limit concurrent connections from replication servers
        2. Contact replication software vendor for additional guidance
  • Scenario 2:
    • Resolution:
      1. Engineering is aware of the issue, and a fix is planned for the future. Subscribe to this KB article to be kept updated.
    • Workaround: 
      1. Restart the envoy service on the affected host:
           /etc/init.d/envoy restart
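
For step 1 of the Scenario 1 workaround (finding the source IP in envoy.log), one rough approach is to rank the IPv4-looking tokens in the log by how often they appear. This is only a sketch under stated assumptions: the helper name top_ipv4_in_log is hypothetical, and it assumes peer addresses appear in dotted-quad form somewhere in the log lines. It matches any such token, so local addresses or other dotted numbers may also show up and need to be filtered by eye.

```shell
#!/bin/sh
# Rank IPv4-looking tokens in a log file by frequency, most frequent first.
# ASSUMPTION: client addresses appear as dotted quads in the log; the exact
# envoy log layout is not taken from this article.
top_ipv4_in_log() {
    # -o prints each match on its own line; -E enables extended regex syntax.
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' "$1" \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{print $1, $2}'   # normalize to "<count> <address>"
}
```

For example, `top_ipv4_in_log /var/run/log/envoy.log | head -n 5` would list the five most frequent addresses, assuming that path is where the host keeps its envoy log. An address belonging to a replication server at the top of the list is a candidate for the service to disable in step 2.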

Additional Information