Envoy Session Limits Exceeded by Replication Services causing ESXi Hosts to show as 'Not Responding'

Article ID: 383231

Updated On:

Products

VMware vCenter Server

Issue/Introduction

  • In vCenter Server environments utilizing replication, ESXi hosts intermittently enter a "Not Responding" state, resulting in a loss of VM console connectivity. Simultaneously, the vCenter Server web interface may become unresponsive or return HTTP 401 Unauthorized errors, even though the Virtual Machines remain operational.

  • ESXi hosts exhibit a "flapping" behavior within the vCenter Server inventory, continuously cycling between "Not Responding" and "Responding" states. While the hosts remain accessible via SSH, restarting the management agents provides only temporary mitigation before the issue recurs.

  • The ESXi host management interface becomes unreachable, though the host remains responsive to ping and virtual machine workloads continue without interruption. This condition is observed to affect multiple hosts in sequence. Rebooting the affected host resolves the issue temporarily.
  • In /var/run/log/envoy.log on the ESXi host, entries similar to the following appear:
    YYYY-MM-DDTHH:MM:SS.Z In(166) envoy[2106882]: "YYYY-MM-DDTHH:MM:SS.Z warning envoy[2107532] [Originator@6876 sub=filter] [Tags: "ConnectionId":"######"] remote https connections exceed max allowed: 128"
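
To gauge how often the limit is being hit, the warning lines can be counted. The following is a minimal sketch, assuming a POSIX shell on the host; the function name count_envoy_limit_warnings is hypothetical and not part of ESXi, and the search string is taken from the log entry shown above.

```shell
# Count "exceed max allowed" warnings in an Envoy log file.
# Hypothetical helper for illustration; not an ESXi command.
count_envoy_limit_warnings() {
    # $1: log path (defaults to the path named in this article)
    log="${1:-/var/run/log/envoy.log}"
    # grep -c prints the number of matching lines; it exits non-zero
    # when the count is 0, so "|| true" keeps the helper from failing.
    grep -c "remote https connections exceed max allowed" "$log" || true
}

# Usage on a host (not run here):
#   count_envoy_limit_warnings
```

A count that keeps rising between two runs suggests the host is still being pushed over the session limit.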

Environment

  • VMware ESXi 7.x 
  • VMware ESXi 8.x
  • VMware vCenter Server 7.x 
  • VMware vCenter Server 8.x
  • Environments using replication services (such as Veeam Replication, vSphere Replication, or Nutanix CVM)

Cause

Replication services can open HTTPS sessions faster than they close them, so the number of open sessions grows beyond what the ESXi host's envoy service can handle. The envoy service has a limit of 128 concurrent remote HTTPS sessions. When this limit is exceeded, new connections are rejected, including those between vCenter Server and the host.
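
As a rough illustration of the limit, the sketch below compares the number of established Envoy connections against 128. It is an approximation under stated assumptions: the function name envoy_headroom is hypothetical, the input is assumed to be connection-list text in which Envoy lines contain the word "envoy" and a state of ESTABLISHED, and the count includes every Envoy connection rather than only remote HTTPS sessions.

```shell
# Limit stated in this article.
ENVOY_HTTPS_LIMIT=128

# Hypothetical helper: reads connection-list text on stdin and reports
# how many lines mention envoy and are in the ESTABLISHED state.
envoy_headroom() {
    # "|| true" because grep -c exits non-zero when it counts 0 lines.
    used=$(grep envoy | grep -c ESTABLISHED || true)
    echo "established=$used limit=$ENVOY_HTTPS_LIMIT remaining=$((ENVOY_HTTPS_LIMIT - used))"
}

# Usage on a host (not run here):
#   localcli network ip connection list | envoy_headroom
```

A small or negative "remaining" value is only a hint that the host is near the limit; the warning in envoy.log remains the authoritative indicator.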

Resolution

Immediate Work-around:

  1. Access the ESXi Host via SSH

    • Enable SSH on the affected ESXi host through the vSphere Client or Direct Console User Interface (DCUI).
    • Log in to the host as root using an SSH client.
  2. Identify the Source IP of Excessive Connections

    • Review and then run the following command to count sessions per source IP address on the management port (443):
      grep ":443" /var/run/log/envoy-access.log | cut -d' ' -f 15 | sort | uniq | cut -d ':' -f 1 | uniq -c
    • To view currently active Envoy connections, run:
      localcli network ip connection list | grep envoy
  3. Stop the Offending Service

    • Based on the identified IP, locate the corresponding service (e.g., replication appliance, backup server, or security scanner).
    • Temporarily disable the service, or only the specific high-frequency jobs, until the host is stable.
  4. Restart the Envoy Service

    • Execute the following command on the ESXi host to clear all active sessions and allow vCenter to reconnect:
      /etc/init.d/envoy restart
  5. Verify Connectivity

    • Monitor the vSphere Client to confirm the host returns to a "Connected" or "Responding" state.
    • Check /var/run/log/envoy.log to confirm that "remote https connections exceed max allowed: 128" warnings are no longer being logged.
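
When many source addresses appear in step 2, ranking them makes the offending client easier to spot. The sketch below wraps the pipeline from step 2 unchanged and sorts its output by count; the function name top_envoy_sources and its arguments are hypothetical, and it assumes the host shell provides sort -rn and head.

```shell
# Hypothetical helper: rank source IPs by session count, highest first.
# $1: access log path (defaults to the path used in step 2)
# $2: number of rows to show (defaults to 5)
top_envoy_sources() {
    log="${1:-/var/run/log/envoy-access.log}"
    top="${2:-5}"
    # Same pipeline as step 2, followed by a descending numeric sort.
    grep ":443" "$log" | cut -d' ' -f 15 | sort | uniq \
        | cut -d ':' -f 1 | uniq -c \
        | sort -rn | head -n "$top"
}

# Usage on a host (not run here):
#   top_envoy_sources
#   top_envoy_sources /var/run/log/envoy-access.log 10
```

The address at the top of the list is the first candidate to check against the replication, backup, or scanning systems mentioned in step 3.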

Long-term Solution:

  1. Update the replication software to the latest version.
  2. Configure replication jobs to limit the number of concurrent sessions.
  3. If the issue persists:
    1. Implement firewall rules to limit concurrent connections from the replication servers.
    2. Contact the replication software vendor for additional guidance.

Additional Information