vSphere Replication sending large number of HTTPS requests to envoy proxy causing hostd to crash - remote https connections exceed max allowed

Products

VMware Live Recovery

VMware vSphere ESXi

Issue/Introduction

Symptoms:

1. Host becomes unresponsive intermittently in vCenter.

2. Host becomes disconnected in vCenter.

3. Host client cannot be accessed.

4. Reconnecting the host or adding it to vCenter fails with an error.

5. Restarting all services on the host (services.sh restart) keeps the host stable for some time before it goes into a not responding state again.

6. Powering off the vSphere Replication appliance stops the HTTPS requests made by VR and returns the host to a normal state.

7. vCenter tasks fill up with the error: A generic error occurred in the vSphere Replication Management Server. Exception details: 'Unexpected status code: 503'.

/var/run/log/envoy-access.log :

YYYY-MM-DD In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 586 455 1 1 0 10.#.#.#:50190 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMS-PING" "Fetch"
YYYY-MM-DD In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 586 455 1 1 0 10.#.#.#:57578 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMS-PING" "Fetch"
YYYY-MM-DD In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 579 1096 1 1 0 10.#.#.#:51832 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMS-PING" "Fetch"
YYYY-MM-DD In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 586 455 1 1 0 10.#.#.#:51832 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMS-PING" "Fetch"
YYYY-MM-DD In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 579 1096 1 1 0 10.#.#.#:51846 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMS-PING" "Fetch"

Log shows a large number of HMS-PING calls received by VRMS.
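To gauge how many HMS-PING requests each client is sending, the access log can be summarized per source address. The following is a minimal sketch, not a supported tool: the function name count_hms_pings is hypothetical, and it assumes that the first "IPv4:port" token on each line is the client address, as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: count HMS-PING requests per client IP in an envoy access log.
# Assumption: the first "a.b.c.d:port" token on a line is the client address.
count_hms_pings() {
    log_file="$1"
    awk '/HMS-PING/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+$/) {
                split($i, parts, ":")   # parts[1] = IP, parts[2] = port
                count[parts[1]]++
                break                   # ignore the destination address that follows
            }
        }
    }
    END { for (ip in count) print count[ip], ip }' "$log_file" | sort -rn
}

# Example, run on the ESXi host:
# count_hms_pings /var/run/log/envoy-access.log
```

Each output line is a request count followed by the client IP, with the busiest client first.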

/opt/vmware/hms/logs/hms.log :

YYYY-MM-DD 20:#:#.# TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-0] (..net.impl.VmomiPingConnectionHandler) [operationID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMSINT-5641, operationID=fa551d0c-fbf5-4467-b566-94cf21633cb6-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
YYYY-MM-DD 20:#:#.# TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-7] (..net.impl.VmomiPingConnectionHandler) [operationID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMSINT-5641, operationID=c8b40c05-961a-4e47-8f4c-5c24ceaf6037-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
YYYY-MM-DD 20:#:#.# TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-1] (..net.impl.VmomiPingConnectionHandler) [operationID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMSINT-5641, operationID=59fdb338-3dba-4548-8b2a-05feadc7caf9-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
YYYY-MM-DD 20:#:#.# TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-8] (..net.impl.VmomiPingConnectionHandler) [operationID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMSINT-5641, operationID=f2e0d00f-3868-4de4-b79f-17fcd1f1b3fb-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
YYYY-MM-DD 20:#:#.# TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-9] (..net.impl.VmomiPingConnectionHandler) [operationID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMSINT-5641, operationID=2246fc94-f9de-4e6a-80d5-1ef54e279fef-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully

/var/run/log/envoy.log :

YYYY-MM-DD In(166) envoy[2098941]: "YYYY-MM-DD warning envoy[2099266] [Originator@6876 sub=filter] [C18705] closing connection TCP<10.#.#.#:33556, 10.#.#.#:443>"
YYYY-MM-DD In(166) envoy[2098941]: "YYYY-MM-DD warning envoy[2099264] [Originator@6876 sub=filter] [C18706] remote https connections exceed max allowed: 128"
YYYY-MM-DD In(166) envoy[2098941]: "YYYY-MM-DD warning envoy[2099264] [Originator@6876 sub=filter] [C18706] closing connection TCP<10.#.#.#:33564, 10.#.#.#:443>"
YYYY-MM-DD envoy[2098941]: "YYYY-MM-DD warning envoy[2099265] [Originator@6876 sub=filter] [C18707] remote https connections exceed max allowed: 128"
YYYY-MM-DD In(166) envoy[2098941]: "YYYY-MM-DD warning envoy[2099265] [Originator@6876 sub=filter] [C18707] closing connection TCP<10.#.#.#:33570, 10.#.#.#:443>"
YYYY-MM-DD In(166) envoy[2098941]: "YYYY-MM-DD warning envoy[2099264] [Originator@6876 sub=filter] [C18708] remote https connections exceed max allowed: 128"
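A quick way to confirm that a host is hitting the connection limit is to count these warnings. This is a minimal sketch: the function name count_conn_limit_hits is hypothetical, and the search string is taken verbatim from the log excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: count "exceed max allowed" warnings in an envoy log.
count_conn_limit_hits() {
    log_file="$1"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'remote https connections exceed max allowed' "$log_file"
}

# Example, run on the ESXi host:
# count_conn_limit_hits /var/run/log/envoy.log
```

A count that keeps growing between runs suggests the host is still rejecting connections.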

Environment

VMware vSphere Replication 8.x, 9.0

VMware vCenter 8.x

Cause

vSphere Replication causes the hosts to become unresponsive due to an authentication issue between the hms and hbrsrvuw in ESXi. This can randomly affect any ESXi host in the vCenter inventory.

The root cause is that getServers().registerHbrServer(hbrServerData) starts a new ReconnectingPing but does not clean it up when an exception occurs. A ReconnectingPing is therefore leaked whenever HMS is unable to connect to hbrsrvuw, the number of leaked instances grows every minute, and the ESXi host eventually becomes unavailable.

Resolution

This issue is fixed in vSphere Replication 9.0.1 (25 JUN 2024, Build 24037980).

If you are running an older version of the vSphere Replication appliance, upgrade to this version as soon as possible to avoid this issue. In some cases, hosts still go into a 'Not Responding' state after the appliance is upgraded to this version. In such cases, disable only scale-out-mode on the VRMS at both sites.

Workaround:

The purpose of the workaround is to REMOVE all host based replication servers from the 'Replication Servers' tab in SRM UI. This will stop HMS from pinging the hbrsrvuw in the ESXi host.

🚨 This fix has to be applied on replication servers running version 8.8.X at both source and target sites. We are occasionally seeing 9.x releases also affected, so please apply the fix mentioned here.

1. SSH to vSphere Replication appliance. Run the command - systemctl stop hms

2. Edit /opt/vmware/hms/conf/hms-configuration.xml and change scale-out-mode to false.

<scale-out-mode>false</scale-out-mode>


NOTE: In vSphere Replication 9.0 and later, this option is not compatible with enhanced replication. If scale-out-mode is set to true (enhanced replication is enabled) and hms-embedded-hbr is set to false, an error containing the phrase "Unable to connect to the HBR Management Server" is observed.

3. Login to VR Database - /opt/vmware/hms/bin/embedded_db_connect.sh

4. Run the SQL commands -

NOTE: Please run these commands in the order they are mentioned below.

Query the hbrtagentity and hbrserverentity tables for hbrsrvuw entries that have no replications.

select hbrserver_movalue from hbrtagentity where hbrserver_movalue IN (select movalue from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue));

select hbrservername,movalue from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue);

Delete the hbrsrvuw entries that have no replications from the hbrtagentity and hbrserverentity tables.

delete from hbrtagentity where hbrserver_movalue IN (select movalue from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue));

delete from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue);

5. Start HMS service - systemctl restart hms
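The configuration change in step 2 can also be scripted. The following is a minimal sketch under stated assumptions, not a supported procedure: the function name set_scale_out_mode_false is hypothetical, and it assumes the scale-out-mode element sits on a single line, as shown in step 2. It keeps a backup copy of the file next to the original.

```shell
#!/bin/sh
# Hypothetical helper: set <scale-out-mode> to false in an HMS configuration file.
# Assumption: the element is written on a single line.
set_scale_out_mode_false() {
    conf_file="$1"
    # Keep a copy of the original so the change can be reverted.
    cp -p "$conf_file" "$conf_file.bak" || return 1
    # Write from the backup into the original path so the file keeps its owner and mode.
    sed 's#<scale-out-mode>[^<]*</scale-out-mode>#<scale-out-mode>false</scale-out-mode>#' \
        "$conf_file.bak" > "$conf_file" || return 1
    # Print the resulting line for confirmation.
    grep 'scale-out-mode' "$conf_file"
}

# Example, run on the appliance while the hms service is stopped (between steps 1 and 5):
# set_scale_out_mode_false /opt/vmware/hms/conf/hms-configuration.xml
```

Review the printed line before continuing. To revert, copy the .bak file back over the configuration file.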