vSphere Replication sending large number of HTTPS requests to envoy proxy causing hostd to crash - remote https connections exceed max allowed

Article ID: 312726


Updated On:

Products

VMware Live Recovery

Issue/Introduction

Symptoms:


1. Host becomes unresponsive intermittently in vCenter

2. The ESXi Host Client cannot be accessed

3. Restarting all services on the host (services.sh restart) stabilizes the host for some time before it goes into a Not Responding state again

4. Powering off the vSphere Replication appliance stops the HTTPS requests made by VR, thereby returning the host to a normal state

5. vCenter tasks fill up with the error - A generic error occurred in the vSphere Replication Management Server. Exception details: 'Unexpected status code: 503'.

/var/run/log/envoy-access.log : 

2023-11-29T20:31:32.921Z In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 586 455 1 1 0 10.#.#.#:50190 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "fa8aee91-b549-44e1-bff3-d1544d371df9-HMS-PING" "Fetch"
2023-11-29T20:31:33.144Z In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 586 455 1 1 0 10.#.#.#:57578 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "6c808c74-dbfd-4c80-a0f9-53108b26a4b5-HMS-PING" "Fetch"
2023-11-29T20:31:33.193Z In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 579 1096 1 1 0 10.#.#.#:51832 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "aaf42441-4005-416b-b4f9-7914b73ea68b-HMS-PING" "Fetch"
2023-11-29T20:31:33.395Z In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 586 455 1 1 0 10.#.#.#:51832 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "aaf42441-4005-416b-b4f9-7914b73ea68b-HMS-PING" "Fetch"
2023-11-29T20:31:33.738Z In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 579 1096 1 1 0 10.#.#.#:51846 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "f8342810-56b1-4fc8-8e3f-d8d270f215ce-HMS-PING" "Fetch"

The access log shows a large number of HMS-PING calls received from VRMS.
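
To gauge the volume on an affected host, the HMS-PING requests in this log can be counted directly. A minimal check, assuming the standard ESXi shell utilities (the per-second tally groups on the first 19 characters of the timestamp):

# Total HMS-PING requests recorded in the envoy access log
grep -c 'HMS-PING' /var/run/log/envoy-access.log

# HMS-PING requests per second (groups on the timestamp through the seconds field)
grep 'HMS-PING' /var/run/log/envoy-access.log | cut -c1-19 | sort | uniq -c | tail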

/opt/vmware/hms/logs/hms.log : 

2023-11-29 20:36:58.135 TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-0] (..net.impl.VmomiPingConnectionHandler) [operationID=815efc5f-85f9-4d71-88b3-473d9d86cdbe-HMSINT-5641, operationID=fa551d0c-fbf5-4467-b566-94cf21633cb6-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
2023-11-29 20:36:58.135 TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-7] (..net.impl.VmomiPingConnectionHandler) [operationID=815efc5f-85f9-4d71-88b3-473d9d86cdbe-HMSINT-5641, operationID=c8b40c05-961a-4e47-8f4c-5c24ceaf6037-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
2023-11-29 20:36:58.135 TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-1] (..net.impl.VmomiPingConnectionHandler) [operationID=815efc5f-85f9-4d71-88b3-473d9d86cdbe-HMSINT-5641, operationID=59fdb338-3dba-4548-8b2a-05feadc7caf9-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
2023-11-29 20:36:58.135 TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-8] (..net.impl.VmomiPingConnectionHandler) [operationID=815efc5f-85f9-4d71-88b3-473d9d86cdbe-HMSINT-5641, operationID=f2e0d00f-3868-4de4-b79f-17fcd1f1b3fb-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
2023-11-29 20:36:58.170 TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-9] (..net.impl.VmomiPingConnectionHandler) [operationID=815efc5f-85f9-4d71-88b3-473d9d86cdbe-HMSINT-5641, operationID=2246fc94-f9de-4e6a-80d5-1ef54e279fef-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully

/var/run/log/envoy.log : 

2023-11-29T19:51:28.438Z In(166) envoy[2098941]: "2023-11-29T19:51:23.497Z warning envoy[2099266] [Originator@6876 sub=filter] [C18705] closing connection TCP<10.#.#.#:33556, 10.#.#.#:443>"
2023-11-29T19:51:28.438Z In(166) envoy[2098941]: "2023-11-29T19:51:23.530Z warning envoy[2099264] [Originator@6876 sub=filter] [C18706] remote https connections exceed max allowed: 128"
2023-11-29T19:51:28.438Z In(166) envoy[2098941]: "2023-11-29T19:51:23.530Z warning envoy[2099264] [Originator@6876 sub=filter] [C18706] closing connection TCP<10.#.#.#:33564, 10.#.#.#:443>"

2023-11-29T19:51:28.438Z In(166) envoy[2098941]: "2023-11-29T19:51:23.552Z warning envoy[2099265] [Originator@6876 sub=filter] [C18707] remote https connections exceed max allowed: 128"
2023-11-29T19:51:28.438Z In(166) envoy[2098941]: "2023-11-29T19:51:23.552Z warning envoy[2099265] [Originator@6876 sub=filter] [C18707] closing connection TCP<10.#.#.#:33570, 10.#.#.#:443>"
2023-11-29T19:51:28.438Z In(166) envoy[2098941]: "2023-11-29T19:51:23.989Z warning envoy[2099264] [Originator@6876 sub=filter] [C18708] remote https connections exceed max allowed: 128"
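
The '128' in these messages is the maximum number of remote HTTPS connections the host proxy allows. To compare the live connection count against it, a rough check from an ESXi shell (this counts every established connection involving port 443, so it can slightly overcount inbound HTTPS on a busy host):

esxcli network ip connection list | grep ':443' | grep -c ESTABLISHED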

 

Environment

VMware Site Recovery Manager 8.x

Cause


vSphere Replication causes hosts to become unresponsive due to an authentication issue between HMS and hbrsrvuw on the ESXi host. This can randomly affect any ESXi host in the vCenter inventory.

The root cause is that getServers().registerHbrServer(hbrServerData) starts a new ReconnectingPing but does not clean it up when an exception occurs. A ReconnectingPing is leaked each time HMS is unable to connect to hbrsrvuw, so the number of leaked pings keeps growing every minute until the host's proxy connection limit is exhausted and the ESXi host becomes unavailable.
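
Because every failed registration leaks another ReconnectingPing, the ping volume visible in hms.log grows over time. A per-minute tally of HMS-PING entries makes the growth pattern easy to spot (run on the VR appliance; the cut range covers the timestamp up to the minutes field):

grep 'HMS-PING' /opt/vmware/hms/logs/hms.log | cut -c1-16 | sort | uniq -c | tail -20

A steadily climbing per-minute count is consistent with this leak.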

Resolution

This issue has been fixed in vSphere Replication 9.0.0.1 | 19 APR 2024 | Build 23690274

If you are running an older version of the vSphere Replication appliance, please upgrade to this version at the earliest to avoid this issue. It has also been observed that hosts can still go into a 'Not Responding' state despite upgrading the appliance to this version; in such cases, only disable scale-out-mode on the VRMS at both sites.

Workaround:

The purpose of the workaround is to REMOVE all host-based replication servers from the 'Replication Servers' tab in the SRM UI. This stops HMS from pinging hbrsrvuw on the ESXi hosts.

NOTE: This workaround must be applied on replication servers running version 8.8.x at both the source and target sites.

1. SSH to the vSphere Replication appliance and stop the HMS service - systemctl stop hms
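
Before editing the configuration, confirm the service is actually stopped:

systemctl is-active hms

The expected output is 'inactive'.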

2. Edit /opt/vmware/hms/conf/hms-configuration.xml and change scale-out-mode to false.

<scale-out-mode>false</scale-out-mode>
<!--
Timeout to wait before tagging hbrsrv as decommissioned due to maintenance mode.
At the moment set to 0 since replications are auto-released on hbrsrvuw when ESX host enters MM.
-->
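
As an alternative to hand-editing the file, the same change can be made with a one-line sed, assuming the tag currently reads 'true' (take a backup of the file first):

cp /opt/vmware/hms/conf/hms-configuration.xml /opt/vmware/hms/conf/hms-configuration.xml.bak
sed -i 's|<scale-out-mode>true</scale-out-mode>|<scale-out-mode>false</scale-out-mode>|' /opt/vmware/hms/conf/hms-configuration.xml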

3. Log in to the VR database - /opt/vmware/hms/bin/embedded_db_connect.sh

4. Run the following SQL commands - 

NOTE: Run these commands in the order they are listed below. 

First, query the hbrsrvuw entries that have no replications from the hbrtagentity and hbrserverentity tables:

select hbrserver_movalue from hbrtagentity where hbrserver_movalue IN (select movalue from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue));

select hbrservername,movalue from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue);


Then delete the hbrsrvuw entries that have no replications from the hbrtagentity and hbrserverentity tables:

delete from hbrtagentity where hbrserver_movalue IN (select movalue from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue));

delete from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue);
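
Before exiting the database session, the cleanup can be verified by re-running the hbrserverentity SELECT from above wrapped in a count; it should now return 0:

select count(*) from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue);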


5. Start HMS service - systemctl restart hms
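
Once HMS is back up, you can confirm the ping storm has stopped by watching the envoy access log on a previously affected host; no new HMS-PING entries should appear:

tail -f /var/run/log/envoy-access.log | grep HMS-PING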