Symptoms:
1. Host becomes unresponsive intermittently in vCenter
2. Host client cannot be accessed
3. Restarting all services on the host (services.sh restart) makes the host stable for some time before it goes into a not responding state again.
4. Powering off the vSphere Replication appliance stops the HTTPS requests made by VR, which returns the host to a normal state.
5. vCenter tasks fill up with the error - A generic error occurred in the vSphere Replication Management Server. Exception details: 'Unexpected status code: 503'.
/var/run/log/envoy-access.log :
YYYY-MM-DD In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 586 455 1 1 0 10.#.#.#:50190 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMS-PING" "Fetch"
YYYY-MM-DD In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 586 455 1 1 0 10.#.#.#:57578 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMS-PING" "Fetch"
YYYY-MM-DD In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 579 1096 1 1 0 10.#.#.#:51832 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMS-PING" "Fetch"
YYYY-MM-DD In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 586 455 1 1 0 10.#.#.#:51832 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMS-PING" "Fetch"
YYYY-MM-DD In(166) envoy-access[2098958]: POST /hbr HTTP/1.1 200 via_upstream - 579 1096 1 1 0 10.#.#.#:51846 TLSv1.2 10.#.#.#:443 - - /var/run/vmware/proxy-hbr "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMS-PING" "Fetch"
The log shows a large number of HMS-PING calls sent by VRMS to the host.
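To see which appliance address is generating the pings, the access log can be tallied per client. The following is a minimal sketch, not a VMware-provided tool: ping_sources is a hypothetical helper name, and it assumes the line layout shown in the excerpt above (the client "ip:port" field immediately precedes the TLS version field) and that awk and sort are available in the host shell.

```shell
# Tally HMS-PING requests per client address in an envoy access log.
# Assumption: the client "ip:port" field sits immediately before the
# TLS version field (for example TLSv1.2), as in the excerpt above.
# ping_sources is a hypothetical helper name, not a VMware tool.
ping_sources() {
    awk '/HMS-PING/ {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^TLSv/) {
                addr = $(i - 1)
                sub(/:[0-9]+$/, "", addr)   # drop the ephemeral source port
                hits[addr]++
                break
            }
        }
    }
    END { for (a in hits) print hits[a], a }' "${1:-/var/run/log/envoy-access.log}" | sort -rn
}
```

Running ping_sources with no argument reads /var/run/log/envoy-access.log and prints one "count address" line per client, highest count first.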
/opt/vmware/hms/logs/hms.log :
YYYY-MM-DD 20:#:#.# TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-0] (..net.impl.VmomiPingConnectionHandler) [operationID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMSINT-5641, operationID=fa551d0c-fbf5-4467-b566-94cf21633cb6-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
YYYY-MM-DD 20:#:#.# TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-7] (..net.impl.VmomiPingConnectionHandler) [operationID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMSINT-5641, operationID=c8b40c05-961a-4e47-8f4c-5c24ceaf6037-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
YYYY-MM-DD 20:#:#.# TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-1] (..net.impl.VmomiPingConnectionHandler) [operationID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMSINT-5641, operationID=59fdb338-3dba-4548-8b2a-05feadc7caf9-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
YYYY-MM-DD 20:#:#.# TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-8] (..net.impl.VmomiPingConnectionHandler) [operationID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMSINT-5641, operationID=f2e0d00f-3868-4de4-b79f-17fcd1f1b3fb-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
YYYY-MM-DD 20:#:#.# TRACE hms.net.hbr.ping.svr.4c4c4544-0036-5910-8058-b8c04f485132 [hms-ping-scheduled-thread-9] (..net.impl.VmomiPingConnectionHandler) [operationID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-HMSINT-5641, operationID=2246fc94-f9de-4e6a-80d5-1ef54e279fef-HMS-PING] | Session: N/A on server '10.#.#.#:443/hbr' pinged successfully
/var/run/log/envoy.log :
YYYY-MM-DD In(166) envoy[2098941]: "YYYY-MM-DD warning envoy[2099266] [Originator@6876 sub=filter] [C18705] closing connection TCP<10.#.#.#:33556, 10.#.#.#:443>"
YYYY-MM-DD In(166) envoy[2098941]: "YYYY-MM-DD warning envoy[2099264] [Originator@6876 sub=filter] [C18706] remote https connections exceed max allowed: 128"
YYYY-MM-DD In(166) envoy[2098941]: "YYYY-MM-DD warning envoy[2099264] [Originator@6876 sub=filter] [C18706] closing connection TCP<10.#.#.#:33564, 10.#.#.#:443>"
YYYY-MM-DD envoy[2098941]: "YYYY-MM-DD warning envoy[2099265] [Originator@6876 sub=filter] [C18707] remote https connections exceed max allowed: 128"
YYYY-MM-DD In(166) envoy[2098941]: "YYYY-MM-DD warning envoy[2099265] [Originator@6876 sub=filter] [C18707] closing connection TCP<10.#.#.#:33570, 10.#.#.#:443>"
YYYY-MM-DD In(166) envoy[2098941]: "YYYY-MM-DD warning envoy[2099264] [Originator@6876 sub=filter] [C18708] remote https connections exceed max allowed: 128"
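A quick way to confirm that the host is hitting the connection limit is to count these warnings. This is a minimal sketch that uses only grep; count_envoy_limit_hits is a hypothetical helper name, and the match string is taken from the excerpt above.

```shell
# Count envoy warnings about the remote HTTPS connection limit.
# count_envoy_limit_hits is a hypothetical helper name.
# Usage: count_envoy_limit_hits [LOGFILE]
count_envoy_limit_hits() {
    envoy_log="${1:-/var/run/log/envoy.log}"
    # grep -c prints the number of matching lines. It exits non-zero
    # when nothing matches, so "|| true" keeps the function from failing.
    grep -c 'remote https connections exceed max allowed' "$envoy_log" || true
}
```

A count that keeps growing between runs suggests the host is still refusing new HTTPS connections.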
vSphere Replication causes hosts to become unresponsive due to an authentication issue between HMS and hbrsrvuw on the ESXi host. This can randomly affect any ESXi host in the vCenter inventory.
The root cause is that getServers().registerHbrServer(hbrServerData) starts a new ReconnectingPing that is not cleaned up when an exception occurs. When HMS is unable to connect to hbrsrvuw, the ReconnectingPing is leaked, so the number of leaked ReconnectingPing instances increases every minute until the ESXi host becomes unavailable.
If you are running an older version of the vSphere Replication appliance, upgrade to this version as soon as possible to avoid this issue. Hosts have also been observed to go into a 'Not Responding' state even after the appliance is upgraded to this version; in such cases, only disable scale-out-mode on the VRMS at both sites.
Workaround:
The purpose of the workaround is to REMOVE all host based replication servers from the 'Replication Servers' tab in SRM UI. This will stop HMS from pinging the hbrsrvuw in the ESXi host.
NOTE: This fix has to be applied on replication servers running version 8.8.X at source and target sites.
1. SSH to the vSphere Replication appliance and stop the HMS service - systemctl stop hms
2. Edit /opt/vmware/hms/conf/hms-configuration.xml and change scale-out-mode to false.
<scale-out-mode>false</scale-out-mode>
<!--
Timeout to wait before tagging hbrsrv as decommissioned due to maintenance mode.
At the moment set to 0 since replications are auto-released on hbrsrvuw when ESX host enters MM.
-->
3. Log in to the VR database - /opt/vmware/hms/bin/embedded_db_connect.sh
4. Run the SQL commands -
NOTE: Please run these commands in the order they are mentioned below.
select hbrserver_movalue from hbrtagentity where hbrserver_movalue IN (select movalue from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue));
select hbrservername,movalue from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue);
delete from hbrtagentity where hbrserver_movalue IN (select movalue from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue));
delete from hbrserverentity where vsrv_port = 443 AND NOT EXISTS (select hbrserver_movalue from secondarygroupentity where hbrserverentity.movalue = secondarygroupentity.hbrserver_movalue);
5. Start HMS service - systemctl restart hms
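After completing the steps above, it may be worth confirming on the appliance that the configuration change is in place. This is a minimal sketch, not part of the documented procedure: scale_out_disabled is a hypothetical helper name, and it assumes the element is written on a single line as shown in step 2.

```shell
# Succeed only if scale-out-mode is set to false in the HMS configuration.
# scale_out_disabled is a hypothetical helper name.
# Usage: scale_out_disabled [CONFIG_FILE]
scale_out_disabled() {
    hms_conf="${1:-/opt/vmware/hms/conf/hms-configuration.xml}"
    # -E enables extended regex, -q suppresses output; the exit status
    # alone reports whether the element is present with the value false.
    grep -Eq '<scale-out-mode>[[:space:]]*false[[:space:]]*</scale-out-mode>' "$hms_conf"
}
```

For example, `scale_out_disabled && systemctl is-active hms` prints the service state only when the setting is false.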