SOS Health Check Task Failing Intermittently in SDDC Manager
Article ID: 407596


Products

VMware SDDC Manager
VMware vSphere ESXi

Issue/Introduction

  • When checking the /var/log/vmware/vcf/sddc-support/sos-<timestamp>/health-report.log generated for the failed task, entries similar to the following are seen:
    | XX |              ESXi : FQDN_OF ESXi_               |        Ping status         | GREEN |
    |     |                                                |  API Connectivity status   |  RED  |
    | XX |              ESXi : FQDN_OF ESXi_               | svc-vcf-FQDN_OF ESXi  |    MM DD YYYY   |    Never     |      Never      |         GREEN         |
    |     |                                                |           root          |         -         |      -       |        -        | Failed to get details |
    | XX |              ESXi : FQDN_OF ESXi_               | NTP Status | YELLOW |
    |     |                                                |  ESX Time  | YELLOW |
  • When checking /var/log/vmware/vcf/sddc-support/sos-<timestamp>/sos.log around the timestamp of the failed task, entries similar to the following are seen:
    YYYY-MM-DDTHH:MM:SS+0000 ERROR [vcf_sos] [vc.py::get_si::68::_parallel_password_checkThread6] (vim.fault.HostConnectFault) {
       dynamicType = <unset>,
       dynamicProperty = (vmodl.DynamicProperty) [],
       msg = '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)',
       faultCause = <unset>,
       faultMessage = (vmodl.LocalizableMessage) []
    }
    YYYY-MM-DDTHH:MM:SS+0000 ERROR [vcf_sos] [vc.py::get_si::69::_parallel_password_checkThread6] Traceback (most recent call last):
      File "utils/vc.py", line 104, in connect
      File "utils/vc.py", line 61, in get_si
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVim/connect.py", line 281, in Connect
        si, stub = __Login(host,
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVim/connect.py", line 431, in __Login
        content = si.RetrieveContent()
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVmomi/VmomiSupport.py", line 586, in <lambda>
        self.f(*(self.args + (obj,) + args), **kwargs)
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVmomi/VmomiSupport.py", line 376, in _InvokeMethod
        return self._stub.InvokeMethod(self, info, args)
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVmomi/SoapAdapter.py", line 1549, in InvokeMethod
        raise obj  # pylint: disable-msg=E0702
    pyVmomi.VmomiSupport.vmodl.RuntimeFault: (vmodl.RuntimeFault) {
       dynamicType = <unset>,
       dynamicProperty = (vmodl.DynamicProperty) [],
       msg = 'Unsupported version URI "urn:vpxd3/8.0.0.1"\n\nwhile parsing SOAP body\nat line 3, column 0\n\nwhile parsing SOAP envelope\nat line 2, column 0\n\nwhile parsing HTTP request before method was determined\nat line 1, column 0',
       faultCause = <unset>,
       faultMessage = (vmodl.LocalizableMessage) []
    }
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "utils/vc.py", line 61, in get_si
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVim/connect.py", line 281, in Connect
        si, stub = __Login(host,
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVim/connect.py", line 435, in __Login
        raise vim.fault.HostConnectFault(msg=str(e))
    pyVmomi.VmomiSupport.vim.fault.HostConnectFault: (vim.fault.HostConnectFault) {
       dynamicType = <unset>,
       dynamicProperty = (vmodl.DynamicProperty) [],
       msg = '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)',
       faultCause = <unset>,
       faultMessage = (vmodl.LocalizableMessage) []
    }
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "health/helper/esxihelper.py", line 2295, in _parallel_password_check
      File "utils/vc.py", line 106, in connect
      File "utils/vc.py", line 72, in get_si
    AssertionError: Unable to connect to host FQDN_OF ESXi_
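
    The two logs above can be scanned from the SDDC Manager shell to spot these failures quickly. A minimal sketch; the /tmp/sos-demo fixture and its sample rows are illustrative only, and on a live appliance the commented glob line selects the newest real bundle instead:

    ```shell
    # On a live SDDC Manager, pick the newest bundle instead of the fixture:
    # SOS_DIR=$(ls -dt /var/log/vmware/vcf/sddc-support/sos-* | head -1)
    SOS_DIR=/tmp/sos-demo                     # scratch fixture for illustration
    mkdir -p "$SOS_DIR"
    printf '%s\n' \
      '|    |  | API Connectivity status |  RED  |' \
      '|    |  | NTP Status | YELLOW |' \
      '| XX |  | Ping status | GREEN |' > "$SOS_DIR/health-report.log"

    # Non-GREEN rows in the health report:
    grep -E '\| *(RED|YELLOW) *\|' "$SOS_DIR/health-report.log"
    # Connection errors in sos.log (same pattern applies on the appliance):
    # grep -E 'HostConnectFault|UNEXPECTED_EOF_WHILE_READING' "$SOS_DIR/sos.log"
    ```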

Environment

VMware Cloud Foundation
VMware vSphere ESXi

Cause

  • The issue occurs when the envoy proxy service on the affected ESXi host reaches the maximum number of connections it can support. In /var/log/envoy.log you will notice the below warnings:
    YYYY-MM-DDTHH:MM:SSZ In(166) envoy[2099870]: "YYYY-MM-DDTHH:MM:SSZ warning envoy[2100520] [Originator@6876 sub=filter] [Tags: "ConnectionId":"#########"] remote https connections exceed max allowed: 128"
    YYYY-MM-DDTHH:MM:SSZ In(166) envoy[2099870]: "YYYY-MM-DDTHH:MM:SSZ warning envoy[2100520] [Originator@6876 sub=filter] [Tags: "ConnectionId":"#########"] remote https connections exceed max allowed: 128"
  • The maximum connection limit is reached because vSphere Replication does not close the HTTP connections that are created as part of the health checks run during the configuration of enhanced replication:
    Proto  RecvQ  SendQ  Local Address        Foreign Address             State    World ID   CC Algo  World Name
    -----  ------  ------  -------------------  -------------------     -----------  --------  -------  ----------
    tcp         0       0  10.176.###.###:443    10.176.###.###:54108   ESTABLISHED  35101291  newreno  envoy
    tcp         0       0  10.176.###.###:443    10.176.###.###:54100   ESTABLISHED  35101291  newreno  envoy
    tcp         0       0  10.176.###.###:443    10.176.###.###:54084   ESTABLISHED  35101291  newreno  envoy
    tcp         0       0  10.176.###.###:443    10.176.###.###:54080   ESTABLISHED  35101290  newreno  envoy
  • In /var/log/envoy-access.log on the affected ESXi host, you will notice connections like the following that remain open for hours:
    YYYY-MM-DDTHH:MM:SSZ In(166) envoy-access[########]: GET /hbragent/api/v1.0/appPing?broker_ip=10.191.##.##&broker_port=32032&group=PING-GID-5243c529-######-##### 200 via_upstream - 0 387 - 107 106 0 10.###.###.###:34164 HTTP/1.1 TLSv1.2 10.###.###.###:443 - HTTP/1.1 - /var/run/vmware/hbragent-rest-tunnel - -
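
    The connection listing above can be filtered to count how close envoy is to the 128-connection ceiling. A sketch using sample rows (the 192.0.2.x addresses are placeholders); on the affected host, pipe the real `esxcli network ip connection list` output into the same awk filter:

    ```shell
    # Sample rows shaped like `esxcli network ip connection list` output
    # (Proto, Recv Q, Send Q, Local, Foreign, State, World ID, CC Algo, World Name).
    esxcli_output() {
    cat <<'EOF'
    tcp 0 0 192.0.2.10:443 192.0.2.20:54108 ESTABLISHED 35101291 newreno envoy
    tcp 0 0 192.0.2.10:443 192.0.2.20:54100 ESTABLISHED 35101291 newreno envoy
    tcp 0 0 192.0.2.10:443 192.0.2.20:54084 ESTABLISHED 35101291 newreno envoy
    EOF
    }
    # Count ESTABLISHED connections owned by envoy; compare against the
    # 128 limit reported in /var/log/envoy.log.
    esxcli_output | awk '$6 == "ESTABLISHED" && $NF == "envoy"' | wc -l
    ```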

Resolution

Broadcom is aware of this issue and is working on a fix.

Workaround:

  1. Open an SSH session to the VRMS server at both sites.
  2. Open the file /opt/vmware/hms/conf/hms-configuration.xml in a text editor.
  3. Set schedule-health-checks to false.
  4. Restart the HMS service on both sites:
    systemctl restart hms
  5. While configuring enhanced replication, skip the health check. Clicking the "Next" button will allow you to proceed with the replication configuration without performing the health check.
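
Steps 2-4 above can be scripted. A minimal sketch, assuming the setting appears as a <schedule-health-checks> XML element (verify the exact element in your file first); it is shown against a scratch copy so nothing on the appliance is touched, and on the VRMS appliance the real path is /opt/vmware/hms/conf/hms-configuration.xml:

```shell
# Scratch copy for illustration; substitute the real path on the appliance.
CONF=/tmp/hms-configuration.xml
printf '<config><schedule-health-checks>true</schedule-health-checks></config>\n' > "$CONF"

cp "$CONF" "$CONF.bak"                      # keep a backup before editing
# Flip the assumed <schedule-health-checks> element from true to false:
sed -i 's|<schedule-health-checks>true|<schedule-health-checks>false|' "$CONF"
grep -o '<schedule-health-checks>[a-z]*' "$CONF"
# Then restart HMS on both sites:
# systemctl restart hms
```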