SOS Health Check Task Failing Intermittently in SDDC Manager
Article ID: 407596


Products

VMware SDDC Manager
VMware vSphere ESXi

Issue/Introduction

  • When checking the /var/log/vmware/vcf/sddc-support/sos-<timestamp>/health-report.log generated for the failed task, entries similar to the following are seen:
    | XX |              ESXi : FQDN_OF ESXi_               |        Ping status         | GREEN |
    |     |                                                |  API Connectivity status   |  RED  |
    | XX |              ESXi : FQDN_OF ESXi_               | svc-vcf-FQDN_OF ESXi  |    MM DD YYYY   |    Never     |      Never      |         GREEN         |
    |     |                                                |           root          |         -         |      -       |        -        | Failed to get details |
    | XX |              ESXi : FQDN_OF ESXi_               | NTP Status | YELLOW |
    |     |                                                |  ESX Time  | YELLOW |
  • When checking /var/log/vmware/vcf/sddc-support/sos-<timestamp>/sos.log around the timestamp of the failed task, entries similar to the following are seen:
    YYYY-MM-DDTHH:MM:SS+0000 ERROR [vcf_sos] [vc.py::get_si::68::_parallel_password_checkThread6] (vim.fault.HostConnectFault) {
       dynamicType = <unset>,
       dynamicProperty = (vmodl.DynamicProperty) [],
       msg = '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)',
       faultCause = <unset>,
       faultMessage = (vmodl.LocalizableMessage) []
    }
    YYYY-MM-DDTHH:MM:SS+0000 ERROR [vcf_sos] [vc.py::get_si::69::_parallel_password_checkThread6] Traceback (most recent call last):
      File "utils/vc.py", line 104, in connect
      File "utils/vc.py", line 61, in get_si
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVim/connect.py", line 281, in Connect
        si, stub = __Login(host,
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVim/connect.py", line 431, in __Login
        content = si.RetrieveContent()
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVmomi/VmomiSupport.py", line 586, in <lambda>
        self.f(*(self.args + (obj,) + args), **kwargs)
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVmomi/VmomiSupport.py", line 376, in _InvokeMethod
        return self._stub.InvokeMethod(self, info, args)
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVmomi/SoapAdapter.py", line 1549, in InvokeMethod
        raise obj  # pylint: disable-msg=E0702
    pyVmomi.VmomiSupport.vmodl.RuntimeFault: (vmodl.RuntimeFault) {
       dynamicType = <unset>,
       dynamicProperty = (vmodl.DynamicProperty) [],
       msg = 'Unsupported version URI "urn:vpxd3/8.0.0.1"\n\nwhile parsing SOAP body\nat line 3, column 0\n\nwhile parsing SOAP envelope\nat line 2, column 0\n\nwhile parsing HTTP request before method was determined\nat line 1, column 0',
       faultCause = <unset>,
       faultMessage = (vmodl.LocalizableMessage) []
    }
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "utils/vc.py", line 61, in get_si
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVim/connect.py", line 281, in Connect
        si, stub = __Login(host,
      File "/opt/vmware/sddc-support/services/../framework/../dependency/pyVpx/pyVim/connect.py", line 435, in __Login
        raise vim.fault.HostConnectFault(msg=str(e))
    pyVmomi.VmomiSupport.vim.fault.HostConnectFault: (vim.fault.HostConnectFault) {
       dynamicType = <unset>,
       dynamicProperty = (vmodl.DynamicProperty) [],
       msg = '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)',
       faultCause = <unset>,
       faultMessage = (vmodl.LocalizableMessage) []
    }
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "health/helper/esxihelper.py", line 2295, in _parallel_password_check
      File "utils/vc.py", line 106, in connect
      File "utils/vc.py", line 72, in get_si
    AssertionError: Unable to connect to host FQDN_OF ESXi_
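
    The two logs above can be scanned from the SDDC Manager shell to spot these failures quickly. A minimal sketch; the /tmp/sos-demo fixture and its sample rows are illustrative only, and on a live appliance the commented glob line selects the newest real bundle instead:

    ```shell
    # On a live SDDC Manager, pick the newest bundle instead of the fixture:
    # SOS_DIR=$(ls -dt /var/log/vmware/vcf/sddc-support/sos-* | head -1)
    SOS_DIR=/tmp/sos-demo                     # scratch fixture for illustration
    mkdir -p "$SOS_DIR"
    printf '%s\n' \
      '|    |  | API Connectivity status |  RED  |' \
      '|    |  | NTP Status | YELLOW |' \
      '| XX |  | Ping status | GREEN |' > "$SOS_DIR/health-report.log"

    # Non-GREEN rows in the health report:
    grep -E '\| *(RED|YELLOW) *\|' "$SOS_DIR/health-report.log"
    # Connection errors in sos.log (same pattern applies on the appliance):
    # grep -E 'HostConnectFault|UNEXPECTED_EOF_WHILE_READING' "$SOS_DIR/sos.log"
    ```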

Environment

VMware Cloud Foundation
VMware vSphere ESXi

Cause

  • The issue occurs when the envoy proxy service on the affected ESXi host reaches the maximum number of connections it can support. In /var/log/envoy.log you will notice the below warnings:
    YYYY-MM-DDTHH:MM:SSZ In(166) envoy[2099870]: "YYYY-MM-DDTHH:MM:SSZ warning envoy[2100520] [Originator@6876 sub=filter] [Tags: "ConnectionId":"#########"] remote https connections exceed max allowed: 128"
    YYYY-MM-DDTHH:MM:SSZ In(166) envoy[2099870]: "YYYY-MM-DDTHH:MM:SSZ warning envoy[2100520] [Originator@6876 sub=filter] [Tags: "ConnectionId":"#########"] remote https connections exceed max allowed: 128"
  • The maximum connection limit is reached because vSphere Replication does not close the HTTP connections that are created as part of the health checks run during the configuration of enhanced replication:
    Proto  RecvQ  SendQ  Local Address        Foreign Address             State    World ID   CC Algo  World Name
    -----  ------  ------  -------------------  -------------------     -----------  --------  -------  ----------
    tcp         0       0  10.176.###.###:443    10.176.###.###:54108   ESTABLISHED  35101291  newreno  envoy
    tcp         0       0  10.176.###.###:443    10.176.###.###:54100   ESTABLISHED  35101291  newreno  envoy
    tcp         0       0  10.176.###.###:443    10.176.###.###:54084   ESTABLISHED  35101291  newreno  envoy
    tcp         0       0  10.176.###.###:443    10.176.###.###:54080   ESTABLISHED  35101290  newreno  envoy
  • In /var/log/envoy-access.log on the affected ESXi host, you will notice connections like the following that remain open for hours:
    YYYY-MM-DDTHH:MM:SSZ In(166) envoy-access[########]: GET /hbragent/api/v1.0/appPing?broker_ip=10.191.##.##&broker_port=32032&group=PING-GID-5243c529-######-##### 200 via_upstream - 0 387 - 107 106 0 10.###.###.###:34164 HTTP/1.1 TLSv1.2 10.###.###.###:443 - HTTP/1.1 - /var/run/vmware/hbragent-rest-tunnel - -
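
    The connection listing above can be filtered to count how close envoy is to the 128-connection ceiling. A sketch using sample rows (the 192.0.2.x addresses are placeholders); on the affected host, pipe the real `esxcli network ip connection list` output into the same awk filter:

    ```shell
    # Sample rows shaped like `esxcli network ip connection list` output
    # (Proto, Recv Q, Send Q, Local, Foreign, State, World ID, CC Algo, World Name).
    esxcli_output() {
    cat <<'EOF'
    tcp 0 0 192.0.2.10:443 192.0.2.20:54108 ESTABLISHED 35101291 newreno envoy
    tcp 0 0 192.0.2.10:443 192.0.2.20:54100 ESTABLISHED 35101291 newreno envoy
    tcp 0 0 192.0.2.10:443 192.0.2.20:54084 ESTABLISHED 35101291 newreno envoy
    EOF
    }
    # Count ESTABLISHED connections owned by envoy; compare against the
    # 128 limit reported in /var/log/envoy.log.
    esxcli_output | awk '$6 == "ESTABLISHED" && $NF == "envoy"' | wc -l
    ```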

Resolution

Broadcom is aware of this issue and is working on a fix.

Workaround:

  1. Open an SSH session to the VRMS server at both sites.
  2. Open the file /opt/vmware/hms/conf/hms-configuration.xml in a text editor.
  3. Set schedule-health-checks to false.
  4. Restart the HMS service on both sites:
    systemctl restart hms
  5. While configuring enhanced replication, skip the health check. Clicking the "Next" button will allow you to proceed with the replication configuration without performing the health check.
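
Steps 2-4 above can be scripted. A minimal sketch, assuming the setting appears as a <schedule-health-checks> XML element (verify the exact element in your file first); it is shown against a scratch copy so nothing on the appliance is touched, and on the VRMS appliance the real path is /opt/vmware/hms/conf/hms-configuration.xml:

```shell
# Scratch copy for illustration; substitute the real path on the appliance.
CONF=/tmp/hms-configuration.xml
printf '<config><schedule-health-checks>true</schedule-health-checks></config>\n' > "$CONF"

cp "$CONF" "$CONF.bak"                      # keep a backup before editing
# Flip the assumed <schedule-health-checks> element from true to false:
sed -i 's|<schedule-health-checks>true|<schedule-health-checks>false|' "$CONF"
grep -o '<schedule-health-checks>[a-z]*' "$CONF"
# Then restart HMS on both sites:
# systemctl restart hms
```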