ESXi hosts go into a Not Responding state during the pre-check of the NSX upgrade from SDDC Manager.

Article ID: 425248

Updated On:

Products

VMware SDDC Manager

Issue/Introduction

  • When pre-upgrade checks run and NSX Edge upgrades are started from SDDC Manager, the hosts in the environment go into a Not Responding state in vCenter.
  • Restarting the envoy service on ESXi temporarily resolves the issue.
    • /etc/init.d/envoy restart
  • The Edge upgrade fails at the vmReconfigure step, with the following exception in upgrade-coordinator.log on the NSX Manager node:
    • YYYY-MM-DDTHH:MM:SS.Z INFO task-executor-5-1-workitem-EDGE-1b###f05-24dc-42f2-####-659b4ff33475 EdgeNodeUpgradeServiceImpl 4234 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="upgrade-coordinator"] Edge VM 1bacaf05-####-42f2-b44b-659b#####475 got powered off successful.
      YYYY-MM-DDTHH:MM:SS.Z INFO task-executor-5-1-workitem-EDGE-1b###f05-24dc-42f2-####-659b4ff33475 NsxUpgradeMetricsServiceImpl 4234 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="upgrade-coordinator"] Processing event poweroffEdgeVMComplete on upgrade-unit 1bacaf05-24dc-####-b44b-659b#####475
      YYYY-MM-DDTHH:MM:SS.Z INFO task-executor-5-1-workitem-EDGE-1b###f05-24dc-42f2-####-659b4ff33475 EdgeNodeUpgradeServiceImpl 4234 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="upgrade-coordinator"] Edge vm got powered off at 1763743236218. Edge upgrade will wait for 20000 milli sec after edge vm powered off
      YYYY-MM-DDTHH:MM:SS.Z INFO task-executor-5-1-workitem-EDGE-1b###f05-24dc-####-b44b-659b#####475 VmOperations 4234 SYSTEM [nsx@6876 comp="nsx-manager" level="INFO" subcomp="upgrade-coordinator"] Editing/Adding extra configs [{ethernet0.ctxPerDev=3, ethernet1.ctxPerDev=3, ethernet2.ctxPerDev=3, ethernet3.ctxPerDev=3, ethernet4.ctxPerDev=3, ethernet0.udpRSS=1, ethernet1.udpRSS=1, ethernet2.udpRSS=1, ethernet3.udpRSS=1, ethernet4.udpRSS=1, ethernet0.pnicFeatures=4, ethernet1.pnicFeatures=4, ethernet2.pnicFeatures=4, ethernet3.pnicFeatures=4, ethernet4.pnicFeatures=4, featMask.vm.cpuid.PDPE1GB=Val:1, snapshot.maxSnapshots=0, sched.mem.shareCosBufSize=32}] for VM vm-<id>on vc 6ba2b23d-####-44e0-8979-98d86####159.
      YYYY-MM-DDTHH:MM:SS.ZZ ERROR task-executor-5-1-workitem-EDGE-########-24dc-42f2-####-659b4ff33475 VcUtilities 4234 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP40407" level="ERROR" subcomp="upgrade-coordinator"] Wait for completion failed with 'Unable to communicate with the remote host, since it is disconnected.'
      com.vmware.vim.binding.vmodl.fault.HostNotConnected: Unable to communicate with the remote host, since it is disconnected
  • Log from vCenter's vpxd.log for the failed host:
    • YYYY-MM-DDTHH:MM:SS.Z info vpxd[3540469] [Originator@6876 sub=vmomi.soapStub[212] opID=HostSync-host-##-2b####72] SOAP request returned HTTP failure; <<io_obj p:0x00007f7####56c48, h:127, <UNIX ''>, <UNIX '/var/run/envoy-hgw/hgw-pipe'>>, /hgw/host-##/vpxa>, method: getChanges; code: 503(Service Unavailable); fault: (null)
      ...
      YYYY-MM-DDTHH:MM:SS.Z warning vpxd[3540469] [Originator@6876 sub=MoHost opID=HostSync-host-##-2b####72] host [vim.HostSystem:host-<id>,<host_FQDN>] connection state changed to NO_RESPONSE
  • ESXi envoy logs report:
    • YYYY-MM-DDTHH:MM:SS.Z In(166) envoy[21347283]: "YYYY-MM-DDTHH:MM:SS.Z warning envoy[21347296] [Originator@6876 sub=filter] [Tags: "ConnectionId":"280230"] remote https connections exceed max allowed: 128"
      YYYY-MM-DDTHH:MM:SS.Z In(166) envoy[21347283]: "YYYY-MM-DDTHH:MM:SS.Z warning envoy[21347296] [Originator@6876 sub=filter] [Tags: "ConnectionId":"280230"] closing connection TCP<<VC_IP:46232, HOST_IP:443>"
      YYYY-MM-DDTHH:MM:SS.Z In(166) envoy[21347283]: "YYYY-MM-DDTHH:MM:SS.Z info envoy[21347296] [Originator@6876 sub=connection] [Tags: "ConnectionId":"280230"] remote address:<Host_IP>:46232,TLS_error:|268435588:SSL routines:OPENSSL_internal:CLIENTHELLO_TLSEXT|268435646:SSL routines:OPENSSL_internal:PARSE_TLSEXT"
  • The log snippets above show that envoy hit its limit of 128 remote HTTPS connections and started rejecting new ones. This also explains why restarting the envoy service on ESXi temporarily resolves the issue: the restart closes the connections that are already open.
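
When triaging a support bundle, it can help to count how often these symptom messages occur. The following is a minimal sketch, not an official tool: the three patterns are copied from the log excerpts above, while the helper name scan_lines and the counter keys are hypothetical. Log file locations are deliberately not hard-coded; pass in lines from whichever envoy and vpxd logs you have collected.

```python
import re

# Patterns copied from the log excerpts quoted in this article.
ENVOY_LIMIT_RE = re.compile(r"remote https connections exceed max allowed: (\d+)")
VPXD_503_RE = re.compile(r"code: 503\(Service Unavailable\)")
VPXD_NO_RESPONSE_RE = re.compile(r"connection state changed to NO_RESPONSE")


def scan_lines(lines):
    """Count symptom messages in an iterable of log lines.

    Hypothetical helper: returns (counts, limit), where limit is the
    connection limit reported by envoy, or None if it was never logged.
    """
    counts = {"envoy_limit_hits": 0, "vpxd_503": 0, "vpxd_no_response": 0}
    limit = None
    for line in lines:
        match = ENVOY_LIMIT_RE.search(line)
        if match:
            counts["envoy_limit_hits"] += 1
            limit = int(match.group(1))
        if VPXD_503_RE.search(line):
            counts["vpxd_503"] += 1
        if VPXD_NO_RESPONSE_RE.search(line):
            counts["vpxd_no_response"] += 1
    return counts, limit
```

For example, `scan_lines(open(path, errors="replace"))` can be called once per collected log file. A non-zero envoy_limit_hits count together with NO_RESPONSE entries in vpxd.log matches the pattern described in this article.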

Environment

VMware ESXi 8.0 U3g and earlier.

Cause

  • A cron job runs from SDDC Manager every day at 1 PM UTC.
  • This cron job runs through a VIM client on SDDC Manager that periodically does the following:
    1. Reads /etc/passwd from guest (API InitiateFileTransferFromGuest and subsequent GET)
    2. Creates temp file in guest (API CreateTemporaryFileInGuest)
    3. Runs a program in guest that should populate the temp file created in step 2 (API StartProgramInGuest)
    4. Reads the temp file from guest (via API InitiateFileTransferFromGuest and subsequent GET)
  • Due to a known issue, the connections opened by the GET requests (in steps 1 and 4 above) are not closed. As a result, the client eventually exhausts the ESXi envoy limit of 128 connections.
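
The leak described above can be illustrated with a toy model. This is a sketch of the general mechanism only, not the actual SDDC Manager or envoy code: the classes ConnectionLimiter and Connection, the job functions, and the figure of two leaked connections per run are all assumptions made for illustration. Only the limit of 128 comes from the envoy log.

```python
class ConnectionLimiter:
    """Toy stand-in for a server that caps concurrent connections."""

    def __init__(self, max_connections=128):
        self.max_connections = max_connections
        self.open_connections = 0

    def connect(self):
        # Refuse new connections once the cap is reached.
        if self.open_connections >= self.max_connections:
            raise ConnectionRefusedError("connections exceed max allowed")
        self.open_connections += 1
        return Connection(self)


class Connection:
    """A connection that frees its slot on the server only when closed."""

    def __init__(self, server):
        self._server = server
        self._closed = False

    def close(self):
        if not self._closed:
            self._closed = True
            self._server.open_connections -= 1

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


def run_job_leaky(server):
    # Assumed: two GETs per run (steps 1 and 4), neither one closed.
    server.connect()
    server.connect()


def run_job_clean(server):
    # Same two GETs, but each connection is closed after use.
    for _ in range(2):
        with server.connect():
            pass


def runs_until_refused(server, job, max_runs=1000):
    """Return the run number at which the job is first refused, else None."""
    for run in range(1, max_runs + 1):
        try:
            job(server)
        except ConnectionRefusedError:
            return run
    return None
```

In this model the leaky job is eventually refused because the open-connection count only ever grows, while the clean job can run indefinitely. Recreating the ConnectionLimiter resets the count, which is the analogue of restarting the envoy service.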

Resolution

  • This issue is permanently resolved in ESXi 8.0 U3h.
  • Upgrade the ESXi hosts to ESXi 8.0 U3h.