VMs intermittently losing connection to ESXi host during VM configuration tasks

Products

VMware vSphere ESXi

Issue/Introduction

- At unpredictable intervals, configuration tasks performed on a VM fail with the error
'An error occurred while communicating with the remote host'

- ESXi<->vCenter communication is routed through a firewall tunnel (physical firewall devices or VM-based solutions)
- Increasing the TTL on the firewall does not resolve the issue, see KB https://knowledge.broadcom.com/external/article/320776/
- Making the keep-alive more aggressive on either ESXi or vCenter does not resolve the issue, see KB https://knowledge.broadcom.com/external/article/320776/

- /var/log/vmware/vpxd/vpxd.log on the vCenter has many references to multiple hosts going into not-responding state:

YYYY-MM-DDTHH:MM:27.407Z warning vpxd[#####] [Originator@6876 sub=MoHost opID=HB-host-#####@######-########] host [vim.HostSystem:host-#####,<ESXI_FQDN>] connection state changed to NO_RESPONSE
YYYY-MM-DDTHH:MM:27.661Z info vpxd[#####] [Originator@6876 sub=MoHost opID=HB-host-#####@######-########] host [vim.HostSystem:host-#####,<ESXI_FQDN>] connection state changed to CONNECTED
YYYY-MM-DDTHH:MM:08.012Z warning vpxd[#####] [Originator@6876 sub=MoHost opID=HB-host-#####@######-########] host [vim.HostSystem:host-#####,<ESXI_FQDN>] connection state changed to NO_RESPONSE
YYYY-MM-DDTHH:MM:08.128Z info vpxd[#####] [Originator@6876 sub=MoHost opID=HB-host-#####@######-########] host [vim.HostSystem:host-#####,<ESXI_FQDN>] connection state changed to CONNECTED
YYYY-MM-DDTHH:MM:48.811Z warning vpxd[########] [Originator@6876 sub=MoHost opID=HB-host-#####@######-########] host [vim.HostSystem:host-#####,<ESXI_FQDN>] connection state changed to NO_RESPONSE
YYYY-MM-DDTHH:MM:49.071Z info vpxd[########] [Originator@6876 sub=MoHost opID=HB-host-#####@######-########] host [vim.HostSystem:host-#####,<ESXI_FQDN>] connection state changed to CONNECTED

- /var/log/vmware/envoy-hgw/envoy-access-##.log on the vCenter contains many "upstream_reset_before_response_started{connection_termination}" or "/sdk/service HTTP/1.1 503 no_healthy_upstream" messages with the same timestamps as the ESXi host being marked as not-responding

- /var/log/vmware/envoy-hgw/envoy.log on the vCenter contains entries such as:

"OPENSSL_internal:Connection reset by peer:TLS_error_end"

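The log symptoms above can be checked with a short script. The following is a minimal Python sketch, not a supported tool: the function names find_flaps and count_envoy_markers and the max_gap_seconds parameter are hypothetical, and the timestamp layout (ISO 8601 with milliseconds and a trailing Z) is an assumption based on the anonymized vpxd.log excerpt above. It pairs each NO_RESPONSE entry with the next CONNECTED entry for the same host and counts the two Envoy messages quoted above.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the vpxd.log state-change lines quoted in this article.
# Assumption: real timestamps look like 2024-01-01T10:00:27.407Z.
STATE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\s.*"
    r"\[vim\.HostSystem:(?P<moid>host-\d+),(?P<fqdn>[^\]]+)\]\s+"
    r"connection state changed to (?P<state>NO_RESPONSE|CONNECTED)"
)

# The two envoy-access log messages named in this article.
ENVOY_MARKERS = (
    "upstream_reset_before_response_started{connection_termination}",
    "no_healthy_upstream",
)


def find_flaps(lines, max_gap_seconds=5.0):
    """Return (fqdn, lost_at, restored_at) for each NO_RESPONSE that is
    followed by CONNECTED for the same host within max_gap_seconds."""
    pending = {}  # fqdn -> time of the last unmatched NO_RESPONSE
    flaps = []
    for line in lines:
        m = STATE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%f")
        fqdn = m["fqdn"]
        if m["state"] == "NO_RESPONSE":
            pending[fqdn] = ts
        elif fqdn in pending:
            lost_at = pending.pop(fqdn)
            if (ts - lost_at).total_seconds() <= max_gap_seconds:
                flaps.append((fqdn, lost_at, ts))
    return flaps


def count_envoy_markers(lines):
    """Count how many lines contain each of the Envoy messages above."""
    counts = Counter()
    for line in lines:
        for marker in ENVOY_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts
```

Both functions accept any iterable of lines, so an open file handle for a copy of vpxd.log or envoy-access-##.log can be passed directly. Many sub-second flaps for several hosts, together with non-zero counts for the Envoy messages in the same period, would be consistent with the symptoms described in this article.
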
Environment

ESXi 8.x
vCenter 8.x

Cause

vCenter attempts to establish an SSL connection to the ESX host. The traffic is forwarded through a tunnel that terminates inside a guest VM.
The guest VM decapsulates the tunnel headers and forwards the inner TCP + SSL packets to the ESX host.
The SSL connection intermittently fails to establish. Packet captures show the ESX host responding with an RST to a valid TLS "Client Hello".
Due to an issue in the ESX networking stack, certain packets are reordered and delivered to different RX processing threads.
The issue is observed only for packets originating from a VM on the same host. Because the issue is highly timing-sensitive, a firewall tunnel configuration can introduce additional latency for some packets, which increases the probability of packet reordering and aggravates the issue, resulting in the frequent disconnect error messages.
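
The intermittent handshake failure described above can be measured with a simple probe. The following is a minimal Python sketch using only the standard library; the names tls_handshake_once and probe are hypothetical, port 443 is assumed, and certificate verification is disabled because only handshake completion is of interest. The probe is meaningful only when run from a system whose traffic to the ESX host follows the same tunnel path as vCenter. Depending on timing, a reset during the handshake may surface as ConnectionResetError or as another socket or SSL error, so both are counted.

```python
import socket
import ssl


def tls_handshake_once(host, port=443, timeout=5.0):
    """Open a TCP connection, complete a TLS handshake, then close."""
    ctx = ssl.create_default_context()
    # Verification is turned off: the host certificate may not be trusted
    # by the client, and only handshake completion matters here.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host):
            pass


def probe(host, attempts=100, handshake=tls_handshake_once):
    """Run repeated handshakes and count successes, resets and other errors."""
    results = {"ok": 0, "reset": 0, "other_error": 0}
    for _ in range(attempts):
        try:
            handshake(host)
            results["ok"] += 1
        except ConnectionResetError:
            results["reset"] += 1
        except OSError:
            # Covers timeouts, ssl.SSLError and other socket failures.
            results["other_error"] += 1
    return results
```

For example, probe("<ESXI_FQDN>", attempts=200) returns a dictionary of counts. A non-zero "reset" count would be consistent with the RST behaviour seen in packet captures, but it does not by itself confirm this cause.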

Resolution

This is a corner-case issue.

The issue is resolved in 9.x.
A fix is expected in the next ESX 8.x patch release.