vSphere Pod Traffic to ClusterIP Time-outs

Article ID: 312199

Updated On: 10-16-2024

Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

This article helps you quickly identify a known issue in which traffic from a vSphere Pod to a ClusterIP times out, and resolve it using one of the suggested workarounds.


Symptoms:

This issue occurs only on a Supervisor deployed with VDS networking.
Any of the following symptoms can appear:

  • A vSphere Pod fails at deployment with the pod state set to ErrImagePull. The command kubectl describe pod <podname> shows a failure of the DNS name lookup, which uses UDP, during image fetching. For example, this kind of message can appear if the issue occurs while you install the Contour service:

failed to get images: Image svc-contour-domain-c1034/contour-204b221aceed9528140334ab567d869a62181e99-v12771 has failed. Error: Failed to resolve on node <nodename>. Reason: Http request failed. Code 400: ErrorType(2) failed to do request: Head "https://projects.registry.vmware.com/v2/tkg/contour/manifests/sha256:8c5c66410ccca423b3b1635401a0fb379dbefea0e017773ec8939da1541e72b0": dial tcp: lookup projects.registry.vmware.com: i/o timeout

  • A vSphere Pod fails shortly after deployment with the pod state set to CrashLoopBackOff. The command kubectl logs <podname> shows a failure during a TCP connection to a ClusterIP. For example, this kind of error message can appear if the issue occurs while you install the Contour service:

time="2023-11-15T10:03:22Z" level=fatal msg="unable to initialize Server dependencies required to start Contour" error="unable to set up controller manager: Get \"https://10.212.0.1:443/api?timeout=32s\": net/http: TLS handshake timeout"

  • A vSphere Pod times out while trying to establish a TCP connection to a ClusterIP, while direct connections to the backend IPs work fine.
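The last symptom can be checked with a small probe run from inside an affected vSphere Pod. The following is a minimal sketch, not a VMware-provided tool: the function names tcp_connect_ok and compare_paths are hypothetical, and the addresses passed to them must be replaced with the ClusterIP and one of its backend IPs from your own environment. It only tests whether the TCP handshake completes within a timeout.

```python
import socket


def tcp_connect_ok(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake to (host, port) completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks alike.
        return False


def compare_paths(cluster_ip: str, cluster_port: int,
                  backend_ip: str, backend_port: int,
                  timeout: float = 5.0) -> str:
    """Compare reachability through the ClusterIP with a direct backend connection."""
    via_cluster_ip = tcp_connect_ok(cluster_ip, cluster_port, timeout)
    direct = tcp_connect_ok(backend_ip, backend_port, timeout)
    if direct and not via_cluster_ip:
        # Backend reachable directly but not through the ClusterIP:
        # consistent with the symptom described in this article.
        return "matches-symptom"
    if direct and via_cluster_ip:
        return "both-reachable"
    # Backend itself is unreachable, so this probe cannot tell the cases apart.
    return "inconclusive"
```

For example, compare_paths("10.212.0.1", 443, "<backend IP>", 6443) uses the ClusterIP from the log above; the backend address and port are placeholders. A result of "matches-symptom" is only an indication, because a probe that fails for any other reason returns the same value.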



Environment

VMware vSphere ESXi 8.0.x
VMware vCenter Server 8.0.x

Cause

On a Supervisor deployed with VDS, any UDP or TCP traffic sent from a vSphere Pod to a ClusterIP goes through the gateway of the Supervisor workload network, even if the backend IP behind the ClusterIP is on the same subnetwork as the sending vSphere Pod. This traffic can be blocked at or near the gateway for two possible reasons:

  • On the gateway, incoming UDP or TCP packets must be routed back out of the same network interface they arrived on. This alone can be denied by security features (e.g. a router RPF check).

  • While packets from the pod to the ClusterIP go through the gateway, the return packets do not if the pod and the backend behind the ClusterIP are on the same subnetwork. For a TCP connection, this means the gateway sees only the 1st and 3rd packets of the TCP 3-way handshake (SYN and ACK) and never the 2nd (SYN-ACK). A firewall that tracks connection state could block this traffic.
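The asymmetric path described above can be illustrated with a small model. This is a simplified sketch under stated assumptions, not a description of the actual data path implementation: the function names and the example addresses are hypothetical, and the model assumes that a backend outside the workload subnet replies through the same gateway.

```python
import ipaddress


def same_subnet(pod_ip: str, backend_ip: str, workload_cidr: str) -> bool:
    """Return True if both addresses belong to the workload network."""
    net = ipaddress.ip_network(workload_cidr, strict=False)
    return (ipaddress.ip_address(pod_ip) in net
            and ipaddress.ip_address(backend_ip) in net)


def handshake_packets_seen_by_gateway(pod_ip: str, backend_ip: str,
                                      workload_cidr: str) -> list:
    """List the TCP handshake packets that cross the gateway in this model."""
    # Pod -> ClusterIP traffic always goes through the gateway (the article's premise).
    seen = ["SYN"]
    if not same_subnet(pod_ip, backend_ip, workload_cidr):
        # Assumption: a backend on another subnet replies via the same gateway.
        seen.append("SYN-ACK")
    # The final ACK is again sent by the pod to the ClusterIP, so it crosses the gateway.
    seen.append("ACK")
    return seen
```

With a pod at 10.0.0.5 and a backend at 10.0.0.9 on a hypothetical 10.0.0.0/24 workload network, the model reports that the gateway sees SYN and ACK only, which is the pattern a connection-tracking firewall may reject.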

Resolution

VMware is aware of this issue and is working to resolve it in a future release.


Workaround:

There are two possible workarounds:

  • Modify the router and firewall configuration so that any traffic arriving from the workload network interface can be routed back out of the same network interface.
  • When choosing the gateway of the Supervisor workload network, use an internal gateway with relaxed security rules. For example, instead of using your company's external gateway, point to an intermediate internal gateway that is not protected by a firewall and is just one hop from your external gateway.