This article helps you quickly identify this known problem and fix it using one of the suggested workarounds.
This issue only happens on Supervisor deployed with VDS Networking.
There are different possible symptoms:
A vSphere Pod fails at deployment with its pod state set to ErrImagePull. The command kubectl describe pod <podname> shows a failure when UDP is used to perform a DNS name lookup during image fetching. For example, this is the kind of message you may see if the issue occurs while installing a Contour service:
failed to get images: Image svc-contour-domain-c1034/contour-204b221aceed9528140334ab567d869a62181e99-v12771 has failed. Error: Failed to resolve on node <nodename>. Reason: Http request failed. Code 400: ErrorType(2) failed to do request: Head "https://projects.registry.vmware.com/v2/tkg/contour/manifests/sha256:8c5c66410ccca423b3b1635401a0fb379dbefea0e017773ec8939da1541e72b0": dial tcp: lookup projects.registry.vmware.com: i/o timeout
A vSphere Pod fails shortly after deployment with its pod state set to CrashLoopBackOff. The command kubectl logs <podname> shows a failure during a TCP connection to a ClusterIP. For example, this is the kind of error message you may see if the issue occurs while installing a Contour service:
time="2023-11-15T10:03:22Z" level=fatal msg="unable to initialize Server dependencies required to start Contour" error="unable to set up controller manager: Get \"https://10.212.0.1:443/api?timeout=32s\": net/http: TLS handshake timeout"
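The two symptoms above can be told apart by the pod state together with a characteristic fragment of the error text. The following is a minimal sketch of that triage, assuming you have already copied the pod state and the message out of the kubectl describe pod or kubectl logs output. The function name, the returned labels, and the matched substrings are illustrative choices based only on the sample messages in this article; a match suggests this issue but does not confirm it.

```python
from typing import Optional


def classify_symptom(pod_state: str, message: str) -> Optional[str]:
    """Return a label for the symptom the message resembles, or None.

    The substrings below are taken from the example messages in this
    article; other failures may produce similar text (assumption).
    """
    # Symptom 1: image pull fails because the DNS lookup (UDP) times out.
    if (
        pod_state == "ErrImagePull"
        and "lookup" in message
        and "i/o timeout" in message
    ):
        return "dns-lookup-timeout"

    # Symptom 2: the container starts, then a TCP/TLS connection to a
    # ClusterIP times out and the pod restarts repeatedly.
    if pod_state == "CrashLoopBackOff" and "TLS handshake timeout" in message:
        return "clusterip-tcp-timeout"

    return None


if __name__ == "__main__":
    sample = "dial tcp: lookup projects.registry.vmware.com: i/o timeout"
    print(classify_symptom("ErrImagePull", sample))
```

A result of None only means the text does not resemble the examples shown here; it does not rule the issue out.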
On a Supervisor deployed with VDS, any UDP or TCP traffic sent from a vSphere Pod to a ClusterIP goes through the gateway of the Supervisor workload network, even if the backend IP behind the ClusterIP is on the same subnetwork as the sending vSphere Pod. This traffic may be blocked at or near the gateway for two possible reasons:
On the gateway, incoming UDP or TCP packets must be routed back out of the same network interface on which they arrived. This alone could be denied by security features (e.g., a router RPF check).
While packets from the pod to the ClusterIP go through the gateway, packets on the way back do not if the pod and the backend behind the ClusterIP are on the same subnetwork. In the case of a TCP connection, this means the gateway sees only the 1st and 3rd packets (SYN and ACK) of the TCP 3-way handshake and never sees the 2nd (SYN-ACK). A firewall could block this traffic.
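The asymmetric path described above can be illustrated with a short sketch. It first checks whether a pod and a backend fall in the same workload subnetwork, then lists which packets of the TCP 3-way handshake would cross the gateway. The function names and the example addresses are hypothetical, and the model assumes that return traffic crosses the gateway whenever the two endpoints are on different subnetworks.

```python
import ipaddress
from typing import List


def same_workload_subnet(pod_ip: str, backend_ip: str, workload_cidr: str) -> bool:
    """True if both addresses belong to the given workload network."""
    network = ipaddress.ip_network(workload_cidr, strict=False)
    return (
        ipaddress.ip_address(pod_ip) in network
        and ipaddress.ip_address(backend_ip) in network
    )


def handshake_packets_seen_by_gateway(same_subnet: bool) -> List[str]:
    """List the handshake packets that pass through the gateway.

    Pod-to-ClusterIP packets always cross the gateway. Return packets
    cross it only when the endpoints are on different subnetworks
    (modelling assumption).
    """
    handshake = [
        ("SYN", "pod->backend"),
        ("SYN-ACK", "backend->pod"),
        ("ACK", "pod->backend"),
    ]
    seen = []
    for name, direction in handshake:
        if direction == "pod->backend" or not same_subnet:
            seen.append(name)
    return seen


if __name__ == "__main__":
    # Hypothetical addresses on a hypothetical workload network.
    same = same_workload_subnet("192.168.10.5", "192.168.10.9", "192.168.10.0/24")
    print(same, handshake_packets_seen_by_gateway(same))
```

When both endpoints share the subnetwork, the gateway sees a SYN and an ACK with no SYN-ACK between them, which is the pattern a stateful firewall may reject.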
VMware is aware of this issue and is working to resolve it in a future release.
There are two possible workarounds: