This article helps you quickly identify this known problem and fix it using one of the suggested workarounds.
vSphere Pods can be used to run Supervisor Services.
Symptoms:
This issue only happens on a Supervisor cluster deployed with VDS Networking.
There are several possible symptoms:
vSphere Pod fails during deployment with the pod state set to ErrImagePull. The command kubectl describe pod <podname> shows a failure when UDP is used to perform a DNS name lookup during image fetching. For example, this is the kind of message that can be seen if the issue occurs when you install a Supervisor Service:
failed to get images: Image <vsphere pod namespace>/<vsphere pod instance> has failed. Error: Failed to resolve on node <nodename>. Reason: Http request failed. Code 400: ErrorType(2) failed to do request: Head https://projects.packages.broadcom.com/v2/tkg/<supervisor service>/manifests/sha256:<hash>: dial tcp: lookup projects.packages.broadcom.com: i/o timeout
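For example, you can confirm this symptom by listing the vSphere Pods in the affected namespace and describing the failing one (the namespace and pod names below are placeholders):
kubectl get pods -n <vsphere pod namespace>
kubectl describe pod <podname> -n <vsphere pod namespace>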
vSphere Pod fails shortly after deployment with the pod state set to CrashLoopBackOff. The command kubectl logs <podname> shows a failure during a TCP connection to a ClusterIP. For example, this is the kind of error message that can be seen if the issue occurs when you install a Contour service:
time="YYYY-MM-DDTHH:MM:SSZ" level=fatal msg="unable to initialize Server dependencies required to start <supervisor service>" error="unable to set up controller manager: Get \"https://<IP>:443/api?timeout=32s\": net/http: TLS handshake timeout"
You can also inspect the ImageDisk resources related to the affected Supervisor Service:
kubectl get imagedisks -A -o yaml | grep <supervisor service name>
VMware vSphere ESXi 8.0.x
VMware vCenter Server 8.0.x
vSphere Supervisor 8.0 and higher
On a Supervisor deployed with VDS networking, any UDP or TCP traffic sent from a vSphere Pod to a ClusterIP goes through the gateway of the Supervisor workload network, even if the backend IP behind the ClusterIP is on the same subnetwork as the originating vSphere Pod. This traffic may be blocked at or near the gateway for the following possible reasons:
On the gateway, incoming UDP or TCP packets must be routed back out of the same network interface they arrived on. This alone can be denied by security features (for example, a router's reverse path forwarding (RPF) check).
While packets from the pod to the ClusterIP go through the gateway, returning packets do not if the pod and the backend behind the ClusterIP are on the same subnetwork. For a TCP connection, this means the gateway only sees the first and third packets of the TCP three-way handshake (SYN and ACK, but not SYN-ACK). A stateful firewall could block this asymmetric traffic. You can verify whether the pod and the backend share a subnetwork with the commands shown below.
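For example, to check whether a vSphere Pod and the backend behind a ClusterIP are on the same subnetwork, you can compare their IP addresses (the service and pod names below are placeholders):
kubectl get pod <podname> -n <vsphere pod namespace> -o wide
kubectl get endpoints <servicename> -n <vsphere pod namespace>
If the pod IP and the endpoint IPs are in the same subnet, return traffic bypasses the gateway and the flow becomes asymmetric as described above.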
VMware is aware of this issue and is working to resolve it in a future release. vSphere 9.0 will provide more verbose logging and information regarding these issues.
See below for possible workarounds:
First, identify the ClusterIP of the kube-dns service in the kube-system namespace:
kubectl get services -n kube-system | grep -i dns
kube-dns ClusterIP <kube-dns IP> <none> <port>/UDP,<port>/TCP,<port>/TCP ##d
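As a quick check, you can then attempt a DNS lookup against the kube-dns ClusterIP from inside an affected vSphere Pod (this assumes the pod image provides nslookup; the pod name is a placeholder). If the lookup times out, UDP traffic to the ClusterIP is being blocked as described above:
kubectl exec -it <podname> -n <vsphere pod namespace> -- nslookup projects.packages.broadcom.com <kube-dns IP>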