Supervisor Services such as VKS Cluster Management Service fail in the Supervisor Services UI with a ReconcileFailed status. The failure occurs during package reconciliation when vendir/imgpkg attempts to fetch images from the mgmt-image-proxy service.
TLS handshake timeout or connection reset when accessing the registry endpoint, for example with:
curl -vk https://mgmt-image-proxy.kube-system.svc.cluster.local/v2/
Reason: ReconcileFailed. Message: vendir: Error: Syncing directory 'O': Syncing directory '.' with imgpkgBundle contents: Fetching image: Error while preparing a transport to talk with the registry: Unable to create round tripper: Get "https://mgmt-image-proxy.kube-system.svc.cluster.local/v2/": net/http: TLS handshake timeout; Get "http://mgmt-image-proxy.kube-system.svc.cluster.local/v2/": dial tcp <mgmt-image-proxy_LoadBalancer_VIP>:80: connect: connection refused.
VMware vSphere Kubernetes Service
An MTU mismatch exists in the Service LoadBalancer (VIP) datapath for mgmt-image-proxy, commonly seen when the Supervisor is deployed using VPC networking. The application and Kubernetes objects are healthy (pods/endpoints/ClusterIP), but TLS fails only when traffic traverses the LoadBalancer VIP. This is consistent with packet fragmentation/blackholing along the overlay/TEP path (Geneve overhead), where edges/hosts (TEP SVIs) remain at MTU 1500 while the environment expects jumbo frames.
Fix MTU consistency across the VPC/overlay datapath (especially TEP SVIs on edges and hosts), then re-verify VIP TLS connectivity and retry Supervisor Service reconciliation.
Note: Exact MTU values depend on platform/network design (jumbo frames commonly used). The key is end-to-end consistency with overlay encapsulation overhead.
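As a rough illustration of the encapsulation overhead (byte counts are approximate and depend on IP version and Geneve options):
1500-byte inner frame + ~50+ bytes of outer headers (Ethernet + IP + UDP + Geneve) > 1500-byte underlay MTU
A frame that fits the pod/VIP segment therefore cannot traverse a TEP/underlay link still set to 1500, which is why the transport path typically uses at least 1600 (often 9000 with jumbo frames).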
Troubleshooting Steps:
1. Confirm the failing Supervisor Service package install
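For example, with kubectl access to the Supervisor (the namespace and PackageInstall name are placeholders and vary per service):
kubectl get pkgi -A
kubectl -n <service-namespace> describe pkgi <packageinstall-name>
The status/conditions should show the ReconcileFailed reason and the vendir/imgpkg error quoted above.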
2. Validate mgmt-image-proxy Service/LB configuration and endpoints
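For example (the service name and namespace come from the error message above):
kubectl -n kube-system get svc mgmt-image-proxy -o wide
kubectl -n kube-system get endpoints mgmt-image-proxy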
Expected:
Service type = LoadBalancer
Endpoints present (multiple pod IPs)
VIP present in LoadBalancer Ingress
3. Validate DNS resolution (service name → VIP)
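For example, from a pod or node that uses the cluster DNS (tool availability varies by image):
nslookup mgmt-image-proxy.kube-system.svc.cluster.local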
Expected: resolves to the LoadBalancer VIP (example: 10.##.##.##).
4. Isolate datapath: test Pod IPs (bypass Service + VIP)
Get the registry/pod IPs (from endpoints output), then:
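For example (the pod IP and port are placeholders; take them from the endpoints output in step 2):
curl -vk https://<POD_IP>:<PORT>/v2/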
Expected: HTTP/1.1 200 OK
5. Isolate datapath: test ClusterIP (bypass VIP)
Get ClusterIP from the service output, then:
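For example (the ClusterIP is a placeholder from the service output in step 2):
curl -vk https://<CLUSTER_IP>/v2/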
Expected: HTTP/1.1 200 OK
6. Reproduce failure: test via Service DNS / VIP (this should fail)
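For example:
curl -vk https://mgmt-image-proxy.kube-system.svc.cluster.local/v2/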
Common failure indicators:
TLS handshake timeout
Recv failure: Connection reset by peer
connection refused (port 80)
7. Quick path checks (helps point to MTU blackhole behavior)
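For example (run from the same client used in the curl tests; tracepath reports per-hop path MTU where ICMP is permitted):
tracepath -n <LB_VIP>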
If tracepath stalls with repeated "no reply" after the first hop, that is consistent with MTU/ICMP handling issues in the path.
8. Capture traffic during the failing TLS handshake (optional but strong evidence)
Run while reproducing the curl failure to VIP:
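For example (the capture interface is a placeholder; run this on the client while re-issuing the curl from step 6):
tcpdump -ni <interface> -w vip-tls.pcap host <LB_VIP> and port 443
Open the capture in Wireshark (or replay it with tcpdump -r vip-tls.pcap) to inspect the handshake.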
Look for:
SYN/SYN-ACK works but TLS data exchange gets disrupted
Long stalls followed by FIN/RST behavior
9. Post-fix validation (after MTU correction)
Re-run the exact same sequence to confirm that only the VIP path result changes (it should now succeed):
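For example:
curl -vk https://mgmt-image-proxy.kube-system.svc.cluster.local/v2/
Expected: the TLS handshake completes and the registry responds with HTTP/1.1 200 OK, matching the Pod IP and ClusterIP results.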
Then confirm Supervisor Services progress:
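For example (the grep pattern is a placeholder for the affected service):
kubectl get pkgi -A | grep -i <service-name>
Expected: the PackageInstall reports ReconcileSucceeded and the ReconcileFailed status clears in the Supervisor Services UI.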
This is not a certificate issue in this scenario.
Backend registry pods and ClusterIP can work while VIP fails, because the VIP datapath introduces additional network components (Service LB / VPC routing / overlay path).
If the VIP fails only from VPC subnets but works from VLAN-backed networks, it strongly indicates a VPC datapath MTU/overlay issue.