Supervisor Services fail to reconcile due to mgmt-image-proxy LoadBalancer VIP TLS handshake failure

Article ID: 425276

Products

VMware vSphere Kubernetes Service

Issue/Introduction

Supervisor Services such as VKS Cluster Management Service fail in the Supervisor Services UI with a ReconcileFailed status. The failure occurs during package reconciliation when vendir/imgpkg attempts to fetch images from the mgmt-image-proxy service.

  • TLS handshake timeout or connection reset when accessing the service through its LoadBalancer VIP:
    curl -vk https://mgmt-image-proxy.kube-system.svc.cluster.local/v2/

  • Describing the failing Supervisor Service (for example, VKS Cluster Management Service) shows the following error:
    Reason: ReconcileFailed. Message: vendir: Error: Syncing directory 'O': Syncing directory '.' with imgpkgBundle contents: Fetching image: Error while preparing a transport to talk with the registry: Unable to create round tripper: Get "https://mgmt-image-proxy.kube-system.svc.cluster.local/v2/": net/http: TLS handshake timeout; Get "http://mgmt-image-proxy.kube-system.svc.cluster.local/v2/": dial tcp <mgmt-image-proxy_LoadBalancer_VIP>:80: connect: connection refused.

  • Supervisor Services remain in a failed or partially configured state

Environment

VMware vSphere Kubernetes Service

Cause

An MTU mismatch exists in the Service LoadBalancer (VIP) datapath for mgmt-image-proxy, most commonly seen when the Supervisor is deployed with VPC networking. The application and Kubernetes objects are healthy (pods, endpoints, ClusterIP), but TLS fails only when traffic traverses the LoadBalancer VIP. This is consistent with packet fragmentation/blackholing along the overlay/TEP path (Geneve overhead), where edges/hosts (TEP SVIs) remain at MTU 1500 while the environment expects jumbo frames.
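As a worked example of the overhead: a full 1500-byte inner packet gains an inner Ethernet header (14 bytes), Geneve (8 bytes plus options), UDP (8 bytes), and an outer IP header (20 bytes), so the underlay must carry roughly 1550 bytes or more; this is why NSX overlay transport commonly requires an underlay MTU of at least 1600. Small packets such as the TCP handshake still fit after encapsulation, which is why the connection opens but the large TLS certificate exchange stalls.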

Resolution

Fix MTU consistency across the VPC/overlay datapath (especially TEP SVIs on edges and hosts), then re-verify VIP TLS connectivity and retry Supervisor Service reconciliation.

Note: Exact MTU values depend on platform/network design (jumbo frames commonly used). The key is end-to-end consistency with overlay encapsulation overhead.
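As one example of a host-side check, vmkernel interface MTUs on an ESXi host (including TEP vmknics) can be listed over SSH with standard ESXi commands; interface names vary by environment:

esxcfg-vmknic -l
esxcli network ip interface list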

Troubleshooting Steps:

1. Confirm the failing Supervisor Service package install

kubectl -n vmware-system-supervisor-services get pkgi
kubectl -n vmware-system-supervisor-services get pkgi <PACKAGEINSTALL_NAME> -o yaml | sed -n '/usefulErrorMessage/,$p'
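Alternatively, kapp-controller surfaces the same message directly in the PackageInstall status; a minimal jsonpath sketch (field name per the kapp-controller PackageInstall API):

kubectl -n vmware-system-supervisor-services get pkgi <PACKAGEINSTALL_NAME> -o jsonpath='{.status.usefulErrorMessage}'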

2. Validate mgmt-image-proxy Service/LB configuration and endpoints

kubectl -n kube-system get svc mgmt-image-proxy -o wide
kubectl -n kube-system describe svc mgmt-image-proxy
kubectl -n kube-system get endpoints mgmt-image-proxy -o wide
kubectl -n kube-system get endpointslice | grep -i mgmt-image-proxy

Expected:

    • Service type = LoadBalancer

    • Endpoints present (multiple pod IPs)

    • VIP present in LoadBalancer Ingress
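A minimal sketch to verify each expectation directly with jsonpath (fields per the core Service/Endpoints APIs):

# Should print "LoadBalancer"
kubectl -n kube-system get svc mgmt-image-proxy -o jsonpath='{.spec.type}{"\n"}'
# Should print the VIP
kubectl -n kube-system get svc mgmt-image-proxy -o jsonpath='{.status.loadBalancer.ingress[0].ip}{"\n"}'
# Should print one or more pod IPs
kubectl -n kube-system get endpoints mgmt-image-proxy -o jsonpath='{.subsets[*].addresses[*].ip}{"\n"}'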

3. Validate DNS resolution (service name → VIP)

nslookup mgmt-image-proxy.kube-system.svc.cluster.local
dig mgmt-image-proxy.kube-system.svc.cluster.local A +short

Expected: resolves to the LoadBalancer VIP (example: 10.##.##.##).
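Note that the *.svc.cluster.local name only resolves inside the cluster. If testing from a workstation, a throwaway pod can run the lookup; a sketch (the image choice, and whether your environment permits ad-hoc pods, are assumptions):

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup mgmt-image-proxy.kube-system.svc.cluster.local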

4. Isolate datapath: test Pod IPs (bypass Service + VIP)

Get the registry/pod IPs (from endpoints output), then:

curl -vk https://<POD_IP_1>:5000/v2/
curl -vk https://<POD_IP_2>:5000/v2/
curl -vk https://<POD_IP_3>:5000/v2/

Expected: HTTP/1.1 200 OK
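To test every backend in one pass, a small loop over the endpoint IPs (port 5000 as above; adjust if your registry pods listen elsewhere):

# Print the HTTP status code returned by each backend pod
for ip in $(kubectl -n kube-system get endpoints mgmt-image-proxy -o jsonpath='{.subsets[*].addresses[*].ip}'); do
  echo "== ${ip} =="
  curl -sk -o /dev/null -w '%{http_code}\n' "https://${ip}:5000/v2/"
done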

5. Isolate datapath: test ClusterIP (bypass VIP)

Get ClusterIP from the service output, then:

curl -vk https://<CLUSTER_IP>:443/v2/

Expected: HTTP/1.1 200 OK
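To avoid copying values by hand, the ClusterIP can be captured into a shell variable first:

CLUSTER_IP=$(kubectl -n kube-system get svc mgmt-image-proxy -o jsonpath='{.spec.clusterIP}')
curl -vk "https://${CLUSTER_IP}:443/v2/"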

6. Reproduce failure: test via Service DNS / VIP (this should fail)

# DNS name (resolves to VIP)
curl -vk https://mgmt-image-proxy.kube-system.svc.cluster.local/v2/

# If you want to test VIP explicitly:
curl -vk https://<VIP>:443/v2/

# Port 80 check (often refused; included for completeness)
curl -v http://mgmt-image-proxy.kube-system.svc.cluster.local/v2/
curl -v http://<VIP>:80/v2/
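The VIP can be extracted the same way (this sketch assumes the first ingress entry carries an IP rather than a hostname):

VIP=$(kubectl -n kube-system get svc mgmt-image-proxy -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -vk "https://${VIP}:443/v2/"
curl -v "http://${VIP}:80/v2/"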

Common failure indicators:

    • TLS handshake timeout

    • Recv failure: Connection reset by peer

    • connection refused (port 80)

7. Quick path checks (helps point to MTU blackhole behavior)

traceroute mgmt-image-proxy.kube-system.svc.cluster.local
tracepath mgmt-image-proxy.kube-system.svc.cluster.local
tracepath -n <VIP>

If tracepath stalls with repeated "no reply" after the first hop, that is consistent with MTU or ICMP handling issues along the path.
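A more direct probe sets the Don't Fragment bit and steps the ICMP payload size down (Linux iputils ping; 1472 = 1500 minus the 20-byte IP and 8-byte ICMP headers). Some VIPs do not answer ICMP at all, in which case probe a node or backend IP on the same path:

ping -M do -s 1472 -c 3 <VIP>   # fails if any hop MTU is below 1500
ping -M do -s 1372 -c 3 <VIP>   # smaller payload; should succeed

If the larger size fails while the smaller succeeds, a sub-1500 path MTU (or dropped "fragmentation needed" ICMP) is effectively confirmed.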

8. Capture traffic during the failing TLS handshake (optional but strong evidence)

Run while reproducing the curl failure to VIP:

tcpdump -ni any host <VIP> and port 443

Look for:

    • The TCP handshake (SYN/SYN-ACK) completes, but the TLS data exchange is disrupted

    • Long stalls followed by FIN/RST behavior
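To preserve evidence for support, the same capture can be written to a file and read back offline (standard tcpdump options):

tcpdump -ni any -c 500 -w /tmp/vip-tls-fail.pcap host <VIP> and port 443
tcpdump -nr /tmp/vip-tls-fail.pcap

In an MTU blackhole, the capture typically shows the server's large TLS records (the certificate message) retransmitted repeatedly and never acknowledged.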

9. Post-fix validation (after MTU correction)

Re-run the same sequence to confirm that only the VIP path behavior has changed:

# backend pods
curl -vk https://<POD_IP_1>:5000/v2/

# clusterIP
curl -vk https://<CLUSTER_IP>:443/v2/

# VIP / DNS name (should now succeed)
curl -vk https://mgmt-image-proxy.kube-system.svc.cluster.local/v2/
curl -vk https://<VIP>:443/v2/

Then confirm Supervisor Services progress:

kubectl -n vmware-system-supervisor-services get pkgi

# Optional: check events/status for the pkgi that was failing
kubectl -n vmware-system-supervisor-services describe pkgi <PACKAGEINSTALL_NAME>
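To watch reconciliation live instead of polling, or to check the ReconcileSucceeded condition directly (condition type per the kapp-controller PackageInstall API), a sketch:

kubectl -n vmware-system-supervisor-services get pkgi -w
kubectl -n vmware-system-supervisor-services get pkgi <PACKAGEINSTALL_NAME> -o jsonpath='{.status.conditions[?(@.type=="ReconcileSucceeded")].status}{"\n"}'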

Additional Information

  • This is not a certificate issue in this scenario.

  • Backend registry pods and the ClusterIP can work while the VIP fails, because the VIP datapath introduces additional network components (Service LB / VPC routing / overlay path).

  • If the VIP fails only from VPC subnets but works from VLAN-backed networks, it strongly indicates a VPC datapath MTU/overlay issue.