Supervisor Services fail to reconcile due to mgmt-image-proxy LoadBalancer VIP TLS handshake failure

Article ID: 425276

Products

VMware vSphere Kubernetes Service

Issue/Introduction

Supervisor Services such as VKS Cluster Management Service fail in the Supervisor Services UI with a ReconcileFailed status. The failure occurs during package reconciliation when vendir/imgpkg attempts to fetch images from the mgmt-image-proxy service.

  • TLS handshake timeout or connection reset when accessing the service through its LoadBalancer VIP:
    curl -vk https://mgmt-image-proxy.kube-system.svc.cluster.local/v2/

  • Describing the failing Supervisor Service (for example, VKS Cluster Management Service) shows the following error:
    Reason: ReconcileFailed. Message: vendir: Error: Syncing directory 'O': Syncing directory '.' with imgpkgBundle contents: Fetching image: Error while preparing a transport to talk with the registry: Unable to create round tripper: Get "https://mgmt-image-proxy.kube-system.svc.cluster.local/v2/": net/http: TLS handshake timeout; Get "http://mgmt-image-proxy.kube-system.svc.cluster.local/v2/": dial tcp <mgmt-image-proxy_LoadBalancer_VIP>:80: connect: connection refused.

  • Supervisor Services remain in a failed or partially configured state

Environment

VMware vSphere Kubernetes Service

Cause

An MTU mismatch exists in the Service LoadBalancer (VIP) datapath for mgmt-image-proxy, most commonly seen when the Supervisor is deployed with VPC networking. The application and Kubernetes objects are healthy (pods, endpoints, ClusterIP), but TLS fails only when traffic traverses the LoadBalancer VIP. This is consistent with packet fragmentation/blackholing along the overlay/TEP path (Geneve overhead), where edges/hosts (TEP SVIs) remain at MTU 1500 while the environment expects jumbo frames.
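As a worked example of the overhead: a full 1500-byte inner packet gains an inner Ethernet header (14 bytes), Geneve (8 bytes plus options), UDP (8 bytes), and an outer IP header (20 bytes), so the underlay must carry roughly 1550 bytes or more; this is why NSX overlay transport commonly requires an underlay MTU of at least 1600. Small packets such as the TCP handshake still fit after encapsulation, which is why the connection opens but the large TLS certificate exchange stalls.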

Resolution

Fix MTU consistency across the VPC/overlay datapath (especially TEP SVIs on edges and hosts), then re-verify VIP TLS connectivity and retry Supervisor Service reconciliation.

Note: Exact MTU values depend on platform/network design (jumbo frames commonly used). The key is end-to-end consistency with overlay encapsulation overhead.
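As one example of a host-side check, vmkernel interface MTUs on an ESXi host (including TEP vmknics) can be listed over SSH with standard ESXi commands; interface names vary by environment:

esxcfg-vmknic -l
esxcli network ip interface list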

Troubleshooting Steps:

1. Confirm the failing Supervisor Service package install

kubectl -n vmware-system-supervisor-services get pkgi
kubectl -n vmware-system-supervisor-services get pkgi <PACKAGEINSTALL_NAME> -o yaml | sed -n '/usefulErrorMessage/,$p'
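Alternatively, kapp-controller surfaces the same message directly in the PackageInstall status; a minimal jsonpath sketch (field name per the kapp-controller PackageInstall API):

kubectl -n vmware-system-supervisor-services get pkgi <PACKAGEINSTALL_NAME> -o jsonpath='{.status.usefulErrorMessage}'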

2. Validate mgmt-image-proxy Service/LB configuration and endpoints

kubectl -n kube-system get svc mgmt-image-proxy -o wide
kubectl -n kube-system describe svc mgmt-image-proxy
kubectl -n kube-system get endpoints mgmt-image-proxy -o wide
kubectl -n kube-system get endpointslice | grep -i mgmt-image-proxy

Expected:

    • Service type = LoadBalancer

    • Endpoints present (multiple pod IPs)

    • VIP present in LoadBalancer Ingress
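A minimal sketch to verify each expectation directly with jsonpath (fields per the core Service/Endpoints APIs):

# Should print "LoadBalancer"
kubectl -n kube-system get svc mgmt-image-proxy -o jsonpath='{.spec.type}{"\n"}'
# Should print the VIP
kubectl -n kube-system get svc mgmt-image-proxy -o jsonpath='{.status.loadBalancer.ingress[0].ip}{"\n"}'
# Should print one or more pod IPs
kubectl -n kube-system get endpoints mgmt-image-proxy -o jsonpath='{.subsets[*].addresses[*].ip}{"\n"}'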

3. Validate DNS resolution (service name → VIP)

nslookup mgmt-image-proxy.kube-system.svc.cluster.local
dig mgmt-image-proxy.kube-system.svc.cluster.local A +short

Expected: resolves to the LoadBalancer VIP (example: 10.##.##.##).
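Note that the *.svc.cluster.local name only resolves inside the cluster. If testing from a workstation, a throwaway pod can run the lookup; a sketch (the image choice, and whether your environment permits ad-hoc pods, are assumptions):

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup mgmt-image-proxy.kube-system.svc.cluster.local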

4. Isolate datapath: test Pod IPs (bypass Service + VIP)

Get the registry/pod IPs (from endpoints output), then:

curl -vk https://<POD_IP_1>:5000/v2/
curl -vk https://<POD_IP_2>:5000/v2/
curl -vk https://<POD_IP_3>:5000/v2/

Expected: HTTP/1.1 200 OK
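To test every backend in one pass, a small loop over the endpoint IPs (port 5000 as above; adjust if your registry pods listen elsewhere):

# Print the HTTP status code returned by each backend pod
for ip in $(kubectl -n kube-system get endpoints mgmt-image-proxy -o jsonpath='{.subsets[*].addresses[*].ip}'); do
  echo "== ${ip} =="
  curl -sk -o /dev/null -w '%{http_code}\n' "https://${ip}:5000/v2/"
done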

5. Isolate datapath: test ClusterIP (bypass VIP)

Get ClusterIP from the service output, then:

curl -vk https://<CLUSTER_IP>:443/v2/

Expected: HTTP/1.1 200 OK
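To avoid copying values by hand, the ClusterIP can be captured into a shell variable first:

CLUSTER_IP=$(kubectl -n kube-system get svc mgmt-image-proxy -o jsonpath='{.spec.clusterIP}')
curl -vk "https://${CLUSTER_IP}:443/v2/"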

6. Reproduce failure: test via Service DNS / VIP (this should fail)

# DNS name (resolves to VIP)
curl -vk https://mgmt-image-proxy.kube-system.svc.cluster.local/v2/

# If you want to test VIP explicitly:
curl -vk https://<VIP>:443/v2/

# Port 80 check (often refused; included for completeness)
curl -v http://mgmt-image-proxy.kube-system.svc.cluster.local/v2/
curl -v http://<VIP>:80/v2/
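The VIP can be extracted the same way (this sketch assumes the first ingress entry carries an IP rather than a hostname):

VIP=$(kubectl -n kube-system get svc mgmt-image-proxy -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -vk "https://${VIP}:443/v2/"
curl -v "http://${VIP}:80/v2/"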

Common failure indicators:

    • TLS handshake timeout

    • Recv failure: Connection reset by peer

    • connection refused (port 80)

7. Quick path checks (helps point to MTU blackhole behavior)

traceroute mgmt-image-proxy.kube-system.svc.cluster.local
tracepath mgmt-image-proxy.kube-system.svc.cluster.local
tracepath -n <VIP>

If tracepath stalls with repeated "no reply" after the first hop, that is consistent with MTU or ICMP handling issues along the path.
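A more direct probe sets the Don't Fragment bit and steps the ICMP payload size down (Linux iputils ping; 1472 = 1500 minus the 20-byte IP and 8-byte ICMP headers). Some VIPs do not answer ICMP at all, in which case probe a node or backend IP on the same path:

ping -M do -s 1472 -c 3 <VIP>   # fails if any hop MTU is below 1500
ping -M do -s 1372 -c 3 <VIP>   # smaller payload; should succeed

If the larger size fails while the smaller succeeds, a sub-1500 path MTU (or dropped "fragmentation needed" ICMP) is effectively confirmed.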

8. Capture traffic during the failing TLS handshake (optional but strong evidence)

Run while reproducing the curl failure to VIP:

tcpdump -ni any host <VIP> and port 443

Look for:

    • The TCP handshake (SYN/SYN-ACK) completes, but the TLS data exchange is disrupted

    • Long stalls followed by FIN/RST behavior
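To preserve evidence for support, the same capture can be written to a file and read back offline (standard tcpdump options):

tcpdump -ni any -c 500 -w /tmp/vip-tls-fail.pcap host <VIP> and port 443
tcpdump -nr /tmp/vip-tls-fail.pcap

In an MTU blackhole, the capture typically shows the server's large TLS records (the certificate message) retransmitted repeatedly and never acknowledged.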

9. Post-fix validation (after MTU correction)

Re-run the same sequence to confirm that only the VIP path behavior has changed:

# backend pods
curl -vk https://<POD_IP_1>:5000/v2/

# clusterIP
curl -vk https://<CLUSTER_IP>:443/v2/

# VIP / DNS name (should now succeed)
curl -vk https://mgmt-image-proxy.kube-system.svc.cluster.local/v2/
curl -vk https://<VIP>:443/v2/

Then confirm Supervisor Services progress:

kubectl -n vmware-system-supervisor-services get pkgi

# Optional: check events/status for the pkgi that was failing
kubectl -n vmware-system-supervisor-services describe pkgi <PACKAGEINSTALL_NAME>
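To watch reconciliation live instead of polling, or to check the ReconcileSucceeded condition directly (condition type per the kapp-controller PackageInstall API), a sketch:

kubectl -n vmware-system-supervisor-services get pkgi -w
kubectl -n vmware-system-supervisor-services get pkgi <PACKAGEINSTALL_NAME> -o jsonpath='{.status.conditions[?(@.type=="ReconcileSucceeded")].status}{"\n"}'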

Additional Information

  • This is not a certificate issue in this scenario.

  • Backend registry pods and the ClusterIP can work while the VIP fails, because the VIP datapath introduces additional network components (Service LB / VPC routing / overlay path).

  • If the VIP fails only from VPC subnets but works from VLAN-backed networks, it strongly indicates a VPC datapath MTU/overlay issue.