VKS cluster provisioning fails with "context deadline exceeded" during back-to-back recreation in VCF 9.x

Article ID: 439698

Products

VMware vSphere Kubernetes Service

Issue/Introduction

In VMware Cloud Foundation 9.x, when a VKS cluster is deleted and immediately recreated using the same namespace and name, the control plane may fail to become ready. The API server remains unreachable, and cluster status logs report an error similar to:

YYYY-MM-DDT HH:MM:SS failed to get server groups: Get "https://##.###.##.##:6443/api?timeout=10s": context deadline exceeded

The newly assigned Virtual IP (VIP) is often different from the previous one, but the cluster remains in a stale state referencing the old configuration.
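To confirm the symptom from a machine with network access to the cluster, one option is to probe the same endpoint that appears in the error message. The following is a minimal sketch, not an official diagnostic: the function name check_apiserver is hypothetical, and the VIP must be supplied by the caller. Any HTTP response (even 401 or 403) means the API server answered; no response within 10 seconds matches the "context deadline exceeded" behavior.

```shell
# Hypothetical helper: probe the control plane endpoint on port 6443.
# The body runs in a subshell, so its variables do not leak into the caller.
check_apiserver() (
  vip="$1"   # control plane VIP of the VKS cluster (supplied by the caller)
  # -k skips TLS verification (probe only); --max-time mirrors the 10s timeout in the error.
  # curl prints 000 as the HTTP code when no response was received.
  code="$(curl -k -s -o /dev/null -w '%{http_code}' --max-time 10 \
    "https://${vip}:6443/api?timeout=10s")" || true
  if [ -z "$code" ] || [ "$code" = "000" ]; then
    echo "unreachable"
    return 1
  fi
  echo "reachable (HTTP ${code})"
)
```

Example call, with the address as a placeholder: check_apiserver 192.0.2.10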

Environment

VMware Cloud Foundation (VCF) 9.0.1, 9.0.2, 9.1.0
vSphere Distributed Switch (VDS) enabled
vSphere Kubernetes Service

Cause

The issue is caused by stale state maintained by the flb-controller deployment within the Supervisor Cluster. The controller retains VIP allocation data from the deleted cluster instance, leading to a mismatch when the new cluster instance attempts to initialize its control plane.

Resolution

To resolve this issue, restart the flb-controller deployment to clear its stale state.

Using kubectl, switch context to the affected Supervisor Cluster.
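As a sketch of the context switch, assuming the Supervisor context already exists in the local kubeconfig: the function name use_supervisor_context is hypothetical, and the context name is environment-specific (available names can be listed with kubectl config get-contexts).

```shell
# Hypothetical helper: switch to the Supervisor context and print the active context.
use_supervisor_context() (
  # $1: kubeconfig context name of the affected Supervisor (environment-specific)
  kubectl config use-context "$1" &&
    kubectl config current-context
)
```

Printing the current context afterwards is a simple guard against restarting the deployment on the wrong cluster.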

Restart the flb-controller deployment in the vmware-system-flb namespace:

kubectl rollout restart deployment -n vmware-system-flb flb-controller

Ensure the deployment has completed the rollout:

kubectl get pods -n vmware-system-flb -l app=flb-controller
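As an alternative to inspecting the pod list by hand, kubectl rollout status can block until the restart completes. This is a minimal sketch: the function name wait_for_flb_rollout is hypothetical, and the 120s default timeout is an assumed value, not one taken from this article.

```shell
# Hypothetical helper: wait until the restarted flb-controller rollout is complete.
# Returns non-zero if the rollout does not finish within the timeout.
wait_for_flb_rollout() (
  # $1: optional timeout (assumed default: 120s)
  kubectl rollout status deployment/flb-controller \
    -n vmware-system-flb --timeout="${1:-120s}"
)
```

A non-zero exit status suggests the controller did not come back cleanly, in which case re-applying the cluster manifest is unlikely to help yet.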

Re-apply the VKS cluster manifest.

Additional Information

To avoid this issue without restarting the controller, wait several minutes between deleting and recreating the cluster, or use a different name or namespace for the new cluster.
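The waiting approach could be scripted roughly as follows. This is a sketch under stated assumptions: the function name recreate_after_delete is hypothetical, the cluster is assumed to be exposed as a Cluster API cluster resource on the Supervisor, and the 300-second pause is an arbitrary stand-in for "several minutes". Confirming that the old object is gone does not by itself guarantee that the flb-controller has released its state, so the pause is kept.

```shell
# Hypothetical helper: wait for the old cluster object to disappear, pause, then re-apply.
recreate_after_delete() (
  name="$1"           # cluster name
  ns="$2"             # vSphere Namespace that contained the cluster
  manifest="$3"       # path to the VKS cluster manifest
  pause="${4:-300}"   # seconds to wait after deletion (assumed default)
  # Assumes the cluster is visible as a "cluster" resource on the Supervisor.
  kubectl wait --for=delete "cluster/${name}" -n "$ns" --timeout=15m || return 1
  sleep "$pause"
  kubectl apply -f "$manifest"
)
```

Example call, with all values as placeholders: recreate_after_delete my-cluster my-namespace cluster.yaml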
