VKS Worker Nodes Stuck in Provisioned State due to "connection refused" reaching API Server VIP

Article ID: 437729

Products

Tanzu Kubernetes Runtime

Issue/Introduction

  • Supervisor or Workload Cluster worker nodes remain stuck in the "Provisioned" state and fail to join the cluster.

  • While the VMs are powered on and assigned IP addresses, they never transition to a "Ready" state.

  • Reviewing /var/log/cloud-init-output.log on the impacted worker nodes (via SSH or the VM console) shows the following error during the kubeadm join phase:

    [discovery] Retrying due to error: failed to request the cluster-info ConfigMap: Get "https://[IP_ADDRESS]:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": dial tcp [IP_ADDRESS]:6443: connect: connection refused
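
To see which endpoint the join is failing against, the log can be filtered for the refused dial attempts. The following is a minimal sketch, assuming a shell on the worker node with grep, awk, sed, and sort available; the function name `refused_endpoints` is hypothetical and not part of any VKS tooling.

```shell
# Hypothetical helper: list the unique host:port endpoints that refused
# a connection, as recorded in a cloud-init output log.
# Usage: refused_endpoints [logfile]
refused_endpoints() {
  local logfile="${1:-/var/log/cloud-init-output.log}"
  # Keep only the "dial tcp <endpoint>: connect: connection refused" fragment,
  # take the endpoint field, strip its trailing colon, and de-duplicate.
  grep -o 'dial tcp [^ ]*: connect: connection refused' "$logfile" \
    | awk '{ print $3 }' \
    | sed 's/:$//' \
    | sort -u
}
```

Called with no argument, the function reads the default log path. The endpoint it prints is the address that kubeadm tried to reach, which can then be compared with the Virtual Service VIP of the cluster.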

Environment

VMware vSphere Kubernetes Service (VKS)
NSX Advanced Load Balancer (AVI)

Cause

The worker nodes cannot retrieve the cluster-info ConfigMap because the NSX Advanced Load Balancer (AVI) Virtual Service (VS) in front of the API server is down or misconfigured. A "connection refused" error at the TCP layer indicates that the Load Balancer VIP is reachable on the network, but nothing is accepting connections on port 6443. An unreachable VIP would typically surface as a timeout or a "no route to host" error instead.

Resolution

1. Verify Load Balancer Status

Log in to the NSX Advanced Load Balancer (AVI) Controller UI and verify the status of the Virtual Service VIP for the impacted cluster:

  • Confirm that the Virtual Service status is Up.
  • Check the Health Score of the Virtual Service and its associated Pool Members (Control Plane nodes).
  • Ensure that the health checks for port 6443 are passing. If the VS is down, check whether the associated pool members (typically the control plane nodes) are up and accepting connections on port 6443.
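
The port 6443 checks can be approximated from any Linux host that has network access to the VIP and to the control plane nodes. This is a rough sketch, assuming bash and the coreutils `timeout` command are available; `probe_tcp` is a hypothetical helper name, and the "refused" classification relies on bash's English error text, so treat the result as a hint rather than a definitive diagnosis.

```shell
# Hypothetical helper: classify a TCP connection attempt.
# Usage: probe_tcp <host> <port> [timeout-seconds]
# Prints one of: open, refused, timeout, error
probe_tcp() {
  local host="$1" port="$2" wait_s="${3:-5}" err rc
  # bash opens a TCP socket when redirecting to /dev/tcp/<host>/<port>.
  err=$(LC_ALL=C timeout "$wait_s" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>&1)
  rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "open"        # something accepted the connection
  elif [ "$rc" -eq 124 ]; then
    echo "timeout"     # timeout(1) killed the attempt: no answer at all
  elif printf '%s' "$err" | grep -qi 'connection refused'; then
    echo "refused"     # host answered, but nothing is listening on the port
  else
    echo "error"       # any other failure, e.g. no route or name resolution
  fi
}
```

For example, `probe_tcp <VIP> 6443` followed by `probe_tcp <control-plane-node-IP> 6443` for each pool member. A "refused" result on the VIP combined with "open" on the control plane nodes would suggest the problem lies with the Virtual Service rather than with the API servers themselves.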

2. Remediate Stuck Nodes

Once the connectivity issue is fixed, nodes that failed bootstrapping during their first boot typically need to be recreated so that the join process runs again.

  1. Connect to the Supervisor cluster context.
  2. Identify the stuck machine objects in the relevant namespace using the command below.

    kubectl get vm,machines -n <namespace> | grep -i <name of the cluster> 
  3. Safely remediate the concerned machines using the command below.

    kubectl annotate machine -n <namespace> <machine-name> 'cluster.x-k8s.io/remediate-machine=""'

Note: Cluster API (CAPI) will automatically provision new worker nodes to replace the machines marked for remediation.
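
When many machines are affected, it can help to list only those still in the Provisioned phase before annotating them. This is a minimal sketch, assuming the Cluster API Machine objects expose their phase at `.status.phase`; `provisioned_machines` is a hypothetical helper that filters two-column NAME/PHASE input and does not modify anything.

```shell
# Hypothetical helper: read "NAME PHASE" lines on stdin and print the names
# of machines whose phase is exactly "Provisioned".
provisioned_machines() {
  awk '$2 == "Provisioned" { print $1 }'
}

# Example pipeline (run in the Supervisor context; replace <namespace>):
#   kubectl get machines -n <namespace> --no-headers \
#     -o custom-columns=NAME:.metadata.name,PHASE:.status.phase \
#     | provisioned_machines
```

Review the resulting names before annotating, since a machine that was created only moments ago can also be in the Provisioned phase while it is still joining normally.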

3. Validation

Monitor the replacement nodes to ensure they successfully reach the API server:

  1. Review /var/log/cloud-init-output.log on the newly provisioned worker nodes and confirm that the bootstrap completes without connection errors.
  2. Confirm that the nodes transition to the "Ready" state in the guest cluster context.
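
The readiness check can be scripted from the guest cluster context. This is a small sketch, assuming the default `kubectl get nodes` column layout with NAME first and STATUS second; `not_ready_nodes` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print the names of nodes whose STATUS does not start with "Ready".
not_ready_nodes() {
  # STATUS can carry suffixes such as "Ready,SchedulingDisabled",
  # so match on the prefix rather than the whole field.
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Example (run in the guest cluster context):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Empty output means every node listed reports a Ready status. A replacement node that has not yet joined will not appear in `kubectl get nodes` at all, so compare the node count against the expected number of workers as well.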