Unable to Create Workload Cluster Nodes due to Stale NSX-T Subnets in vSphere Supervisor

Article ID: 402228


Products

VMware NSX, VMware vSphere Kubernetes Service, VMware NSX for vSphere

Issue/Introduction

When creating Workload Cluster nodes in a vSphere Supervisor environment with NSX-T, the new nodes never reach Running state.

This can occur during rolling redeployments or workload cluster upgrades, both of which use rolling redeployment logic: a new node is created first, and the older node is cleaned up afterward.

 

While connected to the Supervisor Cluster context, the following symptoms are observed:

  • One or more workload cluster machines are stuck in Provisioned or Provisioning state (see the example output after this list):
    • kubectl get machine -n <workload cluster namespace>
  • The NSX-NCP pod is healthy and in Running state:
    • kubectl get pods -n vmware-system-nsx
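
For reference, the machine listing may look similar to the following hypothetical output (names are placeholders and the exact columns vary by Cluster API version); the affected machine remains in Provisioned or Provisioning while healthy machines show Running:

    kubectl get machine -n <workload cluster namespace>
    NAME                           CLUSTER              PHASE         AGE   VERSION
    <new node machine name>        <workload cluster>   Provisioned   45m   <kubernetes version>
    <existing node machine name>   <workload cluster>   Running       20d   <kubernetes version>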

While connected to the Workload Cluster context, the following symptoms are observed:

  • Describing the new node shows the following taint (see the example after this list):
    • "node.kubernetes.io/network-unavailable"

 

There are no alarms in the NSX-T web UI regarding NCP health.

Environment

vSphere Supervisor 7.X

vSphere Supervisor 8.X

Cause

After a rolling redeployment of nodes in a Workload Cluster within a vSphere Supervisor environment using NSX-T, one or more subnets belonging to the removed nodes are left over in the cluster's IPPool object.

Resolution

A fix for the vSphere CPI issue of not properly sending deletion requests for the subnet from the IPPool object will be available in an upcoming version of the VKS Supervisor Service.

 

Workaround

The stale subnet entries can be manually cleaned up from the IPPool object in the Supervisor cluster context.

Once the stale entries are removed from the IPPool object, the cleanup will propagate to the NSX side (a consolidated command example follows the numbered steps below).

  1. Connect to the Supervisor Cluster context

  2. Locate the IPPool object:
    • kubectl get ippool -n <workload cluster namespace>
  3. Perform a describe on the IPPool object and navigate to the spec.subnets section:
    • kubectl describe ippool -n <workload cluster namespace> <ippool name>
  4. Determine which of the subnets listed in the IPPool for the workload cluster are stale, i.e. associated with nodes that no longer exist:
    • This can be compared to the current Running nodes in the workload cluster:
      • kubectl get machines -n <workload cluster namespace>
  5. Carefully remove only the subnet entry/entries for the stale subnet(s) which are associated with the missing node(s):
    • kubectl edit ippool -n <workload cluster namespace> <ippool name>
    • spec:
          subnets:
          - ipFamily: ipv4
            name: <missing node name>
            prefixLength: 24
          - ipFamily: ipv4
            name: <existing node name>
            prefixLength: 24

  6. Confirm that the stale subnet is cleaned up.
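
For reference, the numbered steps above correspond roughly to the following command sequence. This is a hedged sketch: the server address, usernames, namespaces, and object names are placeholders, and the login flags may differ per environment.

    # Step 1: log in to the Supervisor Cluster context
    kubectl vsphere login --server=<supervisor cluster address> --vsphere-username <username> --insecure-skip-tls-verify
    kubectl config use-context <supervisor cluster context>

    # Step 2: locate the IPPool object for the workload cluster
    kubectl get ippool -n <workload cluster namespace>

    # Step 3: review the spec.subnets section of the IPPool
    kubectl describe ippool -n <workload cluster namespace> <ippool name>

    # Step 4: list the machines that still exist; subnet names without a matching machine are stale
    kubectl get machines -n <workload cluster namespace>

    # Step 5: remove only the subnet entries named after the missing node(s)
    kubectl edit ippool -n <workload cluster namespace> <ippool name>

    # Step 6: confirm the stale entries no longer appear in the IPPool
    kubectl get ippool -n <workload cluster namespace> <ippool name> -o yaml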