Multiple TKCs went into NotReady state after upgrading the Supervisor Cluster to v1.24.9

Article ID: 319378


Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • After upgrading the Supervisor Cluster to version "v1.24.9+vmware.1-vsc0.1.4-21450065", multiple TKCs are in "NotReady" state (a quick triage sketch follows this list).

  • The VMOP pod logs show the following error:

# kubectl logs -n vmware-system-vmop vmware-system-vmop-controller-manager-5b8c97598d-96smg manager

...

E0512 16:49:50.916840    1 network_provider.go:681] vsphere "msg"="Failed to create vnetIf for vif" "error"="timed out waiting for the condition" "vmName"="ns-infra/test1test-gdrnm-g9ckq" "vif"={"networkType":"nsx-t","networkName":"test1test-4g8k6-vnet"}

...

  • The NCP pod logs show the following error:

# kubectl logs -n vmware-system-nsx vmware-system-nsx-ncp-5b8c97598d-96smg nsx-ncp

...

[ncp GreenThread-117 I] nsx_ujo.common.controller VirtualNetworkInterfaceController worker 1 failed to sync 91f14c74-2cdd-461d-874d-f98a432208fg due to multiple object exception: Multiple K8sGroup objects were found for {'project': 'e554rew3-1cd8-4b49-9626-5jgru485144be', 'group_type': 'ncp/lb_sourceip'}

...

  • The vnets are created, but describing the corresponding VirtualNetworkInterface (vnetif) shows the following error:

# kubectl describe vnetif -n ns-infra test1test-5hn6c-vnet-teste9999-w6cdg-8lphn-lsp

...

Spec:

 Virtual Network: test1test-5hn6c-vnet

Status:

Events:

 Type   Reason          Age  From        Message

 ----   ------          ---- ----        -------

 Warning FailedRealizeNSXResource 40m  nsx-container-ncp Generic error occurred during realizing network for VirtualNetworkInterface

...

  • New VMs are not powering on:

# kubectl get vm -n ns-infra | grep test1test

virtualmachine.vmoperator.vmware.com/test1test-gdrnm-g9ckq    poweredOff  97m
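
The symptoms above can be confirmed quickly from the Supervisor Cluster. A minimal triage sketch; ns-infra is the namespace from the output above and will differ in your environment:

# kubectl get tkc -A

# kubectl get vnetif -n ns-infra

Any TKC reporting not ready and any vnetif that never reaches a realized status point to the same NCP sync failure.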

Cause

Two lb_sourceip groups are created for the same project. This prevents NCP from identifying the correct VirtualNetwork and blocks NCP from updating the source group to include the new ControlPlaneVM segment.
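
The duplication can be observed directly by listing the groups under the namespace's domain with the NSX Policy API. A sketch, assuming the same placeholder domain path used in the workaround below and that the jq utility is available:

# curl -sku admin https://<NSX-IP-MANAGER>/policy/api/v1/infra/domains/domain-c####:<id>/groups | jq -r '.results[] | select(.id | endswith("_lb_sourceip")) | .id'

Two IDs ending in _lb_sourceip for the same project indicate this issue.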

Resolution

This issue is fixed in vSphere 8.0 Update 2.


Workaround:

The workaround is to delete the duplicated group via the NSX API. Follow the steps below:

1. Identify the duplicated group by checking the NCP log:

 

# kubectl logs -n vmware-system-nsx vmware-system-nsx-ncp-5b8c97598d-96smg nsx-ncp

... 

'parent_path': '/infra/domains/domain-c####:<id>/groups/src_####-####-####-####-####_lb_sourceip'

...

'parent_path': '/infra/domains/domain-c####:<id>/groups/src_ns-infra_lb_sourceip'

...
 

NOTE: The old GOOD lb_sourceip group is the one whose path contains src_<ID>_lb_sourceip; the new BAD group's path contains only src_ns-infra_lb_sourceip.
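
To pull only the relevant lines out of the NCP log, the parent_path entries can be filtered and de-duplicated. A sketch; substitute your own NCP pod name for the example name below:

# kubectl logs -n vmware-system-nsx vmware-system-nsx-ncp-5b8c97598d-96smg nsx-ncp | grep -o "'parent_path': '[^']*lb_sourceip'" | sort -u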
 

2. Compare the groups noted in the log by querying each of them with the NSX API:

# curl -ku admin https://<NSX-IP-MANAGER>/policy/api/v1/infra/domains/domain-c####:<id>/groups/src_ns-infra_lb_sourceip
 

{

...

"display_name" : "src-ns-infra-lb-sourceip",

...

}

 

# curl -ku admin https://<NSX-IP-MANAGER>/policy/api/v1/infra/domains/domain-c####:<id>/groups/src_####-####-####-####-####_lb_sourceip

{

...

"display_name" : "src-ns-infra-lb-sourceip",

...

}
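
Both responses carry the same display_name, which confirms the duplication. To compare the two objects field by field, the responses can be saved and diffed. A sketch using the two group paths from step 1:

# curl -sku admin https://<NSX-IP-MANAGER>/policy/api/v1/infra/domains/domain-c####:<id>/groups/src_####-####-####-####-####_lb_sourceip | python3 -m json.tool > good.json

# curl -sku admin https://<NSX-IP-MANAGER>/policy/api/v1/infra/domains/domain-c####:<id>/groups/src_ns-infra_lb_sourceip | python3 -m json.tool > bad.json

# diff good.json bad.json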

 

3. After verifying the duplicate group, delete it using the NSX API:

# curl -sku 'admin' https://<NSX-IP-MANAGER>/policy/api/v1/infra/domains/domain-c####:<id>/groups/src_ns-infra_lb_sourceip -H "X-Allow-Overwrite:true" -X DELETE
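
To confirm the deletion took effect, repeat the GET from step 2; an HTTP 404 means the duplicate group is gone. A minimal check:

# curl -sku admin -o /dev/null -w "%{http_code}\n" https://<NSX-IP-MANAGER>/policy/api/v1/infra/domains/domain-c####:<id>/groups/src_ns-infra_lb_sourceip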

 

4. Finally, restart the NCP pod to pick up the new configuration:

nsx_id=$(kubectl get pods -A | grep -i nsx-ncp | awk '{ print $2 }') && kubectl delete pod -n vmware-system-nsx $nsx_id
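
Once the NCP pod restarts and resyncs, the pending VMs should power on and the TKCs should return to Ready. A sketch of the final checks, reusing the objects from the symptoms above:

# kubectl get vm -n ns-infra | grep test1test

# kubectl get tkc -A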