VKS Supervisor Enablement stuck at "Configured Supervisor Control plane VM's Workload Network" when using AVI and NSX



Article ID: 406786


Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

When enabling a Supervisor with the AVI load balancer and NSX networking, the enablement process gets stuck at the step "Configured Supervisor Control plane VM's Workload Network".

On the Supervisor control plane nodes, the workload network interface (eth1) is not configured. As a result, one of the three CoreDNS pods is in CrashLoopBackOff and the other two are Pending:

# kubectl get pods -A
NAMESPACE     NAME                        READY   STATUS             RESTARTS         AGE
kube-system   coredns-84787595bc-88gqg    0/1     CrashLoopBackOff   21 (2m55s ago)   71m
kube-system   coredns-84787595bc-ntpqv    0/1     Pending            0                71m
kube-system   coredns-84787595bc-wf7bt    0/1     Pending            0                71m

The READY column of the VirtualNetworkInterfaces is empty:

# kubectl -n kube-system get virtualnetworkinterfaces
NAME          READY   AGE
vif-vm-1038           70m
vif-vm-1039           65m
vif-vm-1040           65m
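
The missing workload network configuration can also be confirmed directly on a Supervisor control plane node. This is a minimal check, assuming SSH access to the node (see the Resolution section) and that eth1 is the workload interface, as in this environment:

# On a Supervisor control plane node: check whether eth1 has a workload network IP address assigned
ip addr show eth1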

One nsx-ncp pod is Running on the Supervisor node that holds the floating IP (FIP), but it does not produce any useful error messages.

# kubectl -n vmware-system-nsx get pods
NAME                       READY   STATUS    RESTARTS      AGE
nsx-ncp-7b47d9cf67-cmfv9   2/2     Running   3 (93m ago)   106m
nsx-ncp-cb55775b5-ctjpl    0/2     Pending   0             3s

Environment

vSphere Kubernetes Service - vSphere 8U3

NSX networking stack with AVI Load Balancer configured

Cause

The nsx-ncp pod entered Restore mode and did not return to Normal mode, which stopped the creation of the workload network segments.

This issue can occur in environments that use both AVI and NSX when the NSX Manager has previously been restored from a backup.
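
Before applying the fix, the Restore state can be checked on the NcpConfig object that the Resolution below patches. This is an illustrative check only; the exact fields present depend on the NCP version:

# Inspect the object NCP uses to track the restore status
kubectl get ncpconfig nsx-restore-status -o yaml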

Resolution

Switch the nsx-ncp pod from Restore mode to Normal mode.

1. Log in to the Supervisor node via SSH

Follow the KB: Troubleshooting vSphere Supervisor Control Plane VMs

2. Get restored_end_time from NSX Manager via API

NSX_FQDN=<NSX_MANAGER_FQDN>
NSX_PASS=<NSX_MANAGER_PASSWORD>

RESTORE_END_TIME=$(curl -ks -u admin:"${NSX_PASS}" -X GET https://${NSX_FQDN}/api/v1/cluster/restore/status | jq -r .restore_end_time)

echo $RESTORE_END_TIME
#> 1754697444409
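
If the returned value is empty, the full API response can be inspected first. This optional step reuses the same NSX restore status API call and does not rely on any other fields:

# Optional: view the full restore status response from NSX Manager
curl -ks -u admin:"${NSX_PASS}" -X GET https://${NSX_FQDN}/api/v1/cluster/restore/status | jq .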

3. Patch nsx-restore-status

kubectl patch ncpconfig nsx-restore-status --type='merge' -p "{\"metadata\":{\"annotations\":{\"restore_end_time\":\"${RESTORE_END_TIME}\"}}}"
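
Optionally, read the annotation back to verify that the patch was applied; this check is not part of the original procedure:

# Confirm the annotation now carries the restore end time retrieved from NSX Manager
kubectl get ncpconfig nsx-restore-status -o jsonpath='{.metadata.annotations.restore_end_time}'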

4. Restart the nsx-ncp pod manually

# Delete the current nsx-ncp pod; it will be recreated automatically
kubectl -n vmware-system-nsx delete pod nsx-ncp-xxxxxxxxx

# Check
kubectl -n vmware-system-nsx get pods
#> NAME                      READY   STATUS    RESTARTS   AGE
#> nsx-ncp-cb55775b5-np2tg   0/2     Pending   0          7s
#> nsx-ncp-cb55775b5-vxj9v   2/2     Running   0          44s

5. The workload network interfaces will be created successfully, and the Supervisor Enablement process will resume.

kubectl get virtualnetworkinterfaces -n kube-system
#> NAME          READY   AGE
#> vif-vm-1038   True    179m
#> vif-vm-1039   True    175m
#> vif-vm-1040   True    175m
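
As a final check (not part of the original steps), the CoreDNS pods from the Issue section should also return to Running once the workload network interfaces are configured:

# CoreDNS pods should move out of Pending/CrashLoopBackOff
kubectl -n kube-system get pods | grep coredns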