NAPP NSXi pods will not start after Workload Cluster update

Products

VMware vCenter Server

Issue/Introduction

This article describes how to return the NAPP deployment to an "Active" state and restore NAPP functionality.

Symptoms:

NAPP shows as "degraded" or "unavailable".
The workload cluster object shows as "ready = true".
Pods in the nsxi-platform namespace of the workload cluster remain in an "Init:0/X" or "CrashLoopBackOff" state, as in the example below:

root [ /home/vmware-system-user ]# kubectl get pods -A | grep -v Running
NAMESPACE NAME READY STATUS RESTARTS AGE
nsxi-platform common-agent-669fbf8d57-m7w9d 0/1 Init:0/1 0 153m
nsxi-platform correlated-flow-hourly-reindex-27978960-6s2xp 0/1 Completed 0 39m
nsxi-platform druid-rule-monitor-27978990-crtlv 0/1 Completed 0 9m53s
nsxi-platform feature-service-flow-feature-creator-27975000-x6n7c 0/1 Init:1/4 0 153m
nsxi-platform feature-service-name-feature-creator-27975000-ktsfk 0/1 Init:1/4 0 153m
nsxi-platform infraclassifier-pod-cleaner-27978995-tpcd2 0/1 Completed 0 4m53s
nsxi-platform kafka-0 0/1 Init:0/2 2 124m
nsxi-platform kafka-1 0/1 Init:0/2 2 152m
nsxi-platform kafka-2 0/1 Init:0/2 1 89m
nsxi-platform latestflow-588d4f6dc-425nx 0/1 Init:0/1 0 153m
nsxi-platform latestflow-588d4f6dc-b8525 0/1 Init:0/1 0 153m
nsxi-platform latestflow-588d4f6dc-fq8rv 0/1 Init:0/1 0 89m
nsxi-platform latestflow-588d4f6dc-lbrhn 0/1 Init:0/1 0 153m
nsxi-platform llanta-detectors-0 1/4 CrashLoopBackOff 46 83m
nsxi-platform malware-prevention-feature-switch-watcher-notifier-ndr-77cxwl5q 0/1 CrashLoopBackOff 19 83m
nsxi-platform metrics-app-server-765d859454-2nmrm 0/1 Init:0/1 0 153m
nsxi-platform metrics-nsx-config-55449bf557-b4mp9 0/1 CrashLoopBackOff 23 83m
nsxi-platform ncp-multitool 0/1 ImagePullBackOff 0 109m
nsxi-platform nsx-config-67696bd6fb-7s5hd 0/1 Init:1/5 0 153m
nsxi-platform nsx-ndr-feature-switch-watcher-notifier-ndr-6b97b5c75b-cn6kj 0/1 CrashLoopBackOff 19 83m
nsxi-platform nsx-ndr-worker-file-event-processor-8564dc7dd5-776dt 1/2 CrashLoopBackOff 20 83m
nsxi-platform nsx-ndr-worker-file-event-uploader-76ffc4c86c-qfbw2 1/2 CrashLoopBackOff 20 83m
nsxi-platform nsx-ndr-worker-ids-event-processor-7c55478b9d-5g9wj 1/2 CrashLoopBackOff 21 83m
nsxi-platform nsx-ndr-worker-monitored-host-processor-755cb77448-cblxf 1/2 CrashLoopBackOff 27 153m
nsxi-platform nsx-ndr-worker-monitored-host-uploader-864d49ff86-qzdsx 1/2 CrashLoopBackOff 20 83m
nsxi-platform nsx-ndr-worker-ndr-event-processor-5f5f5d849f-8d724 1/2 CrashLoopBackOff 20 83m
nsxi-platform nsx-ndr-worker-ndr-event-uploader-6799c9b75-npn5x 1/2 CrashLoopBackOff 20 83m
nsxi-platform nsx-ndr-worker-nta-event-processor-6c979dfbf6-2dtnf 1/2 CrashLoopBackOff 27 153m
nsxi-platform nsxi-platform-fluent-bit-96dnr 0/1 Init:0/1 0 141m
nsxi-platform nsxi-platform-fluent-bit-ltbtb 0/1 Init:0/1 0 140m
nsxi-platform nsxi-platform-fluent-bit-rqnvf 0/1 Init:0/1 0 150m
nsxi-platform nsxi-platform-fluent-bit-tbdk5 0/1 Init:0/1 0 89m
nsxi-platform nsxi-platform-fluent-bit-xxzwr 0/1 Init:0/1 0 150m
nsxi-platform nta-server-c7f7dc97-8kxdb 0/2 Init:1/5 0 89m
nsxi-platform pod-cleaner-27978990-km4xq 0/1 Completed 0 9m53s
nsxi-platform pubsub-699b76b656-vwklc 0/1 Init:0/1 0 153m
nsxi-platform recommendation-86c77c9f48-nhqdc 0/2 Init:1/3 0 153m
nsxi-platform reputation-service-feature-switch-watcher-notifier-dependehtb2l 0/1 CrashLoopBackOff 20 89m
nsxi-platform spark-app-context-driver 0/2 Init:0/3 0 71s
nsxi-platform spark-app-overflow-driver 0/2 Init:1/4 0 4m40s
nsxi-platform spark-app-rawflow-driver 1/2 Error 0 6m10s
nsxi-platform trust-manager-677cccff48-8smvl 0/1 CrashLoopBackOff 19 83m
nsxi-platform workload-9b7947578-sjp96 0/1 Init:0/3 0 153m
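
The listing above filters with grep -v Running, which still includes finished jobs in the "Completed" state. As a minimal sketch, the output can be narrowed to pods that are actually unhealthy. The helper name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods -n nsxi-platform --no-headers, where STATUS is the third column.

```shell
# Hypothetical helper: read "kubectl get pods --no-headers" output on stdin
# (single namespace, so the columns are NAME READY STATUS RESTARTS AGE) and
# print only the pods whose STATUS is neither Running nor Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed"'
}

# Usage against a live cluster (commented out so the block has no side effects):
# kubectl get pods -n nsxi-platform --no-headers | unhealthy_pods
```

An empty result suggests that every pod in the namespace is either running or has completed.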

Environment

VMware vCenter Server 7.0.x
VMware vCenter Server 8.0

Cause

This issue applies to NAPP deployments that use a single control plane node. When the cluster spawns a new control plane node, the kubeapi-egress-networkpolicy is not updated with the IP address of the new node. If the IP address of the control plane node has changed, the network policy therefore drops all traffic and leaves the pods unable to start.
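
One way to confirm this condition is to compare the CIDR entries in the policy with the current IP address of the control plane node. The sketch below is an illustration, not a procedure from this article: the helper name cidr_list_has_ip is hypothetical, and the jsonpath expression assumes the policy stores the address in standard NetworkPolicy egress ipBlock entries.

```shell
# Hypothetical helper: succeed if IP appears as "IP/<prefix>" in a
# space-separated list of CIDR strings. It does not test whether a wider
# range (for example a /24) happens to contain the address.
cidr_list_has_ip() {
  cidrs="$1"
  ip="$2"
  for c in $cidrs; do
    case "$c" in
      "$ip"/*) return 0 ;;
    esac
  done
  return 1
}

# Usage against a live cluster (commented out; CONTROL-PLANE-IP is a placeholder):
# cidrs=$(kubectl get networkpolicy -n nsxi-platform kubeapi-egress-networkpolicy \
#   -o jsonpath='{.spec.egress[*].to[*].ipBlock.cidr}')
# kubectl get nodes -o wide    # INTERNAL-IP column shows the node addresses
# cidr_list_has_ip "$cidrs" "CONTROL-PLANE-IP" || echo "policy does not list the control plane IP"
```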

Resolution

This will be fixed in a future release.

Workaround:
To work around this issue, follow the steps below:

Log in to your workload cluster using an account with "owner" permissions on the namespace
- # kubectl vsphere login --server=SUPERVISOR-CLUSTER-CONTROL-PLANE-IP --tanzu-kubernetes-cluster-name NAPP-CLUSTER-NAME --tanzu-kubernetes-cluster-namespace NAMESPACE-WHERE-NAPP-CLUSTER-IS-DEPLOYED --vsphere-username VCENTER-SSO-USER-NAME
- The uppercase fields above are placeholders that must be replaced with the values for your environment before the command will work for your cluster
Ensure your context is set to the NAPP cluster
- # kubectl config use-context NAPP-CLUSTER
Note the IPv4 address of the single control plane node for the cluster
- You can find this in the vSphere Client by selecting the VM in the navigator pane
Edit the network policy in the nsxi-platform namespace and update the cidr: field with the IP address noted in the previous step
- # kubectl edit networkpolicy -n nsxi-platform kubeapi-egress-networkpolicy
- Example of what the file will look like:

Once the edit is complete, press Esc and type :wq! to save the file and exit the editor, which persists the edit
The pods should restart and come up after the edit is complete
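
For environments where an interactive editor is inconvenient, the same change could in principle be scripted. The following is a minimal sketch rather than a procedure from this article: the helper name replace_cidr_ip and the file name policy-backup.yaml are hypothetical, and it assumes the policy contains a line of the form cidr: OLD-IP/PREFIX. Review the rewritten YAML before applying it.

```shell
# Hypothetical helper: read NetworkPolicy YAML on stdin and rewrite any
# "cidr: OLD_IP/<prefix>" line so that it uses NEW_IP, keeping the prefix.
replace_cidr_ip() {
  old_ip="$1"
  new_ip="$2"
  # Escape the dots so the old address is matched literally, not as a regex.
  old_re=$(printf '%s' "$old_ip" | sed 's/\./\\./g')
  sed "s|\(cidr:[[:space:]]*\)${old_re}/|\1${new_ip}/|"
}

# Usage against a live cluster (commented out; the uppercase names are placeholders):
# kubectl get networkpolicy -n nsxi-platform kubeapi-egress-networkpolicy -o yaml > policy-backup.yaml
# replace_cidr_ip OLD-CONTROL-PLANE-IP NEW-CONTROL-PLANE-IP < policy-backup.yaml | kubectl apply -f -
# kubectl get pods -n nsxi-platform -w    # watch the pods recover
```

Saving the original policy to a file first keeps a copy that can be restored if the rewritten version is not what was intended.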