NSX Malware Prevention and Network Detection and Response upgrade to 3.2.1 or 3.2.1.1 fails with pods in ImagePullBackOff state
Article ID: 319051

Products

VMware NSX

Issue/Introduction

Customers cannot use the NSX ATP 3.2.1 or NSX ATP 3.2.1.1 builds.


The Malware Prevention and Network Detection and Response upgrade fails in the following scenarios:
     - From NSX Advanced Threat Prevention (ATP) 3.2.0 to NSX ATP 3.2.1 / 3.2.1.1
     - From NSX ATP 3.2.1 to NSX ATP 3.2.1.1
 
Other Symptoms:

  1. A Failed status for the NDR and cloud-connector pods is shown on the Upgrade UI screen.
  2. For an NDR upgrade, a few pods prefixed with "nsx-ndr" are in ImagePullBackOff state.
  3. For an MPS upgrade, a few pods prefixed with "cloud-connector" are in ImagePullBackOff state.
  4. Although the upgrade fails once the customer clicks the upgrade button, MPS and NDR functionality continues to work as before. The failure impacts only the upgrade and does NOT impact any existing functionality.
     
    Log location: NAPP support bundle

Cause

  1. In the upgrade workflow, the user specifies the Helm and Docker repositories from which the latest images should be pulled for the platform and installed features.
  2. For MPS and NDR, the pods point to an incorrect Docker registry. As a result, some pods have an ImagePullBackOff status.

Resolution

This issue is resolved in NSX Advanced Threat Prevention 4.0.1.

Workaround:

  1. SSH into the NSX Manager and elevate to the root user with the st en command.
  2. Identify the pods that are in ImagePullBackOff state.
        Command: napp-k get pods | grep "ImagePullBackOff"
        
        NDR failing pods
        NAME                                                           READY   STATUS           RESTARTS   AGE
        nsx-ndr-upload-config-5c56785b85-qv64h                         0/2     ImagePullBackOff   0          6d
        nsx-ndr-worker-file-event-processor-7f55cf97d6-d6d8p           0/2     ImagePullBackOff   0          6d
        nsx-ndr-worker-file-event-uploader-d48c7fbd-smvtz              0/2     ImagePullBackOff   0          6d
        nsx-ndr-worker-ids-event-processor-7f96d9c87f-wp929            0/2     ImagePullBackOff   0          6d
        nsx-ndr-worker-monitored-host-uploader-85d6d46fdc-nd7g4        0/2     ImagePullBackOff   0          6d
        nsx-ndr-worker-ndr-event-processor-6947fb9cb8-jj5kh            0/2     ImagePullBackOff   0          6d
        nsx-ndr-worker-ndr-event-uploader-578b5dbfb-2s9j8              0/2     ImagePullBackOff   0          6d
        
        MPS failing pods
        NAME                                                              READY   STATUS             RESTARTS   AGE
        cloud-connector-check-license-status-5dffd77ff4-9zpff             0/2     ImagePullBackOff   0          3m27s
        cloud-connector-proxy-78b7fb7857-zf5gr                            0/2     ImagePullBackOff   0          3m27s
        cloud-connector-update-license-status-795d865864-x7b52            0/2     ImagePullBackOff   0          3m27s
        reputation-service-5d498b65f8-2htvx                               0/1     ImagePullBackOff   0          24s
        reputation-service-feature-switch-watcher-notifier-dependedr2nn   0/1     ImagePullBackOff   0          76s
  3. Get the deployment name for each failing pod by matching the prefix.

    Command: napp-k get deployments

        NDR deployments
        NAME                                                              READY   UP-TO-DATE   AVAILABLE   AGE
        nsx-ndr-upload-config                                             1/1     1            1           163m
        nsx-ndr-worker-file-event-processor                               1/1     1            1           4h25m
        nsx-ndr-worker-file-event-uploader                                1/1     1            1           3h13m
        nsx-ndr-worker-ids-event-processor                                1/1     1            1           3h13m
        nsx-ndr-worker-monitored-host-uploader                            1/1     1            1           3h13m
        nsx-ndr-worker-ndr-event-processor                                1/1     1            1           3h13m
        nsx-ndr-worker-ndr-event-uploader                                 1/1     1            1           3h13m

        MPS deployments
        NAME                                                              READY   UP-TO-DATE   AVAILABLE   AGE
        cloud-connector-check-license-status                              1/1     1            1           4h25m
        cloud-connector-proxy                                             1/1     1            1           3h13m
        cloud-connector-update-license-status                             1/1     1            1           3h13m
        reputation-service                                                1/1     1            1           3h13m
        reputation-service-feature-switch-watcher-notifier                1/1     1            1           3h13m
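
        The deployment name can also be derived mechanically from a failing pod name, since Kubernetes appends a ReplicaSet hash and a pod suffix to the deployment name. A minimal sketch, using a pod name from the step 2 output:

        ```shell
        # Strip the trailing ReplicaSet hash and pod suffix that Kubernetes
        # appends to the deployment name when naming pods.
        pod="nsx-ndr-upload-config-5c56785b85-qv64h"
        deployment=$(echo "$pod" | sed -E 's/-[0-9a-z]+-[0-9a-z]+$//')
        echo "$deployment"   # nsx-ndr-upload-config
        ```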

     

  4. Edit the deployment and update the image field with the correct Docker registry (the one provided by the user during the upgrade workflow). Note that only the registry part of the image field, for example "harbor.nsbu.eng.vmware.com/nsx_intelligence_ob/clustering", should be updated.

      
        Command: napp-k edit deployment cloud-connector-check-license-status
        This opens the deployment in the vi editor. To open it in a different editor instead, first execute:
        export KUBE_EDITOR=vim.tiny

        For instance, if the Docker registry provided by the user was "projects.registry.vmware.com/nsx_application_platform/clustering", then the update below is needed.
       
    Existing value example: 
        image: harbor.nsbu.eng.vmware.com/nsx_intelligence_ob/clustering/nsx-cloud-connector-check-nsx-licensing-status-with-lastline-cloud:123-c33a1aa7.bionic
       
    Corrected value: 
        image: projects.registry.vmware.com/nsx_application_platform/clustering/nsx-cloud-connector-check-nsx-licensing-status-with-lastline-cloud:123-c33a1aa7.bionic
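
        Only the registry/project prefix changes in this edit; the trailing image name and tag are preserved. A minimal shell sketch of the substitution, using the values from the example above:

        ```shell
        # Keep the image name and tag; swap only the registry/project prefix.
        old_image="harbor.nsbu.eng.vmware.com/nsx_intelligence_ob/clustering/nsx-cloud-connector-check-nsx-licensing-status-with-lastline-cloud:123-c33a1aa7.bionic"
        new_registry="projects.registry.vmware.com/nsx_application_platform/clustering"
        new_image="${new_registry}/${old_image##*/}"
        echo "$new_image"
        ```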
       

  5. Repeat step 4 for all the deployments identified in step 3 for a given vertical/feature.
        Note that for MPS, the cloud-connector and reputation-service pods do not fail at the same time:
        a. Apply the workaround to the cloud-connector pods first.
        b. Once the cloud-connector pods upgrade successfully, the reputation-service pods appear in ImagePullBackOff state.
        c. Apply the workaround to each new ImagePullBackOff pod as it appears.
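
        If several deployments need the same change, the per-deployment edits in step 4 can be generated rather than typed by hand. The sketch below only prints the commands instead of running them; it assumes napp-k accepts the same subcommands as kubectl (including set image), and the container name "main" is a placeholder that must be checked against the actual deployment spec:

        ```shell
        # Print (do not run) a "set image" command per affected deployment.
        # The container name "main" is hypothetical; check the real name in
        # the deployment spec before running anything.
        new_registry="projects.registry.vmware.com/nsx_application_platform/clustering"
        cmds=$(
          while read -r deployment container image; do
            echo "napp-k set image deployment/${deployment} ${container}=${new_registry}/${image##*/}"
          done <<'EOF'
        cloud-connector-check-license-status main harbor.nsbu.eng.vmware.com/nsx_intelligence_ob/clustering/nsx-cloud-connector-check-nsx-licensing-status-with-lastline-cloud:123-c33a1aa7.bionic
        EOF
        )
        echo "$cmds"
        ```

        Review the printed commands, then execute them one by one; afterwards, re-check pod status with napp-k get pods as in step 2.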

  6. After executing the above steps, the upgrade succeeds and its status shows as Complete in the UI. The installed version can also be verified with the command below.
        Command: napp-h list

        Continue to monitor the backend pods after a successful upgrade. If any pod enters ImagePullBackOff state again, repeat steps 2, 3, and 4 above.

 

 

Additional Information

After the upgrade, if a user wants to uninstall the MPS or NDR feature, execute the commands below to force deletion.
    Commands:
    napp-k delete job cloud-connector-reset --grace-period=0 --force
    napp-k delete job cloud-connector-cleanup --grace-period=0 --force