503 Service Unavailable Exception seen while running recommendations or in the notifications alert

Article ID: 378826


Products

VMware vDefend Firewall with Advanced Threat Prevention
VMware vDefend Firewall

Issue/Introduction

You are running NSX Application Platform (NAPP) and are unable to select virtual machines in the Start Recommendations dialog due to a 503 Service Unavailable Exception.
You may also see an alarm notification reporting this exception.

Environment

NAPP 4.2 and prior

Cause

The Workload service serves the APIs that return the virtual machines classified as infrastructure virtual machines. These classifications are stored in the workload database by the Infra Classifier service, which processes the flows received from the hosts to determine whether the source or destination is running an infrastructure service. If the flows received from the hosts contain a large number of private IPs that do not belong to any managed NSX entities, many of these IPs can also be written into the workload database. Eventually the database can grow large enough to cause an Out of Memory exception in the Workload service when it attempts to retrieve the classified virtual machines.
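
To gauge how much classification data has accumulated, you can count the rows in the workload table with a read-only query. This is a sketch that reuses the database access pattern from the Resolution steps below (the pace database on the postgresql-ha-postgresql-0 pod):

% napp-k exec postgresql-ha-postgresql-0 -- bash -c "export PGPASSWORD=\$POSTGRES_PASSWORD; psql -d pace -c 'SELECT count(*) FROM workload;'"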

To identify whether the workload pod is currently experiencing OutOfMemory exceptions, run the following commands on the NSX Manager using root user credentials:

% napp-k get pods | grep workload

% napp-k logs workload-<12345678>

(replace the string in <> with the pod name seen in your environment)

 

...
2024-09-23T00:00:32.39322057Z stderr F Exception in thread "https-jsse-nio-7669-Poller" java.lang.OutOfMemoryError: Java heap space
...
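
If the log output is long, you can filter for the error directly, for example:

% napp-k logs workload-<12345678> | grep -i OutOfMemoryError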

Resolution

Use these steps if you want to preserve the functionality of the Infra Classifier service. This service classifies workloads as infrastructure machines so that they can be excluded from consideration in recommendation analysis. If you frequently activate this option when running recommendations, follow the steps below to fix the Infra Classifier service and prevent the out-of-memory condition.

If you keep the default setting (the option to exclude infrastructure workloads deactivated), you can instead work around the issue by disabling the Infra Classifier service, as described in this KB: https://knowledge.broadcom.com/external/article?articleNumber=378823


The steps below create a background job that runs once a day (at 4:00 AM) to unconditionally remove any unmanaged internal computes (for example, IP-123.45.67.8) from the workload database. The job also removes any VMs that have been deleted from the inventory, which can likewise add load to the Workload service.
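
After you complete step 1 below, you can confirm the job's daily 4:00 AM schedule, which corresponds to the standard cron expression 0 4 * * *, with a read-only check (the cronjob name clean-workload matches the output shown in step 1):

% napp-k get cronjob clean-workload -o jsonpath='{.spec.schedule}'
0 4 * * *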

  1. Note: the steps below should be run on the NSX Manager with root access. If you are unable to run them on the manager, you can run them on any system that has access to the kubeconfig file for the cluster.
    Download the attachment workaroundForWorkloadDB.tar.gz, then do the following to apply the patch:

    • Verify the integrity of file

      % md5sum workaroundForWorkloadDB.tar.gz

      a5ea7a817c09066796b8a27425db24db  workaroundForWorkloadDB.tar.gz


    • Extract the tar

      tar zxvf workaroundForWorkloadDB.tar.gz
      x cleanDb/
      x cleanDb/clean_workload.sh
      x cleanDb/cronjob_clean_workload.yaml

    • Verify the owner is root:

      cd cleanDb                                                               
      cleanDb % ls -l                                                                                  
      total 16
      -rwxrwxr-x@ 1 root  wheel  1710 Sep 16 15:15 cronjob_clean_workload.yaml
      -rwxrwxr-x@ 1 root  wheel  2075 Oct  7 11:23 clean_workload.sh

    • Locate your kubeconfig file

      alias | grep napp-k
      alias napp-k='kubectl --kubeconfig <Your Kubeconfig File> -n nsxi-platform'

    • Run the script. If there are issues interpreting the bash shebang (for example, due to PuTTY or Windows hosts), try the alternative command:

      $ ./clean_workload.sh --kubeconfig=<Your Kubeconfig File>
      OR
      $ bash clean_workload.sh --kubeconfig=<Your Kubeconfig File>

      Upon successful execution, your terminal should print something like the following:

      $ ./clean_workload.sh --kubeconfig=".kube/config"
      Using cronjob spec cronjob_clean_workload.yaml
      clients images found: ssp.napp.local/clustering/third-party/clients:23624728
      ssp.napp.local/clustering/third-party/clients@sha256:6cbc50d76f0d050423ede6f2bf6a8d9df210be3b9726e1f733fa75e534ce7efb
      Selected clients image: ssp.napp.local/clustering/third-party/clients@sha256:6cbc50d76f0d050423ede6f2bf6a8d9df210be3b9726e1f733fa75e534ce7efb
      Replacement successful
      cronjob.batch/clean-workload configured
      Successfully created cronjob

      Once the background periodic job is created, you may need to wait up to 24 hours for the cronjob to run. Refer to step 2 to clean the database immediately.
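
      To confirm that the cronjob exists and to see when it last ran, these read-only checks can help:

      % napp-k get cronjob clean-workload
      % napp-k get jobs | grep clean-workload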

  2. Optional: To address the issue immediately, without waiting for the cronjob created in step 1 to trigger, use the following commands to clean up the entries:

    $ napp-k exec postgresql-ha-postgresql-0 -- bash -c "export PGPASSWORD=\$POSTGRES_PASSWORD; psql -d pace -c \"DELETE FROM workload
        WHERE id LIKE 'IP-%'
          AND id ~ '^IP-\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$';\""

     

    $ napp-k exec postgresql-ha-postgresql-0 -- bash -c "export PGPASSWORD=\$POSTGRES_PASSWORD; psql -d pace -c \"DELETE FROM workload WHERE id IN (SELECT id FROM normalizedcomputeconfig WHERE deleted=true);\""
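
    If you want to preview how many rows the first DELETE statement would remove before running it, a read-only count query with the same predicate is a safe check:

    $ napp-k exec postgresql-ha-postgresql-0 -- bash -c "export PGPASSWORD=\$POSTGRES_PASSWORD; psql -d pace -c \"SELECT count(*) FROM workload WHERE id LIKE 'IP-%' AND id ~ '^IP-\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$';\""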

  3. After the cronjob applied in step 1 has had a chance to run (or immediately after running step 2), restart the workload pod:

    % napp-k delete pod workload-<12345678>
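
    The pod is recreated automatically (assuming, as is standard for NAPP services, it is managed by a Deployment). You can confirm it returns to Running state with the same command used earlier:

    % napp-k get pods | grep workload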


    Troubleshooting:

    • If the script fails for some reason, all that is needed is to query for the valid third-party/clients images in the customer environment (the following command is read-only):

       

      $ napp-k get pods --all-namespaces -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].image}" | tr -s '[[:space:]]' '\n' | sort | uniq | grep third-party/clients
      ssp.napp.local/clustering/third-party/clients:23624728
      ssp.napp.local/clustering/third-party/clients@sha256:6cbc50d76f0d050423ede6f2bf6a8d9df210be3b9726e1f733fa75e534ce7efb

    • Now choose one of the images, preferring the sha256 digest form if possible. Replace image: {{clients_image_url}} in the cronjob_clean_workload.yaml file with the chosen image, then run the kubectl apply line from the end of the clean_workload.sh script, as sketched below.
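
      For example, assuming the {{clients_image_url}} placeholder appears literally in the file, the substitution and apply could look like this (a sketch using the sha256 image from the example output above; the authoritative apply command is the one at the end of clean_workload.sh):

      $ sed -i 's|{{clients_image_url}}|ssp.napp.local/clustering/third-party/clients@sha256:6cbc50d76f0d050423ede6f2bf6a8d9df210be3b9726e1f733fa75e534ce7efb|' cronjob_clean_workload.yaml
      $ napp-k apply -f cronjob_clean_workload.yaml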

Attachments

workaroundForWorkloadDB(1).tar.gz