503 Service Unavailable Exception seen while running recommendations or in the notifications alert

Article ID: 378826


Products

VMware vDefend Firewall with Advanced Threat Prevention
VMware vDefend Firewall

Issue/Introduction

You are running NSX Application Platform (NAPP) and are unable to select virtual machines in the Start Recommendations dialog due to a 503 Service Unavailable Exception.
You may also see an alarm notification reporting this exception.

Environment

NAPP 4.2 and prior

Cause

The Workload service serves the APIs that return the virtual machines classified as infrastructure virtual machines. These classifications are stored in the workload database by the Infra Classifier service, which processes the flows received from the hosts to determine whether the source or destination is running an infrastructure service. If the flows received from the hosts contain a large number of private IPs that do not belong to any managed NSX entities, many of these IPs can also be written into the workload database. Eventually the database can grow large enough to cause an Out of Memory exception in the Workload service when it attempts to retrieve the classified virtual machines.
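
To gauge how much classification data has accumulated, you can count the rows in the workload table with a read-only query. This is a sketch that reuses the database access pattern from the Resolution steps below (the pace database on the postgresql-ha-postgresql-0 pod):

% napp-k exec postgresql-ha-postgresql-0 -- bash -c "export PGPASSWORD=\$POSTGRES_PASSWORD; psql -d pace -c 'SELECT count(*) FROM workload;'"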

To identify whether the workload pod is currently experiencing OutOfMemory exceptions, run the following commands on the NSX Manager using root user credentials:

% napp-k get pods | grep workload

% napp-k logs workload-<12345678>

(replace the string in <> with the pod name seen in your environment)

 

...
2024-09-23T00:00:32.39322057Z stderr F Exception in thread "https-jsse-nio-7669-Poller" java.lang.OutOfMemoryError: Java heap space
...
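
If the log output is long, you can filter for the error directly, for example:

% napp-k logs workload-<12345678> | grep -i OutOfMemoryError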

Resolution

Use these steps if you want to preserve the functionality of the Infra Classifier service. This service classifies workloads as infrastructure machines so that they can be excluded from consideration in recommendation analysis. If you frequently activate this option when running recommendations, follow the steps below to fix the Infra Classifier service and prevent the out-of-memory condition.

If you keep the default setting (the option to exclude infrastructure workloads deactivated), you can instead work around the issue by disabling the Infra Classifier service, as described in this KB: https://knowledge.broadcom.com/external/article?articleNumber=378823


The steps below create a background job that runs once a day (at 4:00 AM) to unconditionally remove any unmanaged internal computes (for example, IP-123.45.67.8) from the workload database. The job also removes any VMs that have been deleted from the inventory, which can likewise add load to the Workload service.
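
After you complete step 1 below, you can confirm the job's daily 4:00 AM schedule, which corresponds to the standard cron expression 0 4 * * *, with a read-only check (the cronjob name clean-workload matches the output shown in step 1):

% napp-k get cronjob clean-workload -o jsonpath='{.spec.schedule}'
0 4 * * *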

  1. Note: the steps below should be run on the NSX Manager with root access. If you are unable to run them on the manager, you can run them on any system that has access to the kubeconfig file for the cluster.
    Download the attachment workaroundForWorkloadDB.tar.gz, then do the following to apply the patch:

    • Verify the integrity of file

      % md5sum workaroundForWorkloadDB.tar.gz

      a5ea7a817c09066796b8a27425db24db  workaroundForWorkloadDB.tar.gz


    • Extract the tar

      tar zxvf workaroundForWorkloadDB.tar.gz
      x cleanDb/
      x cleanDb/clean_workload.sh
      x cleanDb/cronjob_clean_workload.yaml

    • Verify the owner is root:

      cd cleanDb                                                               
      cleanDb % ls -l                                                                                  
      total 16
      -rwxrwxr-x@ 1 root  wheel  1710 Sep 16 15:15 cronjob_clean_workload.yaml
      -rwxrwxr-x@ 1 root  wheel  2075 Oct  7 11:23 clean_workload.sh

    • Locate your kubeconfig file

      alias | grep napp-k
      alias napp-k='kubectl --kubeconfig <Your Kubeconfig File> -n nsxi-platform'

    • Run the script. If there are issues interpreting the bash shebang (for example, due to PuTTY or Windows hosts), try the alternative command:

      $ ./clean_workload.sh --kubeconfig=<Your Kubeconfig File>
      OR
      $ bash clean_workload.sh --kubeconfig=<Your Kubeconfig File>

      Upon successful execution, your terminal should print something like the following:

      $ ./clean_workload.sh --kubeconfig=".kube/config"
      Using cronjob spec cronjob_clean_workload.yaml
      clients images found: ssp.napp.local/clustering/third-party/clients:23624728
      ssp.napp.local/clustering/third-party/clients@sha256:6cbc50d76f0d050423ede6f2bf6a8d9df210be3b9726e1f733fa75e534ce7efb
      Selected clients image: ssp.napp.local/clustering/third-party/clients@sha256:6cbc50d76f0d050423ede6f2bf6a8d9df210be3b9726e1f733fa75e534ce7efb
      Replacement successful
      cronjob.batch/clean-workload configured
      Successfully created cronjob

      Once the background periodic job is created, you may need to wait up to 24 hours for the cronjob to run. Refer to step 2 to clean the database immediately.
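
      To confirm that the cronjob exists and to see when it last ran, these read-only checks can help:

      % napp-k get cronjob clean-workload
      % napp-k get jobs | grep clean-workload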

  2. Optional: To address the issue immediately, without waiting for the cronjob created in step 1 to trigger, use the following commands to clean up the entries:

    $ napp-k exec postgresql-ha-postgresql-0 -- bash -c "export PGPASSWORD=\$POSTGRES_PASSWORD; psql -d pace -c \"DELETE FROM workload
        WHERE id LIKE 'IP-%'
          AND id ~ '^IP-\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$';\""

     

    $ napp-k exec postgresql-ha-postgresql-0 -- bash -c "export PGPASSWORD=\$POSTGRES_PASSWORD; psql -d pace -c \"DELETE FROM workload WHERE id IN (SELECT id FROM normalizedcomputeconfig WHERE deleted=true);\""
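
    If you want to preview how many rows the first DELETE statement would remove before running it, a read-only count query with the same predicate is a safe check:

    $ napp-k exec postgresql-ha-postgresql-0 -- bash -c "export PGPASSWORD=\$POSTGRES_PASSWORD; psql -d pace -c \"SELECT count(*) FROM workload WHERE id LIKE 'IP-%' AND id ~ '^IP-\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$';\""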

  3. After the cronjob applied in step 1 has had a chance to run (or immediately after running step 2), restart the workload pod:

    % napp-k delete pod workload-<12345678>
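
    The pod is recreated automatically (assuming, as is standard for NAPP services, it is managed by a Deployment). You can confirm it returns to Running state with the same command used earlier:

    % napp-k get pods | grep workload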


    Troubleshooting:

    • If the script fails for some reason, all that is needed is to query for the valid third-party/clients images in the customer environment (the following command is read-only):

       

      $ napp-k get pods --all-namespaces -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].image}" | tr -s '[[:space:]]' '\n' | sort | uniq | grep third-party/clients
      ssp.napp.local/clustering/third-party/clients:23624728
      ssp.napp.local/clustering/third-party/clients@sha256:6cbc50d76f0d050423ede6f2bf6a8d9df210be3b9726e1f733fa75e534ce7efb

    • Now choose one of the images, preferring the sha256 digest form if possible. Replace image: {{clients_image_url}} in the cronjob_clean_workload.yaml file with the chosen image, then run the kubectl apply line from the end of the clean_workload.sh script, as sketched below.
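
      For example, assuming the {{clients_image_url}} placeholder appears literally in the file, the substitution and apply could look like this (a sketch using the sha256 image from the example output above; the authoritative apply command is the one at the end of clean_workload.sh):

      $ sed -i 's|{{clients_image_url}}|ssp.napp.local/clustering/third-party/clients@sha256:6cbc50d76f0d050423ede6f2bf6a8d9df210be3b9726e1f733fa75e534ce7efb|' cronjob_clean_workload.yaml
      $ napp-k apply -f cronjob_clean_workload.yaml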

Attachments

workaroundForWorkloadDB(1).tar.gz