You are running NSX Application Platform (NAPP) and cannot select virtual machines in the Start Recommendations dialog because of a 503 Service Unavailable exception.
You may also see an alarm notification.
NAPP 4.2 and prior
The workload service serves the APIs that return the virtual machines classified as infrastructure virtual machines. These classifications are stored in the workload database by the Infra Classifier service, which processes the flows received from the hosts to determine whether the source or destination is running an infrastructure service. If the flows contain a large number of private IPs that do not belong to any managed NSX entity, many of these IPs can also be written into the workload database. Over time the database can grow large enough to cause an OutOfMemory exception in the workload service when it attempts to retrieve the classified virtual machines.
Check whether the workload pod is currently experiencing OutOfMemory exceptions by running the following commands on the NSX manager as the root user:
% napp-k get pods | grep workload
% napp-k logs workload-<12345678>
(replace the string in <> with the pod name seen in the setup)
...
2024-09-23T00:00:32.39322057Z stderr F Exception in thread \"https-jsse-nio-7669-Poller\" java.lang.OutOfMemoryError: Java heap space"
...
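Pod logs can be long, so a small filter can make the check above quicker. The following is a minimal sketch, not part of the original procedure: count_oom is a hypothetical helper name, and it assumes the error appears in the log with the text java.lang.OutOfMemoryError, as in the sample line above.

```shell
# Count log lines that report a Java OutOfMemoryError.
# Reads log text on stdin and prints the number of matching lines.
# count_oom is a hypothetical helper name, not part of NAPP.
count_oom() {
  # grep -c exits non-zero when nothing matches, so mask that status.
  grep -c 'java.lang.OutOfMemoryError' || true
}

# Usage on the NSX manager (the pod name is a placeholder, as above):
#   napp-k logs workload-<12345678> | count_oom
```

A result greater than 0 suggests the workload pod has hit the out-of-memory condition described in this article.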
Use these steps if you want to preserve the functionality of the Infra Classifier service. This service classifies workloads as infrastructure machines so that they can be excluded from consideration in recommendation analysis. If you frequently activate this option when using recommendations, follow the steps here to fix the Infra Classifier service and prevent the out-of-memory condition.
If you keep the default setting, in which excluding infrastructure workloads is deactivated, you can instead follow the steps in this KB to work around the issue by disabling the Infra Classifier service: https://knowledge.broadcom.com/external/article?articleNumber=378823
Create a background job that runs once a day (at 4:00 am) to unconditionally remove any unmanaged internal computes, e.g. IP-123.45.67.8, from the workload DB. The job also removes any VMs deleted from inventory, which could otherwise add load to the workload service.
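To illustrate which entries count as unmanaged internal computes, the sketch below filters a list of names down to those that look like the example above. It assumes, based only on that single example, that such entries are named IP- followed by a dotted IPv4 address. The function name is hypothetical, and this is not how the attached job is known to select rows.

```shell
# Print only names shaped like "IP-<IPv4 address>", e.g. IP-123.45.67.8.
# Reads one name per line on stdin.
# Assumption: unmanaged computes follow this naming pattern.
filter_unmanaged_computes() {
  # grep exits non-zero when nothing matches, so mask that status.
  grep -E '^IP-([0-9]{1,3}\.){3}[0-9]{1,3}$' || true
}

# Example:
#   printf 'IP-123.45.67.8\nweb-vm-01\n' | filter_unmanaged_computes
```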
Note: Run the steps below on the NSX manager with root access. If you are unable to run them on the manager, you can run them on any system that has access to the kubeconfig file for the cluster.
Download the attachment workaroundForWorkloadDB.tar.gz and do the following to apply the patch:
Verify the integrity of the file
|
Extract the tar
|
Verify the owner is root:
|
Locate your kubeconfig file
|
Run the script. If there are issues interpreting the bash shebang due to PuTTY or Windows hosts, try the alternative command
|
Upon successful execution, your terminal should print something like the following
|
Once the background periodic job is created, you may need to wait up to 24 hours for the cronjob to run. Refer to step 2 to clean the DB immediately.
Optional: To address the issue immediately without waiting for the cronjob to trigger, use the following command to clean up the entries:
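The verification, extraction, and ownership checks in step 1 can be sketched with generic shell helpers. This is an illustration only and does not replace the exact commands published with this article. It assumes a SHA-256 checksum (the article's actual checksum algorithm and value are not shown here), and the helper names are hypothetical.

```shell
# Succeed only if the file's SHA-256 digest equals the expected value.
# Assumption: the published checksum is SHA-256.
verify_sha256() {
  [ "$(sha256sum "$1" | awk '{print $1}')" = "$2" ]
}

# Extract a gzip-compressed tar archive ($1) into a directory ($2).
extract_archive() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Print the owning user of a path, taken from the third column of ls -ld.
owner_of() {
  ls -ld "$1" | awk '{print $3}'
}

# Example (the checksum value is a placeholder, not the real one):
#   verify_sha256 workaroundForWorkloadDB.tar.gz "<expected-sha256>" \
#     && extract_archive workaroundForWorkloadDB.tar.gz ./workaround \
#     && owner_of ./workaround
```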
|
After the cronjob from step 1 has been applied and has had a chance to run (or immediately after running step 2), restart the workload pod:
|
|
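After the restart, it can help to confirm that the workload pod has returned to the Running state. The sketch below is not part of the original procedure: workload_pod_status is a hypothetical helper name, and it assumes napp-k get pods prints the default kubectl pod table (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print "NAME STATUS" for every pod whose name starts with "workload".
# Expects the default pod table on stdin:
#   NAME  READY  STATUS  RESTARTS  AGE
workload_pod_status() {
  awk '$1 ~ /^workload/ { print $1, $3 }'
}

# Usage on the NSX manager:
#   napp-k get pods | workload_pod_status
```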
Replace the placeholder ending in clients_image_url}} in the cronjob_clean_workload.yaml file with the chosen image above, then run the kubectl apply line from the end of the clean_workload.sh script.