Recommendation jobs internally launch Spark driver and executor pods in a Kubernetes cluster. The driver pod runs out of memory when the job creates a large payload.
SSP 5.1.1
Recommendation jobs run on a cluster that launches Spark driver and executor pods. A driver pod is given 4G of memory by default. When the Group or Application on which the recommendation is run contains a VM with flows on a large number of ports, say 500+ (typical of scanner-type VMs), the recommendation job tries to create many services, rules, and possibly groups to publish to NSX Manager. This payload may be too large to collect in the driver, and the driver runs out of memory.
Log in to the SSPI CLI with sysadmin user credentials and run the commands below to verify the logs:
k -n nsxi-platform get pods | grep driver
This shows the recommendation job driver pod, named rec-xxxx-yyy-driver.
k -n nsxi-platform logs -f <rec-xxxx-yyy-driver pod from above>
The logs will show java.lang.OutOfMemoryError and the message "Dumping heap" to /var/log/cores/rec-driver.
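The checks above can be sketched as follows. A simulated `get pods` listing stands in for live cluster output so the filtering is visible; the pod name rec-xxxx-yyy-driver is the article's placeholder, not a real name.

```shell
# Simulated `get pods` output (placeholder names from this article)
pods='rec-xxxx-yyy-driver                               1/1  Running  0  5m
recommendation-744c6df49f-mvbs8                   2/2  Running  0  28h'
# Keep only the Spark driver pod (name ends in -driver)
driver=$(printf '%s\n' "$pods" | awk '/-driver/ {print $1; exit}')
echo "$driver"
# Against a live cluster the equivalent commands are:
#   k -n nsxi-platform get pods | grep driver
#   k -n nsxi-platform logs -f "$driver" | grep -E 'OutOfMemoryError|Dumping heap'
```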
Since the job is failing because one or more VMs communicate on an excessive number of ports, the steps below center on removing the VM from the boundary, reducing the duration of the flows to analyze, or scaling up the resources for the job.
1) Add scanner exclusions if a scanner in the environment could be spamming the application in the context boundary (this only works if the guardrail service is not blocking it and the Spark job is failing)
For details on how to add computes to the exclusion list, please see the public documentation: Managing the Suspicious Traffic Detector Definitions
OR
2) Reduce the time duration of the recommendation job
When a recommendation is started, you can select the duration of the flows to analyze. Set this to a smaller value than the one used when the recommendation was first run.
OR
3) Reduce the context boundary
If possible, remove the VM from the group or application before starting the recommendation.
OR
4) Increase the Spark job driver and/or executor memory
Log in to SSPI using sysadmin credentials and perform the following steps to change those values and restart the recommendation server:
k -n nsxi-platform edit cm recommendation-app
Change defaultProfileDriverMemory and scaleOutProfileDriverMemory from 4g to 6g.
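For illustration, the edit made inside the ConfigMap looks like the sketch below. It assumes the two memory settings appear as simple "key: value" lines inside recommendation-app; the real ConfigMap layout may differ, so make the change interactively with `k edit` as described above.

```shell
# Assumed "key: value" layout of the relevant ConfigMap entries (may differ)
before='defaultProfileDriverMemory: 4g
scaleOutProfileDriverMemory: 4g'
# The change: bump both driver memory settings from 4g to 6g
after=$(printf '%s\n' "$before" | sed 's/: 4g/: 6g/')
printf '%s\n' "$after"
```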
Next, find the recommendation API server pod:
k -n nsxi-platform get pods | grep recommendation
sample output for reference:
recommendation-744c6df49f-mvbs8 2/2 Running 0 28h
recommendation-clean-up-cronjob-29514255-9x2m7 0/1 Completed 0 51s
recommendation-continuous-monitoring-cronjob-29514255-n7rqg 0/1 Completed 0 51s
recommendation-monitor-failed-spark-job-cronjob-29514240-b8b8z 0/1 Completed 0 15m
k -n nsxi-platform delete pod recommendation-744c6df49f-mvbs8 (pod name taken from the sample output above)
A new recommendation API server pod will start, and the next time you start a recommendation, the job will run with more resources.
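Picking the API server pod out of the listing can also be sketched programmatically. The cronjob pods share the "recommendation-" prefix, so they must be filtered out; the pod names below come from this article's sample output.

```shell
# Simulated `get pods | grep recommendation` output from the sample above
pods='recommendation-744c6df49f-mvbs8                              2/2  Running    0  28h
recommendation-clean-up-cronjob-29514255-9x2m7               0/1  Completed  0  51s
recommendation-continuous-monitoring-cronjob-29514255-n7rqg  0/1  Completed  0  51s'
# The API server pod is the recommendation-* pod that is not a cronjob pod
api_pod=$(printf '%s\n' "$pods" | awk '/^recommendation-/ && $1 !~ /cronjob/ {print $1; exit}')
echo "$api_pod"
# Then restart it so the new memory settings take effect:
#   k -n nsxi-platform delete pod "$api_pod"
```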
In this case, adding a scanner exclusion and increasing the driver memory resolved the issue.