Recommendation jobs internally launch Spark driver and executor pods in a Kubernetes cluster. The driver pod runs out of memory when the job creates a large payload.
SSP 5.1.1
Recommendation jobs run on a cluster that launches Spark driver and executor pods. A driver pod is given 4G of memory by default. When the Group or Application on which the recommendation is run contains a VM with flows on a large number of ports, say 500+ (typical of scanner-type VMs), the recommendation job tries to create many services, rules, and possibly groups to publish to NSX Manager. This payload may be too large to collect in the driver, and the driver runs out of memory.
Log in to the SSPI CLI with sysadmin user credentials and run the commands below to verify the logs:
k -n nsxi-platform get pods | grep driver
This shows the recommendation job driver pod, named rec-xxxx-yyy-driver.
k -n nsxi-platform logs -f <rec-xxxx-yyy-driver pod from above>
The logs will show java.lang.OutOfMemoryError and the message "Dumping heap" to /var/log/cores/rec-driver.
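The checks above can be sketched as follows. A simulated `get pods` listing stands in for live cluster output so the filtering is visible; the pod name rec-xxxx-yyy-driver is the article's placeholder, not a real name.

```shell
# Simulated `get pods` output (placeholder names from this article)
pods='rec-xxxx-yyy-driver                               1/1  Running  0  5m
recommendation-744c6df49f-mvbs8                   2/2  Running  0  28h'
# Keep only the Spark driver pod (name ends in -driver)
driver=$(printf '%s\n' "$pods" | awk '/-driver/ {print $1; exit}')
echo "$driver"
# Against a live cluster the equivalent commands are:
#   k -n nsxi-platform get pods | grep driver
#   k -n nsxi-platform logs -f "$driver" | grep -E 'OutOfMemoryError|Dumping heap'
```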
Since the job is failing because one or more VMs communicate on an excessive number of ports, the steps below center on removing the VM from the boundary, reducing the duration of the flows to analyze, or scaling up the resources for the job.
1) Add scanner exclusions if a scanner in the environment could be spamming the application in the context boundary (this only works if the guardrail service is not blocking it and the Spark job is failing)
For details on how to add computes to the exclusion list, please see the public documentation: Managing the Suspicious Traffic Detector Definitions
OR
2) Reduce the time duration of the recommendation job
When a recommendation is started, you can select the duration of the flows to analyze. Set this to a smaller value than the one used when the recommendation was first run.
OR
3) Reduce the context boundary
If possible, remove the VM from the group or application before starting the recommendation.
OR
4) Increase the Spark job driver and/or executor memory
Log in to SSPI using sysadmin credentials and perform the following steps to change those values and restart the recommendation server:
k -n nsxi-platform edit cm recommendation-app
Change defaultProfileDriverMemory and scaleOutProfileDriverMemory from 4g to 6g.
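For illustration, the edit made inside the ConfigMap looks like the sketch below. It assumes the two memory settings appear as simple "key: value" lines inside recommendation-app; the real ConfigMap layout may differ, so make the change interactively with `k edit` as described above.

```shell
# Assumed "key: value" layout of the relevant ConfigMap entries (may differ)
before='defaultProfileDriverMemory: 4g
scaleOutProfileDriverMemory: 4g'
# The change: bump both driver memory settings from 4g to 6g
after=$(printf '%s\n' "$before" | sed 's/: 4g/: 6g/')
printf '%s\n' "$after"
```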
Next, find the recommendation API server pod:
k -n nsxi-platform get pods | grep recommendation
sample output for reference:
recommendation-744c6df49f-mvbs8 2/2 Running 0 28h
recommendation-clean-up-cronjob-29514255-9x2m7 0/1 Completed 0 51s
recommendation-continuous-monitoring-cronjob-29514255-n7rqg 0/1 Completed 0 51s
recommendation-monitor-failed-spark-job-cronjob-29514240-b8b8z 0/1 Completed 0 15m
k -n nsxi-platform delete pod recommendation-744c6df49f-mvbs8 (pod name taken from the sample output above)
A new recommendation API server pod will start, and the next time you start a recommendation, the job will run with more resources.
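Picking the API server pod out of the listing can also be sketched programmatically. The cronjob pods share the "recommendation-" prefix, so they must be filtered out; the pod names below come from this article's sample output.

```shell
# Simulated `get pods | grep recommendation` output from the sample above
pods='recommendation-744c6df49f-mvbs8                              2/2  Running    0  28h
recommendation-clean-up-cronjob-29514255-9x2m7               0/1  Completed  0  51s
recommendation-continuous-monitoring-cronjob-29514255-n7rqg  0/1  Completed  0  51s'
# The API server pod is the recommendation-* pod that is not a cronjob pod
api_pod=$(printf '%s\n' "$pods" | awk '/^recommendation-/ && $1 !~ /cronjob/ {print $1; exit}')
echo "$api_pod"
# Then restart it so the new memory settings take effect:
#   k -n nsxi-platform delete pod "$api_pod"
```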
In this case, adding a scanner exclusion and increasing the driver memory resolved the issue.