A started recommendation in Security Intelligence fails to complete and reach the "Ready To Publish" state when there are large numbers of Groups, Virtual Machines, and Services in the NSX Inventory. It stays in the "Queued For Discovery" state for more than 60 minutes or shows "Failed".
On large data center deployments where there are
a) many NSX Groups with a lot of Virtual Machine members, or
b) many Virtual Machines, or
c) many NSX Services,
the recommendation job sometimes fails to complete.
This has been observed when the scale reaches about 7K Groups (some with 5,000 Virtual Machine members), 15K total Virtual Machines, and/or 100K Services, with about 450 million total flows in the database. An inventory and flow volume of this size is outside the limits of the recommendation job's default configuration. However, with a few updates to the default recommendation job configuration (as described below), recommendations can run successfully to completion.
Symptom: Recommendations are slow to run and stay in "Queued for Discovery" for a long time. They may also show "FAILED", and the status message tooltip next to the status may show log messages and exceptions.
Log:
1) napp-k get pods will show output like the example below, with the recommendation job driver in error and/or recommendation job executors dying and new executors getting spawned. Below, exec-1, exec-2, and exec-3 have been killed and replaced by exec-4, exec-5, and exec-6, which are still running.
root@systest-runner:~[508]# napp-k get pods | grep rec-*
rec-2c7221b9ae630d1d18a0fd646a9eb739-driver 1/1 Running 0 18m
rec-2c7221b9ae630d1d18a0fd646a9eb739-exec-4 2/2 Running 0 5m18s
rec-2c7221b9ae630d1d18a0fd646a9eb739-exec-5 2/2 Running 0 3m2s
rec-2c7221b9ae630d1d18a0fd646a9eb739-exec-6 2/2 Running 0 2m6s
2) The recommendation job driver logs will show errors like the ones below, indicating that the recommendation Spark jobs are failing (inspection and filtering commands are sketched after the log excerpt).
napp-k logs -f rec-<some uuid>-driver will show errors like:
Caused by: org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 13), which maintains the block data to fetch is dead.
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:140)
2024-05-16T10:05:31.343ZGMT WARN dispatcher-CoarseGrainedScheduler TaskSetManager - Lost task 27.0 in stage 213.1 (TID 2627) (192.168.12.59 executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason:
The executor with id 1 exited with exit code 137(SIGKILL, possible container OOM).
The API gave the following brief reason: Evicted
The API gave the following message: Pod ephemeral local storage usage exceeds the total limit of containers 2Gi.
The API gave the following container statuses:
container name: executor-corecollector
container image: sha256:037756530c45151b16aac7925ca284b4c5cc14686a88e36eb6701a86a503cc56
container state: terminated
container started at: 2024-05-16T09:49:48Z
container finished at: 2024-05-16T10:06:11Z
exit code: 137
termination reason: Error
container name: spark-kubernetes-executor
container image: nsx-intelligence-ob-docker-local.artifactory.eng.vmware.com/clustering/recommendation-spark-job@sha256:ff49df4ccb8a615c950bae1dcfa864ab27fb9eb8d1a90162c4f536d18297edee
container state: waiting
pending reason: ContainerCreating
2024-05-16T10:05:31.343ZGMT WARN dispatcher-CoarseGrainedScheduler TaskSetManager - Lost task 30.0 in stage 213.1 (TID 2630) (192.168.12.59 executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason:
The executor with id 1 exited with exit code 137(SIGKILL, possible container OOM).
The API gave the following brief reason: Evicted
The API gave the following message: Pod ephemeral local storage usage exceeds the total limit of containers 2Gi.
The API gave the following container statuses:
container name: executor-corecollector
container image: sha256:037756530c45151b16aac7925ca284b4c5cc14686a88e36eb6701a86a503cc56
container state: terminated
container started at: 2024-05-16T09:49:48Z
container finished at: 2024-05-16T10:06:11Z
exit code: 137
termination reason: Error
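To confirm why executors were killed and to spot these failures without reading the full logs, the pods and driver log can be inspected with the commands below. This is a minimal sketch: the pod names are placeholders for your actual rec-<some uuid> pods, napp-k is assumed to accept standard kubectl arguments, and an evicted executor pod can only be described while it is still listed.
# Show the termination reason and exit code of a killed executor (placeholder pod name)
napp-k describe pod rec-<some uuid>-exec-1 | grep -A 6 "Last State"
# List recent eviction/OOM events in the namespace
napp-k get events --sort-by=.metadata.creationTimestamp | grep -Ei "evicted|oomkill|killing"
# Filter the driver log for the executor-loss and eviction messages shown above
napp-k logs rec-<some uuid>-driver | grep -E "ExecutorDeadException|ExecutorLostFailure|exit code 137|Evicted|ephemeral local storage"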
NSX-T 4.2.0
The main reason for this recommendation job failure is that the compute resources provided to the recommendation job to process the flows are insufficient. This can be fixed by a) reducing the size of the job to be processed for the recommendation and/or b) giving more compute resources to the recommendation job.
a) Reducing the size of the job to be processed for the recommendation
Retry the recommendation after trying any or all of the following from the recommendation UI, based on your use case:
1) Shorten the analysis time interval.
2) Use port filters to exclude certain ports and port ranges from the settings in the Start/Rerun Recommendation dialog. For example, if there are many flows on many individual ports in the range 40000-50000, excluding flows in that port range reduces the size of the job.
3) Create Services beforehand with wider port ranges and remove unnecessary Services. For example, if there are many flows on many individual ports in the range 40000-50000, creating a single Service with the port range 40000-50000 beforehand in the Inventory lets the recommendation job reuse that Service instead of creating a large number of individual Services (see the sketch after this list).
4) If applicable, reduce the number of computes or Groups for which you are generating the recommendation.
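For item 3, a Service with a wide port range can be created from Inventory > Services in the UI, or via the NSX Policy API. Below is a minimal sketch of the API call; the Service ID, display name, protocol, and port range are placeholders to adapt to your environment, and admin credentials are assumed.
# Create (or update) a single Service covering the whole port range, so the recommendation reuses it
curl -k -u 'admin:<password>' -X PATCH \
  'https://<nsx-manager>/policy/api/v1/infra/services/bulk-ports-40000-50000' \
  -H 'Content-Type: application/json' \
  -d '{
        "display_name": "bulk-ports-40000-50000",
        "service_entries": [
          {
            "resource_type": "L4PortSetServiceEntry",
            "id": "tcp-40000-50000",
            "l4_protocol": "TCP",
            "destination_ports": ["40000-50000"]
          }
        ]
      }'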
b) Giving more compute resources to the recommendation job
To complete the recommendation, the recommendation job has to be given more compute resources. To do this, perform the following steps.
1) napp-k edit configmap recommendation-app
2) Reduce flowLimitPerAnalyticsNode from the default of 5 million to 3 million or even smaller.
3) Save the configmap and quit the editor (use :wq if using the vi editor).
4) Delete the recommendation API server pod with the command below; a new recommendation API server will start and read the changed configmap (a scripted alternative to steps 1-4 is sketched after these steps).
napp-k delete pod recommendation-xxxxxx-yyyy
5) Re-run the recommendation from the UI. The changed configmap takes effect, and more Spark executors (compute resources) are launched to finish the job, typically within 20-35 minutes. For example, assume that processing this recommendation requires analyzing 14 million flows. With the default configuration, 3 executors would have been launched (14 million / 5 million, rounded up = 3); with the new setting, 5 executors are launched (14 million / 3 million, rounded up = 5). Because the job now has more compute resources, it is more likely to complete.
6) If the above does not run the recommendation to completion, reduce flowLimitPerAnalyticsNode further (to, say, 2 million) and retry.
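As a scripted alternative to steps 1-4, the same change can be made non-interactively. This is a minimal sketch: it assumes napp-k accepts standard kubectl arguments and that flowLimitPerAnalyticsNode is a top-level key under the configmap's data section; verify the key's actual location with the first command and adjust the patch accordingly.
# Check the current value and where it lives in the configmap
napp-k get configmap recommendation-app -o yaml | grep -i flowLimitPerAnalyticsNode
# Lower the per-node flow limit from 5 million to 3 million (assumes a top-level data key)
napp-k patch configmap recommendation-app --type merge -p '{"data":{"flowLimitPerAnalyticsNode":"3000000"}}'
# Restart the recommendation API server so it re-reads the configmap (replace with the actual pod name)
napp-k delete pod recommendation-xxxxxx-yyyy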
Note: Once more compute resources are requested, you may hit the NAPP platform's own capacity limits, and the start recommendation request may be rejected or fail immediately; the status message tooltip in the UI will indicate that capacity was exceeded. In that case, add more nodes to your TKG cluster or upstream Kubernetes cluster to give the NAPP platform more cluster capacity on which the larger recommendation jobs can run. After adding nodes, rerun the failing recommendations to see if they run to completion. If they still cannot, a scale-out of the NAPP platform may also be needed (https://docs.vmware.com/en/VMware-NSX/4.1/nsx-application-platform/GUID-8CC7E83F-C59F-4B61-9F4D-F0151ACACD96.html?hWord=N4IghgNiBcIHYHsAmBTAziAvkA).
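To see whether the cluster is at capacity before or after adding nodes, the node-level resource picture can be checked with standard Kubernetes commands. This is a minimal sketch and assumes kubeconfig access to the TKG or upstream Kubernetes cluster; kubectl top requires a metrics server to be running.
# List worker nodes and their status
kubectl get nodes
# Compare requested vs. allocatable resources per node
kubectl describe nodes | grep -A 8 "Allocated resources"
# Show live CPU/memory usage per node (requires metrics-server)
kubectl top nodes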
The above resolution provides the workaround steps. No other workaround is necessary.