NSX Recommendation is stuck in waiting state

Article ID: 324171

Updated On: 04-06-2023

Products

VMware NSX

Issue/Introduction

Symptoms:
  • NSX Intelligence 4.0.1.1 and 4.1.0 running on NAPP
  • Recommendation generation is stuck in waiting state
  • Executor pods for the infraclassifier show a status of "Pending" with an age on the order of hours or days. For example:
root@systest-runner:~[501]# kubectl get pod -n nsxi-platform | grep -i infra
infraclassifier-875709862a83e8e4-exec-1 2/2 Running 0 24h
infraclassifier-875709862a83e8e4-exec-2 0/2 Pending 0 24h
infraclassifier-875709862a83e8e4-exec-3 0/2 Pending 0 24h
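
To check for this symptom quickly, the pod listing can be filtered down to the stuck executors. The following is a minimal sketch, assuming the default `kubectl get pod` column layout (NAME, READY, STATUS, RESTARTS, AGE) and executor names of the form shown above. The helper name `pending_ic_executors` is hypothetical and not part of any NSX tooling.

```shell
# Print the name and age of every infraclassifier executor pod that is Pending.
# Reads `kubectl get pod` output on stdin.
# Assumed columns: NAME READY STATUS RESTARTS AGE.
pending_ic_executors() {
  awk '$1 ~ /^infraclassifier-.*-exec-[0-9]+$/ && $3 == "Pending" { print $1, $5 }'
}

# Usage (requires access to the cluster):
#   kubectl get pod -n nsxi-platform | pending_ic_executors
```

If this prints executors whose age is many hours or days, the environment likely matches the symptom described here.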


Environment

VMware NSX 4.0.0.1

Cause

When an on-prem NSXi Platform setup does not have sufficient memory or CPU to run all four infraclassifier (IC) executors plus the driver (which, at scale, used to request approximately 48G of memory across the available workers), the driver usually has adequate memory to be scheduled, but the executors are spawned only after the driver is up and running. If no resources remain for the executors once the driver has been scheduled, the executors show a state of "Pending". The driver then loops, waiting for executor resources to become available, and the pending executors sit ahead of other pods awaiting scheduling, including Rec and Feature Service.
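
One way to confirm that a Pending executor is waiting on resources is to look at its scheduler events. The sketch below assumes the Kubernetes scheduler has recorded `FailedScheduling` events for the pod (typically with a message such as "Insufficient memory" or "Insufficient cpu"; the exact wording varies by Kubernetes version). The helper name `scheduling_failures` is hypothetical.

```shell
# Keep only the scheduler failure events from `kubectl describe pod` output on stdin.
# Returns success even when no such events are present.
scheduling_failures() {
  grep 'FailedScheduling' || true
}

# Usage (requires access to the cluster; substitute a Pending executor pod name):
#   kubectl describe pod infraclassifier-875709862a83e8e4-exec-2 -n nsxi-platform | scheduling_failures
```

An empty result does not rule out this cause, because events expire after a while and a pod that has been Pending for days may no longer show them.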

Resolution

This is a known issue. There is currently no resolution.

Workaround:
The workaround is either to scale the cluster up significantly or to suspend scheduling of the IC sparkapp with these commands:

  kubectl edit scheduledsparkapp spark-app-infra-classifier-pyspark -n nsxi-platform

  # insert this line, just above spec.schedule
  suspend: "true"

Once scheduling of the IC jobs is suspended, the workload database keeps the classifications it already has but receives no new updates. You can still specify your own classifications for workloads, but no automated inferences are made until there is adequate memory to run the IC job AND the `suspend: "true"` line is changed to `suspend: "false"` (or removed entirely).
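
After editing, it can be useful to confirm which state the schedule is in. The following is a minimal sketch that inspects the resource's YAML. It assumes `kubectl get ... -o yaml` output with two-space indentation and a top-level `spec:` key, and it accepts the value with or without quotes. The helper name `ic_schedule_state` is hypothetical.

```shell
# Report whether a ScheduledSparkApplication manifest (YAML on stdin)
# has suspend set to true directly under spec.
# Prints "suspended" or "active".
ic_schedule_state() {
  awk '
    /^spec:/ { in_spec = 1; next }      # entering the top-level spec section
    /^[^ ]/  { in_spec = 0 }            # any other top-level key ends it
    in_spec && /^  suspend:/ {
      v = $2
      gsub(/"/, "", v)                  # tolerate a quoted value
      if (v == "true") found = 1
    }
    END { print (found ? "suspended" : "active") }
  '
}

# Usage (requires access to the cluster):
#   kubectl get scheduledsparkapp spark-app-infra-classifier-pyspark -n nsxi-platform -o yaml | ic_schedule_state
```

The same check can be repeated later, once memory has been added and the line has been changed back or removed, to confirm the schedule is active again.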

Delete the IC driver pod and sparkapp for the current run using these commands (the numeric suffix in the names is an example and will differ in your environment):

kubectl delete pod -n nsxi-platform spark-app-infra-classifier-pyspark-1675750518779665823-driver

kubectl get sparkapp -n nsxi-platform | grep infra

# find the entry that looks like `spark-app-infra-classifier-pyspark-XXXXXXXXX`

kubectl delete sparkapp spark-app-infra-classifier-pyspark-XXXXXXXXX -n nsxi-platform
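
Finding the sparkapp entry can be sketched as a small filter over the listing. This assumes the resource name is the first column of `kubectl get sparkapp` output and that each run's name ends in a numeric suffix, as the driver pod name above suggests. The helper name `ic_sparkapp_runs` is hypothetical.

```shell
# Print the names of infra-classifier sparkapp runs.
# Reads `kubectl get sparkapp` output on stdin; the name is taken from column 1.
# Assumption: run names end in a purely numeric suffix.
ic_sparkapp_runs() {
  awk '$1 ~ /^spark-app-infra-classifier-pyspark-[0-9]+$/ { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get sparkapp -n nsxi-platform | ic_sparkapp_runs
```

The filter only lists names. Review the output and pass the entry for the current run to the `kubectl delete sparkapp` command above.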

Additional Information

Impact/Risks:
No NSX Intelligence recommendations can be generated successfully.