NSX Intelligence 4.0.1.1 and 4.1.0 running on NAPP
Recommendation generation is stuck in the waiting state
Executors for the infraclassifier show as "Pending", and their age is very long (on the order of hours or days). For example:
root@systest-runner:~[501]# kubectl get pod -n nsxi-platform | grep -i infra
infraclassifier-875709862a83e8e4-exec-1   2/2   Running   0   24h
infraclassifier-875709862a83e8e4-exec-2   0/2   Pending   0   24h
infraclassifier-875709862a83e8e4-exec-3   0/2   Pending   0   24h
Environment
VMware NSX 4.0.0.1
Cause
When an on-prem NSXi Platform setup does not have sufficient memory or CPU to run all 4 IC executors plus the driver (which, at scale, requests approximately 48 GB of memory across the available workers), the driver will usually have adequate memory to be scheduled, but executors are only spawned after the driver is up and running. If no executor resources are available once the driver has been scheduled, the executors' state will show as "Pending". The driver will loop, waiting for executor resources to become available, and these pending executors will sit in front of other pods awaiting scheduling, including Rec and Feature Service.
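To confirm a resource shortfall, inspect the scheduling events of one of the pending executors; the pod name here is taken from the example output above and will differ per environment. The events usually show a FailedScheduling reason such as insufficient memory or CPU:
kubectl describe pod -n nsxi-platform infraclassifier-875709862a83e8e4-exec-2
# look for FailedScheduling events such as "Insufficient memory" or "Insufficient cpu" at the end of the output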
Resolution
This is a known issue; currently there is no resolution.
Workaround: Scale the cluster up significantly, or suspend scheduling of the IC sparkapp by editing its schedule definition as follows:
# insert this line, just above spec.schedule
suspend: "true"
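A minimal sketch of that edit, assuming the IC schedule is exposed as a ScheduledSparkApplication resource in the nsxi-platform namespace (the resource type and the <ic-schedule-name> placeholder are assumptions; verify both in your environment):
kubectl get scheduledsparkapplication -n nsxi-platform   # assumed resource type; locate the infraclassifier entry
kubectl edit scheduledsparkapplication -n nsxi-platform <ic-schedule-name>
# in the editor, add the suspend line under spec:, directly above the schedule: field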
The result of suspending scheduling of the IC jobs is that the workload database will keep any classifications it currently has, but will not receive any new updates. It will still be possible to specify your own classifications for workloads, but no automated inferences will be made until there is adequate memory to run the IC job AND the `suspend: "true"` line is changed to `suspend: "false"` (or removed entirely).
Delete the IC driver pod and sparkapp for the current run using these commands:
kubectl delete pod -n nsxi-platform spark-app-infra-classifier-pyspark-1675750518779665823-driver
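Note that the timestamp in the driver pod name above is specific to one run and will differ in your environment; the current name can be listed first:
kubectl get pod -n nsxi-platform | grep infra-classifier
# the driver pod is the entry ending in "-driver"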
kubectl get sparkapp -n nsxi-platform | grep infra
# find the entry that looks like `spark-app-infra-classifier-pyspark-XXXXXXXXX`
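With the entry found above, the sparkapp deletion would look like the following (a sketch; substitute the actual name for the XXXXXXXXX placeholder):
kubectl delete sparkapp -n nsxi-platform spark-app-infra-classifier-pyspark-XXXXXXXXX
# use the full name returned by the previous command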