NSX Intelligence - New Recommendation jobs will not start because Infra Classifier jobs are stuck in "Pending" state
Article ID: 317740
Products
VMware NSX
Issue/Introduction
Symptoms:
On NSX Intelligence 4.0.1, Infra Classifier (IC) executor pods remain in the "Pending" state even though they were spawned hours or days ago. An example of this state is shown below:

root@systest-runner:~[501]# kubectl get pod -n nsxi-platform | grep -i infra
infraclassifier-875709862a83e8e4-exec-1                          2/2   Running     0   24h
infraclassifier-875709862a83e8e4-exec-2                          0/2   Pending     0   24h
infraclassifier-875709862a83e8e4-exec-3                          0/2   Pending     0   24h
infraclassifier-pod-cleaner-27929190-stpph                       0/1   Completed   0   2m19s
spark-app-infra-classifier-pyspark-1675750518779665823-driver   2/2   Running     0   24h

Describing any of the pending executor pods shows that they cannot be scheduled because of memory pressure:

root@systest-runner:~[501]# kubectl describe pod -n nsxi-platform infraclassifier-875709862a83e8e4-exec-2
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ---                  ----               -------
  Warning  FailedScheduling  47s (x22 over 17m)   default-scheduler  0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 Insufficient memory.

This situation can be hit when there are just enough resources (~8 GB) to schedule the driver as requested, but the additional memory requested by the executors then exceeds what is available across the worker nodes.
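To confirm that the cluster is short on allocatable memory, compare the memory already requested on each worker node against its allocatable capacity. These are generic Kubernetes checks rather than steps from this article, and kubectl top requires a metrics server to be present:

# Requested vs. allocatable resources per node
kubectl describe nodes | grep -A 8 "Allocated resources"

# Current usage per node (requires metrics-server)
kubectl top nodes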
Cause
This happens on an on-premises NSX Intelligence Platform setup when there is insufficient memory or CPU to run all four IC executors and the driver. The driver may get adequate memory to be scheduled, but the executors are spawned only after the driver comes up and is running. If no resources are left for the executors once the driver has been scheduled, the executors remain in the "Pending" state. The driver then loops, waiting for executor resources to become available, and these pending executors sit ahead of other pods that are also awaiting scheduling, including the Recommendation and Feature Services pods.
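To see how much memory the driver and each executor actually request from the scheduler, the resource requests can be read directly from the pods. This is a generic Kubernetes query; the pod names below are the examples from the Symptoms output and will differ per deployment:

# Memory requested by the IC driver containers
kubectl get pod -n nsxi-platform spark-app-infra-classifier-pyspark-1675750518779665823-driver -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.requests.memory}{"\n"}{end}'

# Memory requested by a pending executor
kubectl get pod -n nsxi-platform infraclassifier-875709862a83e8e4-exec-2 -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.requests.memory}{"\n"}{end}'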
Resolution
This issue is resolved in VMware NSX-T 4.1.1 (build number 21761692).
The performance of the classifier has been improved so that it does not need as much memory, and a built-in memory capacity check now runs before the driver spawns its executors. Until then, the options are to scale the cluster up significantly or to suspend the scheduling of the IC sparkapp via:

kubectl edit scheduledsparkapp spark-app-infra-classifier-pyspark -n nsxi-platform
# insert this line, just above spec.schedule
suspend: true

Suspending the scheduling of IC jobs means that the workload database keeps any classifications it currently has, but receives no new updates. Users will still be able to specify their own classifications for workloads, but no automated inferences will be made until there is adequate memory to run the IC job AND the `suspend: true` line is changed to `suspend: false` (or removed entirely).
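As an alternative to editing the resource interactively, the same field can be set non-interactively with a merge patch. This is a generic Kubernetes technique rather than a step from this article, and it assumes the scheduledsparkapp resource accepts the spec.suspend field exactly as described above:

kubectl patch scheduledsparkapp spark-app-infra-classifier-pyspark -n nsxi-platform --type merge -p '{"spec":{"suspend":true}}'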
Workaround:

Step 1. Get the name of the sparkapp driver pod:
kubectl -n nsxi-platform get pods | grep spark-app-infra-classifier-pyspark

Step 2. Delete the IC driver pod:
kubectl delete pod -n nsxi-platform spark-app-infra-classifier-pyspark-XXXXXXXXXXX-driver

Step 3. (May not be necessary) Delete the sparkapp for the current run:
kubectl get sparkapp -n nsxi-platform | grep infra
Find the entry that looks like "spark-app-infra-classifier-pyspark-XXXXXXXXX" and delete it:
kubectl delete sparkapp spark-app-infra-classifier-pyspark-XXXXXXXXXXX -n nsxi-platform
Note: Sometimes cleaning up the driver alone is enough, but it is important to run the kubectl get sparkapp command and delete any other dangling IC sparkapps, if there are any.

Step 4. Suspend the scheduling of the IC sparkapp via:
kubectl edit scheduledsparkapp spark-app-infra-classifier-pyspark -n nsxi-platform
# insert this line, just above spec.schedule
suspend: true

If significant memory is added to the cluster via scale up/out, IC may be re-enabled.
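For convenience, the steps above can be collected into a small shell sketch. This is illustrative only and assumes standard kubectl behavior; resource names are discovered at run time, and the final step uses a merge patch to set spec.suspend, equivalent to the manual edit in Step 4. Review each command before running it:

#!/bin/bash
# Illustrative sketch of the workaround steps above; not an official script.
NS=nsxi-platform

# Step 1: find the IC driver pod
DRIVER=$(kubectl -n "$NS" get pods -o name | grep spark-app-infra-classifier-pyspark | grep driver)

# Step 2: delete the IC driver pod, if one exists
[ -n "$DRIVER" ] && kubectl -n "$NS" delete "$DRIVER"

# Step 3: delete any dangling IC sparkapps
for APP in $(kubectl -n "$NS" get sparkapp -o name | grep infra-classifier); do
  kubectl -n "$NS" delete "$APP"
done

# Step 4: suspend future IC runs (sets spec.suspend, as described above)
kubectl -n "$NS" patch scheduledsparkapp spark-app-infra-classifier-pyspark --type merge -p '{"spec":{"suspend":true}}'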
Additional Information
Impact/Risks: New Recommendation jobs will not start because the IC jobs are stuck, and spark-operator is waiting for them to complete.
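To confirm the impact, check whether any Recommendation spark applications or pods are waiting in the namespace. The grep pattern below is an assumption about the job naming and may need adjusting for a specific deployment:

kubectl get sparkapp -n nsxi-platform
kubectl get pods -n nsxi-platform | grep -i recommendation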