NSX Intelligence - New Recommendation jobs will not start because Infra Classifier jobs are stuck in "Pending" state
Article ID: 317740
Products
VMware NSX
Issue/Introduction
Symptoms:
On NSX Intelligence 4.0.1, Infra Classifier (IC) executor pods remain in the "Pending" state even though they were spawned hours or days ago. An example of this state is shown below:

root@systest-runner:~[501]# kubectl get pod -n nsxi-platform | grep -i infra
infraclassifier-875709862a83e8e4-exec-1                          2/2   Running     0   24h
infraclassifier-875709862a83e8e4-exec-2                          0/2   Pending     0   24h
infraclassifier-875709862a83e8e4-exec-3                          0/2   Pending     0   24h
infraclassifier-pod-cleaner-27929190-stpph                       0/1   Completed   0   2m19s
spark-app-infra-classifier-pyspark-1675750518779665823-driver   2/2   Running     0   24h

Describing any of the pending executor pods shows that they cannot be scheduled because of memory pressure:

root@systest-runner:~[501]# kubectl describe pod -n nsxi-platform infraclassifier-875709862a83e8e4-exec-2
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ---                  ----               -------
  Warning  FailedScheduling  47s (x22 over 17m)   default-scheduler  0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 Insufficient memory.

This situation can be hit when there are just enough resources (~8 GB) to schedule the driver as requested, but the additional memory requested by the executors then exceeds what is available across the worker nodes.
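To confirm that the cluster is short on allocatable memory, compare the memory already requested on each worker node against its allocatable capacity. These are generic Kubernetes checks rather than steps from this article, and kubectl top requires a metrics server to be present:

# Requested vs. allocatable resources per node
kubectl describe nodes | grep -A 8 "Allocated resources"

# Current usage per node (requires metrics-server)
kubectl top nodes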
Cause
This happens on an on-premises NSX Intelligence Platform setup when there is insufficient memory or CPU to run all four IC executors and the driver. The driver may get adequate memory to be scheduled, but the executors are spawned only after the driver comes up and is running. If no resources are left for the executors once the driver has been scheduled, the executors remain in the "Pending" state. The driver then loops, waiting for executor resources to become available, and these pending executors sit ahead of other pods that are also awaiting scheduling, including the Recommendation and Feature Services pods.
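To see how much memory the driver and each executor actually request from the scheduler, the resource requests can be read directly from the pods. This is a generic Kubernetes query; the pod names below are the examples from the Symptoms output and will differ per deployment:

# Memory requested by the IC driver containers
kubectl get pod -n nsxi-platform spark-app-infra-classifier-pyspark-1675750518779665823-driver -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.requests.memory}{"\n"}{end}'

# Memory requested by a pending executor
kubectl get pod -n nsxi-platform infraclassifier-875709862a83e8e4-exec-2 -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.requests.memory}{"\n"}{end}'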
Resolution
This issue is resolved in VMware NSX-T 4.1.1 (build number 21761692).
The performance of the classifier has been improved so that it does not need as much memory, and a built-in memory capacity check now runs before the driver spawns its executors. Until then, the options are to scale the cluster up significantly or to suspend the scheduling of the IC sparkapp via:

kubectl edit scheduledsparkapp spark-app-infra-classifier-pyspark -n nsxi-platform
# insert this line, just above spec.schedule
suspend: true

Suspending the scheduling of IC jobs means that the workload database keeps any classifications it currently has, but receives no new updates. Users will still be able to specify their own classifications for workloads, but no automated inferences will be made until there is adequate memory to run the IC job AND the `suspend: true` line is changed to `suspend: false` (or removed entirely).
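As an alternative to editing the resource interactively, the same field can be set non-interactively with a merge patch. This is a generic Kubernetes technique rather than a step from this article, and it assumes the scheduledsparkapp resource accepts the spec.suspend field exactly as described above:

kubectl patch scheduledsparkapp spark-app-infra-classifier-pyspark -n nsxi-platform --type merge -p '{"spec":{"suspend":true}}'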
Workaround:

Step 1. Get the name of the sparkapp driver pod:
kubectl -n nsxi-platform get pods | grep spark-app-infra-classifier-pyspark

Step 2. Delete the IC driver pod:
kubectl delete pod -n nsxi-platform spark-app-infra-classifier-pyspark-XXXXXXXXXXX-driver

Step 3. (May not be necessary) Delete the sparkapp for the current run:
kubectl get sparkapp -n nsxi-platform | grep infra
Find the entry that looks like "spark-app-infra-classifier-pyspark-XXXXXXXXX" and delete it:
kubectl delete sparkapp spark-app-infra-classifier-pyspark-XXXXXXXXXXX -n nsxi-platform
Note: Sometimes cleaning up the driver alone is enough, but it is important to run the kubectl get sparkapp command and delete any other dangling IC sparkapps, if there are any.

Step 4. Suspend the scheduling of the IC sparkapp via:
kubectl edit scheduledsparkapp spark-app-infra-classifier-pyspark -n nsxi-platform
# insert this line, just above spec.schedule
suspend: true

If significant memory is added to the cluster via scale up/out, IC may be re-enabled.
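For convenience, the steps above can be collected into a small shell sketch. This is illustrative only and assumes standard kubectl behavior; resource names are discovered at run time, and the final step uses a merge patch to set spec.suspend, equivalent to the manual edit in Step 4. Review each command before running it:

#!/bin/bash
# Illustrative sketch of the workaround steps above; not an official script.
NS=nsxi-platform

# Step 1: find the IC driver pod
DRIVER=$(kubectl -n "$NS" get pods -o name | grep spark-app-infra-classifier-pyspark | grep driver)

# Step 2: delete the IC driver pod, if one exists
[ -n "$DRIVER" ] && kubectl -n "$NS" delete "$DRIVER"

# Step 3: delete any dangling IC sparkapps
for APP in $(kubectl -n "$NS" get sparkapp -o name | grep infra-classifier); do
  kubectl -n "$NS" delete "$APP"
done

# Step 4: suspend future IC runs (sets spec.suspend, as described above)
kubectl -n "$NS" patch scheduledsparkapp spark-app-infra-classifier-pyspark --type merge -p '{"spec":{"suspend":true}}'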
Additional Information
Impact/Risks: New Recommendation jobs will not start because the IC jobs are stuck, and spark-operator is waiting for them to complete.
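To confirm the impact, check whether any Recommendation spark applications or pods are waiting in the namespace. The grep pattern below is an assumption about the job naming and may need adjusting for a specific deployment:

kubectl get sparkapp -n nsxi-platform
kubectl get pods -n nsxi-platform | grep -i recommendation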