Symptoms:
NSX Application Platform status shows "Degraded" in the NSX Manager UI -> System -> NSX Application Platform tab.
When you SSH into the NSX Manager CLI as root and run the command below:
napp-k get pods | grep spark-
You will notice that one or more of the spark-app-overflow-driver, spark-app-rawflow-driver, and spark-app-context-driver pods are stuck in the Init:Error state.
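As a convenience, the check above can be narrowed to only the affected pods. The following is a minimal sketch, not part of the product tooling: the helper name stuck_spark_drivers is hypothetical, and it assumes the pod listing uses the usual column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: reads a pod listing on stdin and prints the names of
# Spark driver pods whose STATUS column is Init:Error.
# Assumes the column order NAME READY STATUS RESTARTS AGE.
stuck_spark_drivers() {
  awk '$1 ~ /^spark-app-.*-driver/ && $3 == "Init:Error" { print $1 }'
}

# Usage from the NSX Manager root shell:
#   napp-k get pods | stuck_spark_drivers
```

If the helper prints nothing, no Spark driver pod is currently in the Init:Error state.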
Environment:
NAPP 4.1.x, 4.2.x
Cause:
This is a known issue in the Spark submit code for the Kubernetes resource manager: it creates the driver pod before creating the config maps that the driver needs to mount. As a result, depending on hardware, timing, and responsiveness, pod creation occasionally fails because the config map to be mounted has not been created yet.
Resolution:
Option 1:
Restart the Spark operator pod:
- From the NSX Manager CLI with root access, find the spark-operator pod using:
napp-k get pods | grep spark-operator
- Copy the pod name from the output above and run:
napp-k delete pod <spark-operator-xxxx-copied-from-above>
Check whether the driver pods return to the "Running" state by executing:
napp-k get pods | grep spark-app
If they are still not in the Running state, proceed to Option 2.
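The Option 1 steps above can be sketched as a small set of shell functions. This is an illustrative sketch only: the function names are hypothetical, the 30 checks at 10-second intervals are an arbitrary choice, and it assumes napp-k resolves in the shell where the functions are defined (for example, when pasted into the interactive root session) and that the pod listing uses the column order NAME, READY, STATUS, RESTARTS, AGE.

```shell
# Hypothetical helper: prints the first pod name starting with
# "spark-operator" from a pod listing read on stdin.
spark_operator_pod() {
  awk '$1 ~ /^spark-operator/ { print $1; exit }'
}

# Hypothetical helper: exits 0 only if at least one spark-app pod is listed
# and every spark-app pod has STATUS Running.
spark_apps_running() {
  awk '$1 ~ /^spark-app/ { seen = 1; if ($3 != "Running") bad = 1 }
       END { exit (seen && !bad) ? 0 : 1 }'
}

# Deletes the spark-operator pod, then polls the spark-app pods.
# The polling window (30 checks, 10 s apart) is an assumption.
restart_spark_operator() {
  pod=$(napp-k get pods | spark_operator_pod)
  [ -n "$pod" ] || { echo "spark-operator pod not found" >&2; return 1; }
  napp-k delete pod "$pod" || return 1
  i=0
  while [ "$i" -lt 30 ]; do
    if napp-k get pods | spark_apps_running; then
      echo "Spark driver pods are Running"
      return 0
    fi
    sleep 10
    i=$((i + 1))
  done
  echo "Driver pods are still not Running; continue with Option 2" >&2
  return 1
}
```

Running restart_spark_operator from the root shell performs the delete and the follow-up check in one step. A non-zero return means the pods did not recover within the polling window.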
Option 2:
Delete the NSX Intelligence and Network Detection features from the NSX Manager -> NSX Application Platform tab.
Then activate them again.
For more information on how to disable and re-enable these features, refer to the documents below: