Flows are not seen in the Plan and Troubleshoot view in NSX Intelligence 3.2.1 or 4.0.1.
Even though the status of spark-app-rawflow-driver and spark-app-overflow-driver shows as 'Running', the app is not processing flows.
This can be determined by checking the following logs. Log in to the NSX Manager as the root user and run the following commands:
napp-k logs spark-app-rawflow-driver -c spark-kubernetes-driver
napp-k logs spark-app-overflow-driver -c spark-kubernetes-driver
If you see logs with this exception, then the flow processing app is in an error state and needs to be restarted:
ERROR JobScheduler JobScheduler - Error in job generator
java.lang.IllegalStateException: JobGenerator has already been stopped accidentally.
    at org.apache.spark.util.EventLoop.post(EventLoop.scala:107)
    at org.apache.spark.streaming.scheduler.JobGenerator.$anonfun$timer$1(JobGenerator.scala:63)
    at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
    at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
    at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
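To check for this signature without reading the whole log, the driver output can be filtered for the exception message. The following is a minimal sketch and not part of the official procedure: the function name has_jobgenerator_error is hypothetical, and it assumes only that the log contains the message text shown in the exception above.

```shell
# Hypothetical helper: reads log text on stdin and succeeds (exit 0)
# only if the JobGenerator exception from this article is present.
has_jobgenerator_error() {
    grep -q 'JobGenerator has already been stopped'
}
```

For example, as root on the NSX Manager, the output of `napp-k logs spark-app-rawflow-driver -c spark-kubernetes-driver` can be piped into `has_jobgenerator_error`; a zero exit status suggests that the driver pod needs to be restarted.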
VMware NSX
This issue occurs after a prolonged Kafka outage or when the MinIO cluster is unavailable for reads or writes. The MinIO cluster can become unavailable for writes because of a known issue in 4.0.1.
This issue is resolved in NSX Intelligence 4.1.1.
Workaround:
Restart the driver pods that have the above error in the logs. If both spark-app-rawflow-driver and spark-app-overflow-driver pods have the same error, then delete both. They will be automatically restarted by Kubernetes.
napp-k delete pod spark-app-rawflow-driver
napp-k delete pod spark-app-overflow-driver
Wait about 5-10 minutes for the pods to start again and ensure that you do not see the same error in the logs.
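The workaround above can be sketched as a small helper that deletes a driver pod only when its log shows the exception. This is an illustration under stated assumptions, not a supported script: restart_if_failed and the NAPP_K variable are hypothetical names introduced here, and the only commands used are the napp-k logs and napp-k delete pod invocations already shown in this article.

```shell
# Assumption: NAPP_K names the CLI wrapper used in this article. It is a
# variable only so the command can be overridden, for example in a dry run.
NAPP_K="${NAPP_K:-napp-k}"

# Hypothetical helper: deletes the given driver pod only if its log
# contains the JobGenerator exception; Kubernetes then recreates the pod.
restart_if_failed() {
    pod="$1"
    if "$NAPP_K" logs "$pod" -c spark-kubernetes-driver 2>&1 \
        | grep -q 'JobGenerator has already been stopped'; then
        echo "restarting $pod"
        "$NAPP_K" delete pod "$pod"
    else
        echo "no JobGenerator error in $pod"
    fi
}
```

Calling `restart_if_failed spark-app-rawflow-driver` and then `restart_if_failed spark-app-overflow-driver` covers both drivers. Running the same calls again after the 5-10 minute wait reports whether the error has returned, but note that it would also delete the pod again if it has.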
If you do see the same error, check the status of the Kafka and MinIO services. If both services are running, check whether the MinIO disk is full using KB 91696. Once the MinIO disk has been cleaned up according to the instructions in the linked KB, restart the driver pods again.
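One way to check the service status is to filter a pod listing for Kafka and MinIO pods that are not running. The sketch below is an assumption-laden illustration: the function name unhealthy_backing_pods is hypothetical, it assumes that napp-k get pods produces the usual kubectl column layout (NAME, READY, STATUS, ...), and it assumes that the relevant pod names contain "kafka" or "minio".

```shell
# Hypothetical helper: reads `get pods` style output on stdin and prints
# the name and status of Kafka or MinIO pods that are neither Running
# nor Completed. Assumes the pod name is in column 1 and the status in
# column 3; adjust the name pattern to match the deployment.
unhealthy_backing_pods() {
    awk '$1 ~ /kafka|minio/ && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

For example, piping the output of `napp-k get pods` into `unhealthy_backing_pods` prints nothing when every matching pod is healthy. An empty result does not rule out a full MinIO disk, so the KB 91696 check still applies.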
Impact/Risks:
Users will be unable to see any flow records received by NSX Intelligence after this error was encountered. No new traffic can be seen in the UI and hence no recommendations can be generated for this new traffic.