When SSP experiences a disruption (for example, due to NTP drift, a storage outage, or similar infrastructure issues), recovery and post-recovery validation can be challenging.
This KB describes a specific scenario in which no network flows are ingested and no alarms are generated after the SSP cluster is restored, even though the SSP cluster appears healthy. It applies specifically to situations after SSP recovery and only when no other errors or warnings (such as metrics delivery failures) are present. If additional errors are observed, refer to the relevant KB articles.
After SSP recovery, the following symptoms may be observed in the UI and host statistics. No flow data is displayed under:
System > Overview > Flow Processing Capacity
System > Platform & Features > Metrics > Flow Ingestion
Home > Security Explorer
Monitor & Plan > Overview > Security Explorer > Overview > Flow Trends
Monitor & Plan > Security Explorer
No alarms are generated in SSP.
SSP cluster health shows no reported issues.
Transport nodes are sending metrics as expected.
Check the Spark rawflow-related pods and verify they are running:
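The exact command is not reproduced in this article; a minimal sketch using standard kubectl (the all-namespaces listing and the rawflow name pattern are assumptions and may differ in your deployment):

# List the Spark rawflow pods across all namespaces and confirm they show STATUS Running
kubectl get pods -A | grep rawflow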
Even though the pods are running, flow processing may still be stalled.
If the following command shows a COMPLETE_FLOWS count of 0, the rawflow driver is not processing flows:
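The original command is not shown here; one hypothetical way to verify the counter, assuming the driver reports COMPLETE_FLOWS in its log output (the pod name is taken from this article and the namespace placeholder is an assumption):

# Hypothetical check: search the driver log for the COMPLETE_FLOWS counter
kubectl logs -n <ssp-namespace> spark-app-rawflow-driver | grep COMPLETE_FLOWS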
Check detailed logs from the Spark driver container:
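A sketch for retrieving the driver logs; the namespace placeholder is an assumption, and --all-containers avoids guessing the container name (use -c <container> to narrow to the Spark driver container if needed):

# Retrieve logs from all containers in the rawflow driver pod
kubectl logs -n <ssp-namespace> spark-app-rawflow-driver --all-containers=true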
Kafka connection failures (for example, to kafka-1) are observed across the Spark rawflow pods. This indicates Kafka connectivity issues affecting the Spark rawflow application.
SSP 5.0.0: The version in which the issue is most commonly observed.
SSP 5.1.0: Improvements to the Spark rawflow driver reduce the likelihood of encountering this issue.
The issue occurs when one of the threads in the spark-app-rawflow-driver becomes blocked indefinitely.
Key details:
The Spark rawflow driver has two threads:
rawflow_processing_query
rawflow_correlation_query
In this scenario, the rawflow_processing_query thread becomes stuck, preventing flow processing.
Kafka termination or connectivity issues cause the Spark application to enter a degraded state.
Kafka connection failures (for example, to kafka-1) are observed across Spark rawflow pods.
Additional contributing factors:
Queries and API calls without timeouts can hang indefinitely.
QueryHealthMonitor can incorrectly flag queries as "stalled" while they are waiting for data.
Restart the Spark rawflow driver pod to restore flow processing.
Log in to the active SSPI with admin (SSPI 5.0) or sysadmin (SSPI 5.1) access.
Delete the rawflow driver pod:
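A sketch, assuming the pod name matches the one referenced in this article (it may carry a suffix) and substituting your SSP namespace:

# Delete the rawflow driver pod; its controller recreates it automatically
kubectl delete pod -n <ssp-namespace> spark-app-rawflow-driver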
Wait for the pod to be recreated and verify its status:
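For example (the all-namespaces listing avoids assuming the namespace):

# Check that the recreated pod reaches the Running state
kubectl get pods -A | grep rawflow-driver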
Confirm that the spark-app-rawflow-driver pod is in the Running state.
Once the pod restarts, flow ingestion and UI visibility should resume.
Reference KB:
NOTE: This issue will be fixed in a future release.