SSP has no flow data after recovering from a disruption

Article ID: 419280


Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

When SSP experiences a disruption (for example, due to NTP drift, storage disruption, or similar infrastructure issues), recovery and post-recovery validation can be challenging.

This KB describes a specific scenario where no network flows are ingested and no alarms are generated in SSP after the cluster is restored, even though the SSP cluster appears healthy. This article applies specifically to situations after SSP recovery and only when no other errors or warnings are present (such as metrics delivery failures). If additional errors are observed, refer to other relevant KB articles.


Symptoms

After SSP recovery, the following symptoms may be observed in the UI and host statistics:

No Network Flow Data in UI

When Intelligence Security is NOT enabled, no flow data is shown in:

  • System > Overview > Flow Processing Capacity

  • System > Platform & Features > Metrics > Flow Ingestion

When Intelligence Security IS enabled, no flow data is shown in:

  • Home > Security Explorer

  • Monitor & Plan > Overview > Security Explorer > Overview > Flow Trends

  • Monitor & Plan > Security Explorer

Alarms and Cluster Health

  • No alarms are generated in SSP.

  • SSP cluster health shows no reported issues.

Transport Node Metrics

  • Transport nodes are sending metrics as expected.


Initial Validation

Check the Spark rawflow-related pods and verify they are running:

root@v111-4:~# k get po -n nsxi-platform | grep rawflow
rawflowcorrelator-5a651c9506166031-exec-1 1/1 Running 0 75d 
rawflowcorrelator-5a651c9506166031-exec-2 1/1 Running 0 75d 
rawflowcorrelator-5a651c9506166031-exec-3 1/1 Running 0 75d 
spark-app-rawflow-driver 1/1 Running 0 75d

Even though the pods are running, flow processing may still be stalled.


Further Troubleshooting

If the following command shows COMPLETE_FLOWS count as 0, it indicates that the rawflow driver is not processing flows:

k logs spark-app-rawflow-driver -n nsxi-platform | grep COMPLETE_FLOWS
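
To confirm that processing is stalled rather than momentarily idle, compare the most recent counter values a few minutes apart. A minimal sketch (the exact format of the log line can vary between releases):

# Show the most recent COMPLETE_FLOWS log lines; re-run after a few minutes
k logs spark-app-rawflow-driver -n nsxi-platform --tail=5000 | grep COMPLETE_FLOWS | tail -5

If the reported count stays at 0 across checks, the driver is not processing flows.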

Check detailed logs from the Spark driver container:

k logs spark-app-rawflow-driver -c spark-kubernetes-driver -n nsxi-platform

Kafka connection failures (for example, to kafka-1) are observed across the Spark rawflow pods, indicating Kafka connectivity issues affecting the Spark rawflow application.
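
To narrow this down, filter the driver log for connection errors and confirm that the Kafka broker pods themselves are healthy. A minimal sketch, assuming the brokers run in the same nsxi-platform namespace (exact error strings vary by Kafka client version):

# Filter the driver log for Kafka-related connection errors
k logs spark-app-rawflow-driver -c spark-kubernetes-driver -n nsxi-platform | grep -iE "kafka|disconnect|timed out" | tail -20

# Confirm the Kafka broker pods are Running and not restarting
k get po -n nsxi-platform | grep kafka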

Environment

 

  • SSP 5.0.0
    This is the most common version where the issue occurs.

  • SSP 5.1.0
    Improvements to the Spark rawflow driver reduce the likelihood of encountering this issue.

 

 

Cause

The issue occurs when one of the threads in the spark-app-rawflow-driver becomes blocked indefinitely.

Key details:

  • The Spark rawflow driver has two threads:

    • rawflow_processing_query

    • rawflow_correlation_query

  • In this scenario, the rawflow_processing_query thread becomes stuck, preventing flow processing (see the thread-dump sketch below).

  • Kafka termination or connectivity issues cause the Spark application to enter a degraded state.

  • Kafka connection failures (for example, to kafka-1) are observed across Spark rawflow pods.

Additional contributing factors:

  • Queries and API calls without timeouts can hang indefinitely.

  • QueryHealthMonitor can incorrectly flag queries as “stalled” while they are only waiting for data.
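
If you want to confirm which query thread is blocked before restarting anything, a JVM thread dump of the driver can show it. A minimal sketch, assuming the driver image includes a JDK (jps/jstack) and that the stream execution thread carries the query name; replace <pid> with the driver JVM's PID:

# List JVM processes inside the driver container and note the driver PID
k exec spark-app-rawflow-driver -c spark-kubernetes-driver -n nsxi-platform -- jps

# Dump threads and look for the stuck query's execution thread
k exec spark-app-rawflow-driver -c spark-kubernetes-driver -n nsxi-platform -- jstack <pid> | grep -A 20 rawflow_processing_query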

Resolution

Workaround

Restart the Spark rawflow driver pod to restore flow processing.

Steps:

  1. Log in to the active SSP instance with admin (SSP 5.0) or sysadmin (SSP 5.1) access.

  2. Delete the rawflow driver pod:

     
    k delete pod spark-app-rawflow-driver -n nsxi-platform
  3. Wait for the pod to be recreated and verify its status:

     
    k get pods -A | grep rawflow

    (Confirm that spark-app-rawflow-driver is running.)

Once the pod restarts, flow ingestion and UI visibility should resume.
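
The steps above can be combined into a short sequence. A sketch, assuming the Spark operator recreates the driver pod under the same name (the new pod can take a few minutes to appear, so re-run the wait command if it reports the pod as not found):

k delete pod spark-app-rawflow-driver -n nsxi-platform
k wait --for=condition=Ready pod/spark-app-rawflow-driver -n nsxi-platform --timeout=10m

# Verify that flow processing has resumed (the COMPLETE_FLOWS count should start increasing)
k logs spark-app-rawflow-driver -n nsxi-platform | grep COMPLETE_FLOWS | tail -5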

 

Additional Information

Reference KB: 

NOTE: This issue will be fixed in a future release.