SSP has no flow data after recovering from a disruption

Article ID: 419280


Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

When SSP experiences a disruption (for example, due to NTP drift, storage disruption, or similar infrastructure issues), recovery and post-recovery validation can be challenging.

This KB describes a specific scenario where no network flows are ingested and no alarms are generated in SSP after the cluster is restored, even though the SSP cluster appears healthy. This article applies specifically to situations after SSP recovery and only when no other errors or warnings are present (such as metrics delivery failures). If additional errors are observed, refer to other relevant KB articles.


Symptoms

After SSP recovery, the following symptoms may be observed in the UI and host statistics:

No Network Flow Data in UI

When Intelligence Security is NOT enabled, no flow data is shown in:

  • System > Overview > Flow Processing Capacity

  • System > Platform & Features > Metrics > Flow Ingestion

When Intelligence Security IS enabled, no flow data is shown in:

  • Home > Security Explorer

  • Monitor & Plan > Overview > Security Explorer > Overview > Flow Trends

  • Monitor & Plan > Security Explorer

Alarms and Cluster Health

  • No alarms are generated in SSP.

  • SSP cluster health shows no reported issues.

Transport Node Metrics

  • Transport nodes are sending metrics as expected.


Initial Validation

Check the Spark rawflow-related pods and verify they are running:

root@v111-4:~# k get po -n nsxi-platform | grep rawflow
rawflowcorrelator-5a651c9506166031-exec-1 1/1 Running 0 75d 
rawflowcorrelator-5a651c9506166031-exec-2 1/1 Running 0 75d 
rawflowcorrelator-5a651c9506166031-exec-3 1/1 Running 0 75d 
spark-app-rawflow-driver 1/1 Running 0 75d

Even though the pods are running, flow processing may still be stalled.


Further Troubleshooting

If the following command shows COMPLETE_FLOWS count as 0, it indicates that the rawflow driver is not processing flows:

k logs spark-app-rawflow-driver -n nsxi-platform | grep COMPLETE_FLOWS
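
To confirm that processing is stalled rather than momentarily idle, compare the most recent counter values a few minutes apart. A minimal sketch (the exact format of the log line can vary between releases):

# Show the most recent COMPLETE_FLOWS log lines; re-run after a few minutes
k logs spark-app-rawflow-driver -n nsxi-platform --tail=5000 | grep COMPLETE_FLOWS | tail -5

If the reported count stays at 0 across checks, the driver is not processing flows.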

Check detailed logs from the Spark driver container:

k logs spark-app-rawflow-driver -c spark-kubernetes-driver -n nsxi-platform

Kafka connection failures (for example, to kafka-1) are observed across the Spark rawflow pods, indicating Kafka connectivity issues affecting the Spark rawflow application.
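
To narrow this down, filter the driver log for connection errors and confirm that the Kafka broker pods themselves are healthy. A minimal sketch, assuming the brokers run in the same nsxi-platform namespace (exact error strings vary by Kafka client version):

# Filter the driver log for Kafka-related connection errors
k logs spark-app-rawflow-driver -c spark-kubernetes-driver -n nsxi-platform | grep -iE "kafka|disconnect|timed out" | tail -20

# Confirm the Kafka broker pods are Running and not restarting
k get po -n nsxi-platform | grep kafka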

Environment

 

  • SSP 5.0.0
    This is the most common version where the issue occurs.

  • SSP 5.1.0
    Improvements to the Spark rawflow driver reduce the likelihood of encountering this issue.

 

 

Cause

The issue occurs when one of the threads in the spark-app-rawflow-driver becomes blocked indefinitely.

Key details:

  • The Spark rawflow driver has two threads:

    • rawflow_processing_query

    • rawflow_correlation_query

  • In this scenario, the rawflow_processing_query thread becomes stuck, preventing flow processing (see the thread-dump sketch below).

  • Kafka termination or connectivity issues cause the Spark application to enter a degraded state.

  • Kafka connection failures (for example, to kafka-1) are observed across Spark rawflow pods.

Additional contributing factors:

  • Queries and API calls without timeouts can hang indefinitely.

  • QueryHealthMonitor can incorrectly flag queries as “stalled” while they are only waiting for data.
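
If you want to confirm which query thread is blocked before restarting anything, a JVM thread dump of the driver can show it. A minimal sketch, assuming the driver image includes a JDK (jps/jstack) and that the stream execution thread carries the query name; replace <pid> with the driver JVM's PID:

# List JVM processes inside the driver container and note the driver PID
k exec spark-app-rawflow-driver -c spark-kubernetes-driver -n nsxi-platform -- jps

# Dump threads and look for the stuck query's execution thread
k exec spark-app-rawflow-driver -c spark-kubernetes-driver -n nsxi-platform -- jstack <pid> | grep -A 20 rawflow_processing_query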

Resolution

Workaround

Restart the Spark rawflow driver pod to restore flow processing.

Steps:

  1. Log in to the active SSP instance with admin (SSP 5.0) or sysadmin (SSP 5.1) access.

  2. Delete the rawflow driver pod:

     
    k delete pod spark-app-rawflow-driver -n nsxi-platform
  3. Wait for the pod to be recreated and verify its status:

     
    k get pods -A | grep rawflow

    (Confirm that spark-app-rawflow-driver is running.)

Once the pod restarts, flow ingestion and UI visibility should resume.
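
The steps above can be combined into a short sequence. A sketch, assuming the Spark operator recreates the driver pod under the same name (the new pod can take a few minutes to appear, so re-run the wait command if it reports the pod as not found):

k delete pod spark-app-rawflow-driver -n nsxi-platform
k wait --for=condition=Ready pod/spark-app-rawflow-driver -n nsxi-platform --timeout=10m

# Verify that flow processing has resumed (the COMPLETE_FLOWS count should start increasing)
k logs spark-app-rawflow-driver -n nsxi-platform | grep COMPLETE_FLOWS | tail -5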

 

Additional Information

Reference KB: 

NOTE: This issue will be fixed in a future release.