No Flow - No Alarm in SSP

Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

There are several possible reasons why an SSP cluster may not show flow data. In many cases, you may see alarms in NSX or SSP such as “metrics delivery failure” or “transport node disconnection.” These typically indicate issues on the source side.

If no alarms are present but metrics still do not appear in SSP, the issue is likely on the SSP side. In this article, we will focus on a common SSP-side problem where the rawflow-driver pods fail to start, which prevents flow data from being ingested. This behavior is most frequently observed in SSP 5.0.

You can notice this issue from various locations in the UI, but the following pages provide the clearest signs:

The same pattern also appears in other places, such as Monitoring & Plan > Overview > Visibility & Planning > Overview or Flow insight.

UI example:

In addition to the UI symptoms, you will also notice that the rawflow-driver pods are not present when checking from the SSP CLI. In a healthy cluster, these pods should appear similar to the example shown below.

k get pods -A | grep rawflow
nsxi-platform       rawflowcorrelator-f-exec-3         1/1     Running             0                 7d
nsxi-platform       rawflowcorrelator-f-exec-4         0/1     Init:0/1            0                 2d13h
nsxi-platform       rawflowcorrelator-f-exec-5         0/1     Init:0/1            0                 2d13h
nsxi-platform       spark-app-rawflow-driver           1/1     Running             0                 7d
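
The check above can also be scripted. The following is a minimal sketch, not an official tool: the function name rawflow_driver_running is hypothetical, and it assumes that `k` is an alias for kubectl and that the listing uses the column order shown above (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: reads a `kubectl get pods -A` listing on stdin and
# returns 0 only if a spark-app-rawflow-driver pod is in the Running state.
# Assumes column 2 is the pod name and column 4 is the status.
rawflow_driver_running() {
    awk '$2 ~ /^spark-app-rawflow-driver/ && $4 == "Running" { found = 1 }
         END { exit !found }'
}

# Usage on the SSP CLI (assumes kubectl is what the `k` alias points to):
#   kubectl get pods -A | rawflow_driver_running \
#       && echo "rawflow driver is running" \
#       || echo "rawflow driver is missing or not running"
```

A non-zero return here, combined with no alarms in NSX or SSP, matches the symptom described in this article.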

Environment

SSP 5.0: This is the most common environment where the issue occurs.

SSP 5.1: The Spark operator has been improved, so this issue is less likely to be seen.

Cause

If the SSP/SSPI cluster experiences disruptions, such as an NTP synchronization failure that causes the cluster to become inaccessible, internal components and pod synchronization can be affected.

Even after time and NTP settings are corrected and the cluster appears to recover, pod deployment issues may still occur. In particular, the spark-operator pod, which is responsible for starting the rawflowcorrelator and spark-app-rawflow-driver pods, may not function properly, preventing them from being created.

Resolution

SSH into the SSPI cluster and run the following commands to verify and fix the issue.

1. Identify the Spark Operator pod

k get pods -n nsxi-platform | grep spark-operator
nsxi-platform       spark-operator-xxxxxxx                     1/1     Running             4                21d

2. Delete the Spark Operator pod

k delete pod <spark-operator-pod-name-that-you-noted-from-above> -n nsxi-platform
Example:
k delete pod spark-operator-xxxxxxx -n nsxi-platform
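
Steps 1 and 2 can be combined so the pod name does not have to be copied by hand. This is a sketch under assumptions: the function names are hypothetical, `k` is assumed to be an alias for kubectl, and the operator pod is assumed to be recreated by its controller after deletion, as the procedure in this article implies.

```shell
# Hypothetical helper: reads `kubectl get pods -o name` output on stdin
# (lines of the form pod/<name>) and prints the names that start with
# "spark-operator".
spark_operator_pods() {
    sed -n 's|^pod/\(spark-operator.*\)$|\1|p'
}

# Hypothetical wrapper: looks up the Spark Operator pod(s) in the
# nsxi-platform namespace and deletes each one.
restart_spark_operator() {
    ns="nsxi-platform"
    pods=$(kubectl get pods -n "$ns" -o name | spark_operator_pods)
    if [ -z "$pods" ]; then
        echo "no spark-operator pod found in $ns" >&2
        return 1
    fi
    for pod in $pods; do
        kubectl delete pod "$pod" -n "$ns"
    done
}

# Usage on the SSP CLI:
#   restart_spark_operator
```

The lookup uses `-o name` rather than parsing the table output, so the result does not depend on column positions.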

3. Verify that all rawflow pods are created

k get pods -A | grep rawflow
Example:
sysadmin@xxxx:~$ k get pods -A | grep rawflow
nsxi-platform       rawflowcorrelator-f7ac529a55fb2731-exec-3                         1/1     Running             0                15d
nsxi-platform       rawflowcorrelator-f7ac529a55fb2731-exec-4                         0/1     Init:0/1            0                11d
nsxi-platform       rawflowcorrelator-f7ac529a55fb2731-exec-5                         0/1     Init:0/1            0                11d
nsxi-platform       spark-app-rawflow-driver                                          1/1     Running             0                15d
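
The pods may take some time to be created after the Spark Operator pod is deleted, so the verification can be repeated until the driver appears. The loop below is an illustrative sketch only: the function name, the default of 30 attempts, and the 10-second delay are assumptions, not values from the product documentation, and `k` is assumed to be an alias for kubectl.

```shell
# Hypothetical helper:
#   wait_for_rawflow_driver LIST_CMD [ATTEMPTS] [DELAY_SECONDS]
# Runs LIST_CMD repeatedly and returns 0 as soon as its output contains a
# spark-app-rawflow-driver pod with STATUS Running (column 4 of a
# `kubectl get pods -A` listing). Returns 1 if all attempts are used up.
wait_for_rawflow_driver() {
    list_cmd=$1
    attempts=${2:-30}   # assumed default, adjust as needed
    delay=${3:-10}      # assumed default, in seconds
    i=1
    while [ "$i" -le "$attempts" ]; do
        if $list_cmd | awk '$2 ~ /^spark-app-rawflow-driver/ && $4 == "Running" { ok = 1 }
                            END { exit !ok }'; then
            echo "spark-app-rawflow-driver is Running"
            return 0
        fi
        # Pause between attempts, but not after the final one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "spark-app-rawflow-driver not Running after $attempts attempts" >&2
    return 1
}

# Usage on the SSP CLI:
#   wait_for_rawflow_driver "kubectl get pods -A"
```

Passing the listing command as an argument keeps the loop independent of how pods are listed, which also makes it easy to try against saved output.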