New flows not showing up in Security Explorer/Visibility & Planning

Products

VMware vDefend Firewall

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

On setups with resource constraints or high latency, the Latestflow pod can take longer than usual to process flows and write them to Kafka.

This causes too many messages to queue in the Latestflow pod, and the pod is eventually terminated for running out of memory (OOM).

Symptom:

When this issue occurs, new flows may not appear on the Security Explorer/Visibility & Planning canvas in the SSP UI.

Environment

SSP Version >= 5.0

Cause

Some setups we've observed in testing have higher than expected latency when producing messages to Kafka. This can be due to network slowness, resource contention, or other factors.

For example, the kafka-producer-perf-test.sh tool present in the cluster-api pod can be used to benchmark Kafka producer performance.

To run this test:

SSH into the SSPI as root or sysadmin, depending on the version (5.0 / 5.1)

Exec into the cluster-api pod

k -n nsxi-platform get pods | grep cluster-api

k -n nsxi-platform exec -it <name from previous command> -c cluster-api -- bash

Run the command:

/opt/kafka/bin/kafka-producer-perf-test.sh --topic correlated_flow_viz --num-records 1000 --record-size 1024 --throughput -1 --producer.config /root/adminclient.props

Healthy Setup:
- 1000 records sent
- 1428.6 records/sec (1.40 MB/sec)
- 168.49 ms avg latency, 612.00 ms max latency
Slow Setup:
- 1000 records sent
- 691.1 records/sec (0.67 MB/sec)
- 317.61 ms avg latency, 1221.00 ms max latency

The throughput of the slow setup is less than half that of the healthy setup, and its average latency is nearly double.
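The comparison above can be scripted. The following is a minimal sketch, assuming the tool's summary line has the same form as the figures shown above (records sent, records/sec, avg latency, max latency). The function names parse_perf_summary and is_slow, and any threshold passed to is_slow, are hypothetical and chosen only for illustration; this article does not define an official throughput limit.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and thresholds are illustrative assumptions.

parse_perf_summary() {
  # Reads one perf-test summary line on stdin and prints
  # "<records_per_sec> <avg_latency_ms>".
  sed -nE 's/.*[, ]([0-9.]+) records\/sec.* ([0-9.]+) ms avg latency.*/\1 \2/p'
}

is_slow() {
  # $1 = measured records/sec, $2 = minimum acceptable records/sec
  # (hypothetical value; choose one based on a known healthy setup).
  # Returns 0 when the measured rate is below the minimum.
  awk -v rate="$1" -v min="$2" 'BEGIN { exit !(rate < min) }'
}
```

For example, piping the last line of the perf test output through parse_perf_summary yields the two numbers, and is_slow can then compare the first against a baseline taken from a healthy setup.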

When producer slowness occurs, messages can back up in the Latestflow pod, causing it to run out of memory (OOM). To check for this, list the Latestflow pods:

SSH into the SSPI as root or sysadmin, depending on the version (5.0 / 5.1)
Exec into the cluster-api pod

k -n nsxi-platform get pods | grep latestflow

On an affected setup, the pods show one or more restarts. The pod events or pod logs will contain a memory-related error message.
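To pick out restarted pods from the listing, the RESTARTS column can be filtered. This is a minimal sketch, assuming the default column layout of the pod listing (NAME, READY, STATUS, RESTARTS, AGE); the function name pods_with_restarts is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the default "get pods" table layout.

pods_with_restarts() {
  # Reads pod listing output on stdin and prints "<pod> <restarts>"
  # for every pod whose RESTARTS column (4th field) is greater than zero.
  # The header row, if present, is skipped.
  awk '$1 != "NAME" && $4 + 0 > 0 { print $1, $4 }'
}
```

It can be placed at the end of the listing pipeline, for example: k -n nsxi-platform get pods | grep latestflow | pods_with_restarts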

Output similar to the following can be found when describing the pod:

Labels
alertname = PodOOMKilled
container = latestflow
namespace = nsxi-platform
pod = latestflow-758bc5dfd5-6vkgx
reason = OOMKilled
severity = critical
uid = 5cc1ecd3-4ef5-4e78-80dd-9d9cc7fdcb9d
Annotations
description = Pod nsxi-platform/latestflow-758bc5dfd5-6vkgx container latestflow was terminated due to out-of-memory.
summary = Pod nsxi-platform/latestflow-758bc5dfd5-6vkgx was OOMKilled
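The termination reason can also be read directly from the pod status. The following is a minimal sketch, assuming that k on the SSPI is an alias for kubectl and that the container is named latestflow, as in the output above; the function name last_termination_reason is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes kubectl access to the nsxi-platform namespace.

last_termination_reason() {
  # $1 = pod name.
  # Prints the reason recorded for the last termination of the
  # latestflow container (for example OOMKilled), or nothing if the
  # container has not been terminated.
  kubectl -n nsxi-platform get pod "$1" \
    -o jsonpath='{.status.containerStatuses[?(@.name=="latestflow")].lastState.terminated.reason}'
}
```

Calling it with a pod name taken from the pod listing prints OOMKilled when the previous container instance was killed for exceeding its memory limit.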

Resolution

Basic configuration check:

For flows to appear in Security Intelligence, one of the following must be true:

The VMs are attached to an NSX Overlay or VLAN Segment.
The DVPG that the workloads use is managed by NSX.

If the VMs are legacy, you can keep the VMs on your existing Distributed Virtual Port Groups (DVPGs).

You do not need to migrate them to NSX Overlay segments or change their IP networking.

Instead, you can explicitly tell NSX to "protect" those existing Port Groups. If NSX is ignoring them, no flows are reported for the workloads on them.

The following steps enable NSX on DVPGs:

In NSX Manager, go to Security > Distributed Firewall > Actions.
Look for an option like "Activate NSX on Distributed Virtual Port Groups".
Select the specific DVPGs where your workloads reside.

What this does is:

It leaves the networking (VLANs/IPs) exactly as they are.
It inserts the NSX Security shim into those Port Groups.
It enables the Distributed Firewall, which immediately starts generating the Flow Records that Security Intelligence needs.

If the configuration is correct and the issue persists, contact Broadcom support for further assistance.