On setups with resource constraints or high latency, the Latestflow pod can experience additional delay when it processes flows and writes them to Kafka.
As a result, too many messages queue up in the Latestflow pod, and the pod is eventually terminated for running out of memory (OOM).
Symptom:
If this issue is occurring, new flows may not show up on the Security Explorer/Visibility & Planning canvas in the SSP UI.
SSP Version >= 5.0
Some setups we've observed in testing have higher than expected latency when producing messages to Kafka. This can be due to network slowness, resource contention, or other factors.
To check for this, we can use the kafka-producer-perf-test.sh tool present in the cluster-api pod to benchmark the performance of Kafka producers.
To run this test:
k -n nsxi-platform get pods | grep cluster-api
k -n nsxi-platform exec -it <name from previous command> -c cluster-api -- bash
/opt/kafka/bin/kafka-producer-perf-test.sh --topic correlated_flow_viz --num-records 1000 --record-size 1024 --throughput -1 --producer.config /root/adminclient.props
Healthy Setup:
1000 records sent
1428.6 records/sec (1.40 MB/sec)
168.49 ms avg latency, 612.00 ms max latency
Slow Setup:
1000 records sent
691.1 records/sec (0.67 MB/sec)
317.61 ms avg latency, 1221.00 ms max latency
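You can see that the throughput of the slow setup is less than half that of the healthy setup (691.1 vs. 1428.6 records/sec), and the average latency is nearly double (317.61 ms vs. 168.49 ms).
When producer slowness occurs, messages can back up in the Latestflow pod, causing it to run out of memory. If we look for Latestflow pods with: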
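The comparison above can also be scripted. The following is a minimal sketch, not part of the product tooling: extract_rps and throughput_pct are hypothetical helper names, and the sketch assumes the summary text contains a number immediately followed by the token "records/sec", as in the sample output shown here.

```shell
#!/bin/sh
# Hypothetical helpers for comparing kafka-producer-perf-test.sh results.

# Read perf-test summary text on stdin and print the number that
# immediately precedes the first "records/sec" token.
extract_rps() {
  awk '{
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^records\/sec/) { print $(i - 1); exit }
    }
  }'
}

# Print the first throughput as a percentage of the second (one decimal place).
# $1 = records/sec on the setup under test, $2 = records/sec on a healthy setup.
throughput_pct() {
  awk -v slow="$1" -v healthy="$2" 'BEGIN { printf "%.1f\n", (slow / healthy) * 100 }'
}
```

For the sample numbers above, throughput_pct 691.1 1428.6 prints 48.4, which matches the observation that the slow setup delivers less than half the throughput of the healthy one. What counts as "too slow" for a given environment is not defined by this sketch.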
k -n nsxi-platform get pods | grep latestflow
We will see that the pods have one or more restarts listed. Investigating the pod events or pod logs will reveal a memory-related error message.
Output similar to the following can be found when describing the pod:
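To narrow the pod listing down to pods that have actually restarted, the output can be filtered. This is a minimal sketch: restarted_pods is a hypothetical helper name, and it assumes the default "get pods" column layout (NAME, READY, STATUS, RESTARTS, AGE), so that the restart count is the fourth field. The commented usage lines assume that k is a shorthand for kubectl, as in the commands in this article.

```shell
#!/bin/sh
# Hypothetical helper: read "get pods" output on stdin and print the name and
# restart count of every pod whose RESTARTS column (4th field) is non-zero.
# The header row, if present, is skipped.
restarted_pods() {
  awk '$1 != "NAME" && $4 + 0 > 0 { print $1, $4 }'
}

# Example usage (run where k/kubectl is available):
#   k -n nsxi-platform get pods | grep latestflow | restarted_pods
#
# To read the last termination reason of a pod's containers directly:
#   k -n nsxi-platform get pod <pod name> \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```

A value of OOMKilled from the second command would be consistent with the memory issue described here.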
Labels
alertname = PodOOMKilled
container = latestflow
namespace = nsxi-platform
pod = latestflow-758bc5dfd5-6vkgx
reason = OOMKilled
severity = critical
uid = 5cc1ecd3-4ef5-4e78-80dd-9d9cc7fdcb9d
Annotations
description = Pod nsxi-platform/latestflow-758bc5dfd5-6vkgx container latestflow was terminated due to out-of-memory.
summary = Pod nsxi-platform/latestflow-758bc5dfd5-6vkgx was OOMKilled
Basic configuration check:
For flows to appear in Security Intelligence, one of the following must be true:
The VMs are attached to an NSX Overlay or VLAN Segment.
The DVPG that the workloads are using is managed by NSX.
Legacy VMs can stay on your existing Distributed Virtual Port Groups (DVPGs).
You do not need to migrate them to NSX Overlay segments or change their IP networking.
Instead, you can explicitly tell NSX to "protect" those existing port groups. NSX might otherwise ignore them, which could result in zero flows.
To enable NSX on DVPGs:
In NSX Manager, go to Security > Distributed Firewall > Actions.
Look for an option like "Activate NSX on Distributed Virtual Port Groups".
Select the specific DVPGs where your workloads reside.
What this does is:
It leaves the networking (VLANs/IPs) exactly as they are.
It inserts the NSX Security shim into those Port Groups.
It enables the Distributed Firewall, which will immediately start generating the flow records that Security Intelligence needs.
If the configuration is correct and the issue persists, contact Broadcom Support for further assistance.