On setups with resource constraints or high latency, the Latestflow pod can be delayed when it processes flows and writes them to Kafka.
As a result, messages accumulate in the Latestflow pod's queue until the pod runs out of memory (OOM) and is killed.
Symptom:
If this issue is occurring, new flows may not appear on the Security Explorer/Visibility & Planning canvas in the SSP UI.
SSP Version >= 5.0
Some setups observed in testing have higher-than-expected latency when producing messages to Kafka. This can be due to network slowness, resource contention, or other factors.
For example, the kafka-producer-perf-test.sh tool present in the cluster-api pod can be used to benchmark Kafka producer performance.
To run this test:
k -n nsxi-platform get pods | grep cluster-api
k -n nsxi-platform exec -it <name from previous command> -c cluster-api -- bash
/opt/kafka/bin/kafka-producer-perf-test.sh --topic correlated_flow_viz --num-records 1000 --record-size 1024 --throughput -1 --producer.config /root/adminclient.props
Healthy Setup:
1000 records sent
1428.6 records/sec (1.40 MB/sec)
168.49 ms avg latency, 612.00 ms max latency
Slow Setup:
1000 records sent
691.1 records/sec (0.67 MB/sec)
317.61 ms avg latency, 1221.00 ms max latency
In this example, the slow setup's throughput is less than half that of the healthy setup, and its average latency is nearly double.
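To make the two numbers that matter easier to compare between runs, the benchmark output can be piped through a small filter. The following is a minimal sketch, not part of the product: the helper name perf_summary is hypothetical, and it assumes the summary text has the same wording as the sample output above (a value followed by "records/sec" and a value followed by "ms avg latency").

```shell
# Hypothetical helper: extract throughput and average latency from
# kafka-producer-perf-test.sh summary text read on stdin.
# Assumes the wording shown in the sample output above.
perf_summary() {
  awk '
    {
      # Scan every field so one-line and multi-line summaries both work.
      # If several progress lines are printed, the last one wins.
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^records\/sec/) rate = $(i - 1)
        if ($i == "ms" && $(i + 1) == "avg") lat = $(i - 1)
      }
    }
    END { printf "records_per_sec=%s avg_latency_ms=%s\n", rate, lat }
  '
}

# Example (run inside the cluster-api container):
#   /opt/kafka/bin/kafka-producer-perf-test.sh --topic correlated_flow_viz \
#     --num-records 1000 --record-size 1024 --throughput -1 \
#     --producer.config /root/adminclient.props | perf_summary
```

Running the filter on a few benchmark runs taken at different times can help show whether the slowness is persistent or intermittent.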
When producers are slow, messages can back up in the Latestflow pod and cause it to run out of memory. List the Latestflow pods with:
k -n nsxi-platform get pods | grep latestflow
The pods will show one or more restarts. The pod events or pod logs will contain a memory-related error message.
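The restart check can also be scripted. The sketch below is illustrative only: the helper name restarted_latestflow is hypothetical, it assumes the default column layout of "get pods" (NAME, READY, STATUS, RESTARTS, AGE), and it assumes that k in this article is a shell alias for kubectl.

```shell
# Hypothetical helper: read "get pods" output on stdin and print the name
# and restart count of every latestflow pod that has restarted at least once.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
restarted_latestflow() {
  awk '$1 ~ /latestflow/ && $4 + 0 > 0 { print $1, $4 }'
}

# Example:
#   kubectl -n nsxi-platform get pods | restarted_latestflow
#
# To read why a container was last terminated (for example OOMKilled),
# the pod status can be queried directly. Replace <pod name> with a name
# printed by the helper above:
#   kubectl -n nsxi-platform get pod <pod name> \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
```

An empty result from the helper means no Latestflow pod has restarted, which suggests the missing flows have a different cause.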
Output similar to the following can be found when describing the pod:
Labels
alertname = PodOOMKilled
container = latestflow
namespace = nsxi-platform
pod = latestflow-758bc5dfd5-6vkgx
reason = OOMKilled
severity = critical
uid = 5cc1ecd3-4ef5-4e78-80dd-9d9cc7fdcb9d
Annotations
description = Pod nsxi-platform/latestflow-758bc5dfd5-6vkgx container latestflow was terminated due to out-of-memory.
summary = Pod nsxi-platform/latestflow-758bc5dfd5-6vkgx was OOMKilled
Please contact Broadcom Support for further assistance.