Under high load (typically around 500k flows per 5 minutes, depending on the traffic being monitored), the llanta-detectors-0 pod may run out of memory (OOM), crash, and eventually end up in the CrashLoopBackOff status. Depending on the specific circumstances of the affected installation, the pod may recover automatically after some time or may remain in that error state.
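To confirm that the crashes are caused by OOM kills, you can inspect the container statuses of the pod. The following is a minimal sketch using the official Kubernetes Python client; the namespace name ("llanta") is an assumption and should be replaced with the namespace used in your installation. The same information is also visible in the output of `kubectl describe pod llanta-detectors-0`.

```python
# Sketch: check whether containers of llanta-detectors-0 were OOM-killed.
# Assumes the official "kubernetes" Python client and a reachable kubeconfig;
# the namespace "llanta" is a placeholder for your actual namespace.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="llanta-detectors-0", namespace="llanta")
for cs in pod.status.container_statuses or []:
    waiting = cs.state.waiting.reason if cs.state and cs.state.waiting else None
    last_term = (cs.last_state.terminated.reason
                 if cs.last_state and cs.last_state.terminated else None)
    print(f"{cs.name}: restarts={cs.restart_count}, "
          f"waiting={waiting}, last_termination={last_term}")
    # A container that was OOM-killed reports last_termination == "OOMKilled",
    # and a repeatedly crashing container shows waiting == "CrashLoopBackOff".
```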
We have identified two reasons that may lead to this situation.
First, under some conditions (depending on the particular traffic topology and mix), processing a high volume of flows may cause the llanta-detectors-0 pod to use more memory than the system's configuration allows. In this case, one or more of its containers end up in the OOM condition.
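In a Kubernetes deployment, the amount of memory each container is allowed to use typically corresponds to the memory limit in the pod specification. The sketch below, again using the Kubernetes Python client and the hypothetical "llanta" namespace, prints the configured requests and limits so they can be compared against the usage reported by your monitoring.

```python
# Sketch: print the memory requests/limits configured for each container of
# llanta-detectors-0. The namespace "llanta" is a placeholder.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="llanta-detectors-0", namespace="llanta")
for c in pod.spec.containers:
    res = c.resources
    mem_request = (res.requests or {}).get("memory") if res else None
    mem_limit = (res.limits or {}).get("memory") if res else None
    print(f"{c.name}: memory request={mem_request}, limit={mem_limit}")
    # A container that exceeds its memory limit is OOM-killed, which matches
    # the symptom described above.
```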
Second, under some conditions, when the llanta-detectors-0 pod (more precisely, its llanta-worker container) reads network flow records from the Kafka queuing system, it fetches too many records at once and runs out of memory. Note that in this case the only container in the OOM condition is llanta-worker.
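How much data a Kafka consumer pulls in per fetch is bounded by client-side settings such as fetch.max.bytes and max.partition.fetch.bytes, and by how many records are consumed per batch. The sketch below illustrates this general mechanism with a confluent-kafka consumer; the broker address, topic, group id, and numeric values are illustrative assumptions, not the actual llanta-worker configuration.

```python
# Sketch: a Kafka consumer whose per-fetch memory footprint is bounded.
# Broker address, topic, group id, and the numeric values are illustrative
# assumptions; they are not the actual llanta-worker settings.
from confluent_kafka import Consumer


def process_flow(payload: bytes) -> None:
    """Placeholder for the actual flow-record processing."""
    pass


consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "flow-workers",
    "auto.offset.reset": "earliest",
    # Upper bounds on how much data a single fetch request may return.
    "fetch.max.bytes": 16 * 1024 * 1024,           # 16 MiB per fetch
    "max.partition.fetch.bytes": 2 * 1024 * 1024,  # 2 MiB per partition
    # Cap the client-side prefetch queue so records do not pile up in memory.
    "queued.max.messages.kbytes": 65536,           # ~64 MiB of buffered records
})
consumer.subscribe(["network-flows"])

try:
    while True:
        # Consume at most 500 records at a time instead of draining the queue.
        records = consumer.consume(num_messages=500, timeout=1.0)
        for msg in records:
            if msg.error():
                continue  # handle/log errors as appropriate
            process_flow(msg.value())
finally:
    consumer.close()
```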