When the Horizontal Port Scan detector catches large scans across many destinations and many ports, the size of the resulting event destabilizes the NTA event pipeline and can cause pod restarts and missed events.
SSP 5.0.0 and SSP 5.1.0
The event processing pipeline does not properly handle the size of the detected event, which causes instability when processing other events. This can manifest in a variety of symptoms, most notably repeated restarts of the nta-server pod, as shown below.
1. Log in to the SSP CLI using sysadmin credentials on SSP 5.1, or root credentials on SSP 5.0.
2. Check the pod logs and symptoms as follows:
kubectl -n nsxi-platform get pods | grep nta-server
kubectl -n nsxi-platform logs <nta-server-pod-name>
Error snapshot from the nta-server pod log:
Error handler threw an exception
org.springframework.kafka.KafkaException: Seek to current after exception; nested exception is org.springframework.kafka.listener.ListenerExecutionFailedException: Listener method 'private void com.vmware.nsx.pace.anomalydetection.service.messaging.AnomalyKafkaListenerService.listen(java.lang.String)' threw exception; nested exception is org.springframework.kafka.KafkaException: Send failed; nested exception is org.apache.kafka.common.errors.RecordTooLargeException: The message is 1226218 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.; nested exception is org.springframework.kafka.KafkaException: Send failed; nested exception is org.apache.kafka.common.errors.RecordTooLargeException: The message is 1226218 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.
at org.springframework.kafka.listener.SeekUtils.seekOrRecover(SeekUtils.java:208)
at org.springframework.kafka.listener.DefaultErrorHandler.handleRemaining(DefaultErrorHandler.java:174)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeErrorHandler(KafkaMessageListenerContainer.java:2854)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doInvokeRecordListener(KafkaMessageListenerContainer.java:2722)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doInvokeWithRecords(KafkaMessageListenerContainer.java:2572)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeRecordListener(KafkaMessageListenerContainer.java:2448)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeListener(KafkaMessageListenerContainer.java:2078)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.invokeIfHaveRecords(KafkaMessageListenerContainer.java:1430)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:1394)
at org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:1291)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
The best way to verify the root cause is to disable the Horizontal Port Scan detector and check whether the nta-server pod stabilizes. If it does, wait 24 hours to confirm the issue is resolved.
Symptoms differ across setups and can include a Kafka "message too large" error, nta-server pod OOM terminations, and Kafka consumer lag. As an example, the pod state and logs can be inspected for additional information as shown below:
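The commands below are a minimal sketch, assuming the default nsxi-platform namespace; the pod names in angle brackets, the location of the kafka-consumer-groups.sh script inside the Kafka container, and the broker port are placeholders and vary by deployment.
# Check restart counts for the nta-server pod
kubectl -n nsxi-platform get pods | grep nta-server
# Check whether the last restart was an OOM kill (look for "OOMKilled" under Last State)
kubectl -n nsxi-platform describe pod <nta-server-pod-name> | grep -A 5 "Last State"
# Search the pod log for the oversized-message error
kubectl -n nsxi-platform logs <nta-server-pod-name> | grep RecordTooLargeException
# Check consumer lag for the NTA consumer groups (pod name, script path, and port are placeholders)
kubectl -n nsxi-platform exec -it <kafka-broker-pod-name> -- kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups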
Steps to disable the Horizontal Port Scan detector from the SSP UI:
Log in to SSP using admin credentials, navigate to System > Settings > Data Collection > Detector Activation, select Horizontal Port Scan, and click Deactivate the Detector.
If the nta-server pod stabilizes after disabling the Horizontal Port Scan detector, the scans are often caused by internal security scanners.
Workaround:
Add all of the scanning computes to the Horizontal Port Scan exclusion list. For details on how to add computes to the exclusion list, see the public documentation:
Managing the Suspicious Traffic Detector Definitions
Additionally, if the nta-server pod is restarting, it may not be possible to disable the detector from the UI. In that case, the detector can be disabled using kubectl and SQL commands, as sketched below.
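The outline below is a hypothetical sketch only: the Postgres pod name, database, user, and the detector configuration table and column names are assumptions, not the actual SSP schema. Obtain the exact commands from Broadcom Support before running any SQL against the platform database.
# Locate the Postgres pod in the nsxi-platform namespace (name varies by release)
kubectl -n nsxi-platform get pods | grep -i postgres
# Deactivate the detector with an UPDATE against a hypothetical detector configuration table
kubectl -n nsxi-platform exec -it <postgres-pod-name> -- psql -U <db-user> -d <detector-db> -c "UPDATE <detector_config_table> SET enabled = false WHERE detector_key = 'HORIZONTAL_PORT_SCAN';"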
NOTE: If the issue still persists after applying the workaround, or if you are unable to disable the detector from the SSP UI, contact Broadcom Support for resolution.
This issue will be fixed in a future release.