You are running NSX Application Platform (NAPP) with NSX Intelligence enabled. New flows do not appear in the NSX Intelligence UI: only flows from the last month are available, and no new flows are being collected or processed.
NSX Application Platform (NAPP) version 4.2.x
The Spark applications that process flows store data in MinIO, and this data was not properly cleaned up during an application restart.
The spark-app-rawflow-driver pod keeps restarting repeatedly and shows the errors below. The spark-app-rawflow sparkapp is in the SUBMISSION_FAILED state.
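To observe these symptoms from the NSX Manager, you can check the sparkapp and the driver pod directly. This is an optional check using the same napp-k wrapper and object names referenced elsewhere in this article:
napp-k get sparkapp spark-app-rawflow
napp-k get pods | grep spark-app-rawflow-driver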
SSH to the NSX Manager as root and run the command below:
napp-k logs spark-app-rawflow-driver
2025-01-26T13:25:05.431628493Z stdout F 2025-01-26T13:25:05.431Z INFO stream execution thread for rawflow_processing_query [id = 1cd8222e-15bd-48f8-9a8f-6481e0bd1e32, runId = 091dda04-81d4-4f51-9201-f109fcff46e7] MicroBatchExecution - Stream started from {KafkaV2[Subscribe[raw_flow]]: {"raw_flow":{"0":783573458,"1":783887580,"2":783685939,"3":186189823,"4":186194923,"5":186192819,"6":186087705}}}
2025-01-26T13:25:05.715729562Z stdout F 2025-01-26T13:25:05.715Z WARN task-result-getter-1 TaskSetManager - Lost task 4.0 in stage 14.0 (TID 238) (192.168.35.122 executor 5): java.lang.IllegalStateException: Cannot fetch offset 186087705 (GroupId: raw_flow_group, TopicPartition: raw_flow-6).
2025-01-26T13:25:05.715752614Z stdout F Some data may have been lost because they are not available in Kafka any more; either the
2025-01-26T13:25:05.715759898Z stdout F data was aged out by Kafka or the topic may have been deleted before all the data in the
2025-01-26T13:25:05.71576254Z stdout F topic was processed. If you don't want your streaming query to fail on such cases, set the
2025-01-26T13:25:05.71576523Z stdout F source option "failOnDataLoss" to "false".
2025-01-26T13:25:05.715767864Z stdout F
2025-01-26T13:25:05.715770652Z stdout F at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.org$apache$spark$sql$kafka010$consumer$KafkaDataConsumer$$reportDataLoss0(KafkaDataConsumer.scala:724)
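Optionally, you can confirm the current checkpoint setting before changing it. This is a sketch only; it assumes the checkpoint.path property is readable in the rawflow-override-properties ConfigMap that is edited in the resolution steps below:
napp-k get cm rawflow-override-properties -o yaml | grep -i checkpoint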
SSH to the NSX Manager as root and run the commands below (a consolidated sketch of the edits appears after the verification steps).
- napp-k edit cm rawflow-override-properties
- Change the checkpoint.path property to a new value, for example processing-checkpoints-new.
- napp-k delete pod spark-app-rawflow-driver to restart the application.
- If the application still does not restart, change driver->coreRequest by a small amount (1m). For example, if the initial request was 100m, increase it by 1m to 101m.
- napp-k edit sparkapp spark-app-rawflow and change driver->coreRequest by 1m.
- The application will be submitted again and will use the new checkpoint location.
-- Verify that the spark-app-rawflow sparkapp and the spark-app-rawflow-driver pod are running:
napp-k get sparkapp
napp-k get pods | grep spark-app-rawflow-driver
-- Verify that new flows can be seen in the NSX Intelligence UI.
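For reference, here is a sketch of what the two edits above might look like. Only the checkpoint.path property, the example value processing-checkpoints-new, and the driver coreRequest change come from the steps above; the exact property layout in the ConfigMap and the YAML path in the SparkApplication spec are assumptions and may differ in your environment.
# Inside "napp-k edit cm rawflow-override-properties" (key=value layout is an assumption):
checkpoint.path=processing-checkpoints-new
# Inside "napp-k edit sparkapp spark-app-rawflow" (YAML path assumed from the upstream SparkApplication CRD):
spec:
  driver:
    coreRequest: "101m"   # previous value (for example 100m) increased by 1m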
Note: The checkpoint.path value can only contain alphanumeric characters and hyphens. Using underscores or other special characters in the path will cause the raw flow pod services to crash.
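If you want to sanity-check a new checkpoint.path value against this rule before applying it, a minimal shell check (the value shown is the example name used in the resolution steps) is:
echo "processing-checkpoints-new" | grep -Eq '^[A-Za-z0-9-]+$' && echo "valid" || echo "invalid: use only alphanumeric characters and hyphens"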