You are running NSX Application Platform (NAPP) with NSX Intelligence enabled. New flows do not appear in the NSX Intelligence UI: only flows from the last month are available, and no new flows are being collected or processed.
NSX Application Platform (NAPP) version 4.2.x
The Spark applications that process flows store data in MinIO, and this data was not properly cleaned up during an application restart.
The spark-app-rawflow-driver pod keeps restarting repeatedly and shows the errors below. The spark-app-rawflow sparkapp is in the SUBMISSION_FAILED state.
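To observe these symptoms from the NSX Manager, you can check the sparkapp and the driver pod directly. This is an optional check using the same napp-k wrapper and object names referenced elsewhere in this article:
napp-k get sparkapp spark-app-rawflow
napp-k get pods | grep spark-app-rawflow-driver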
SSH to the NSX Manager as root and run the command below:
napp-k logs spark-app-rawflow-driver
2025-01-26T13:25:05.431628493Z stdout F 2025-01-26T13:25:05.431Z INFO stream execution thread for rawflow_processing_query [id = 1cd8222e-15bd-48f8-9a8f-6481e0bd1e32, runId = 091dda04-81d4-4f51-9201-f109fcff46e7] MicroBatchExecution - Stream started from {KafkaV2[Subscribe[raw_flow]]: {"raw_flow":{"0":783573458,"1":783887580,"2":783685939,"3":186189823,"4":186194923,"5":186192819,"6":186087705}}}
2025-01-26T13:25:05.715729562Z stdout F 2025-01-26T13:25:05.715Z WARN task-result-getter-1 TaskSetManager - Lost task 4.0 in stage 14.0 (TID 238) (192.168.35.122 executor 5): java.lang.IllegalStateException: Cannot fetch offset 186087705 (GroupId: raw_flow_group, TopicPartition: raw_flow-6).
2025-01-26T13:25:05.715752614Z stdout F Some data may have been lost because they are not available in Kafka any more; either the
2025-01-26T13:25:05.715759898Z stdout F data was aged out by Kafka or the topic may have been deleted before all the data in the
2025-01-26T13:25:05.71576254Z stdout F topic was processed. If you don't want your streaming query to fail on such cases, set the
2025-01-26T13:25:05.71576523Z stdout F source option "failOnDataLoss" to "false".
2025-01-26T13:25:05.715767864Z stdout F
2025-01-26T13:25:05.715770652Z stdout F at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer$.org$apache$spark$sql$kafka010$consumer$KafkaDataConsumer$$reportDataLoss0(KafkaDataConsumer.scala:724)
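Optionally, you can confirm the current checkpoint setting before changing it. This is a sketch only; it assumes the checkpoint.path property is readable in the rawflow-override-properties ConfigMap that is edited in the resolution steps below:
napp-k get cm rawflow-override-properties -o yaml | grep -i checkpoint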
SSH to the NSX Manager as root and run the commands below (a consolidated sketch of the edits appears after the verification steps).
- napp-k edit cm rawflow-override-properties
- Change the checkpoint.path property to a new value, for example processing-checkpoints-new.
- napp-k delete pod spark-app-rawflow-driver to restart the application.
- If the application still does not restart, change driver->coreRequest by a small amount (1m). For example, if the initial request was 100m, increase it by 1m to 101m.
- napp-k edit sparkapp spark-app-rawflow and change driver->coreRequest by 1m.
- The application will be submitted again and will use the new checkpoint location.
-- Verify that the spark-app-rawflow sparkapp and the spark-app-rawflow-driver pod are running:
napp-k get sparkapp
napp-k get pods | grep spark-app-rawflow-driver
-- Verify that new flows can be seen in the NSX Intelligence UI.
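For reference, here is a sketch of what the two edits above might look like. Only the checkpoint.path property, the example value processing-checkpoints-new, and the driver coreRequest change come from the steps above; the exact property layout in the ConfigMap and the YAML path in the SparkApplication spec are assumptions and may differ in your environment.
# Inside "napp-k edit cm rawflow-override-properties" (key=value layout is an assumption):
checkpoint.path=processing-checkpoints-new
# Inside "napp-k edit sparkapp spark-app-rawflow" (YAML path assumed from the upstream SparkApplication CRD):
spec:
  driver:
    coreRequest: "101m"   # previous value (for example 100m) increased by 1m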
Note: The checkpoint.path value can only contain alphanumeric characters and hyphens. Using underscores or other special characters in the path will cause the raw flow pod services to crash.
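If you want to sanity-check a new checkpoint.path value against this rule before applying it, a minimal shell check (the value shown is the example name used in the resolution steps) is:
echo "processing-checkpoints-new" | grep -Eq '^[A-Za-z0-9-]+$' && echo "valid" || echo "invalid: use only alphanumeric characters and hyphens"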