Unable to see network flows in NSX Intelligence

Article ID: 311849

Products

VMware NSX
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Symptoms:

  • Unable to see network flows in NSX Intelligence


The "napp-k get pods" output shows the "spark-app-rawflow-driver" pod in the Running state:

root:~# napp-k get pods
NAME                                 READY   STATUS      RESTARTS   AGE
[...]
spark-app-context-driver             2/2     Running     0          46h
spark-app-overflow-driver            2/2     Running     0          46h
spark-app-rawflow-driver             2/2     Running     0          55d
spark-job-manager-797c557885-d78jk   1/1     Running     0          98d
spark-operator-86655c8f9c-7qcj2      1/1     Running     0          98d
spark-operator-webhook-init-9j95t    0/1     Completed   0          98d
[...]
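
Because napp-k is used throughout this article as a kubectl-style wrapper, its output can be filtered with standard shell tools. A minimal example, assuming a root shell on the NSX Manager node (the grep filter is illustrative, not part of the official procedure):

# Show only the Spark application pods
napp-k get pods | grep spark-app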

Environment

VMware NSX-T
VMware NSX-T Data Center
VMware NSX-T Data Center 3.x

Cause

  • The pod is stuck in an error state due to Kafka termination issues. All pods in the Spark rawflow app show a failure connecting to the Kafka node kafka-1.
  • Running the command below as root on an NSX Manager node shows a Kafka exception:
napp-k logs spark-app-rawflow-driver -c spark-kubernetes-driver -n nsxi-platform

2023-01-10T10:19:57.116Z{UTC} WARN JobGenerator NetworkClient - [Consumer clientId=consumer-raw_flow_group-2, groupId=raw_flow_group] Error connecting to node kafka-1.kafka-headless.nsxi-platform.svc.cluster.local:9094 (id: 1 rack: null)
java.net.UnknownHostException: kafka-1.kafka-headless.nsxi-platform.svc.cluster.local: Name or service not known
        at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
        at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
        at java.net.InetAddress.getAllByName(InetAddress.java:1193)
        at java.net.InetAddress.getAllByName(InetAddress.java:1127)
        at org.apache.kafka.clients.ClientUtils.resolve(ClientUtils.java:110)
        at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.currentAddress(ClusterConnectionStates.java:403)
        at org.apache.kafka.clients.ClusterConnectionStates$NodeConnectionState.access$200(ClusterConnectionStates.java:363)
        at org.apache.kafka.clients.ClusterConnectionStates.currentAddress(ClusterConnectionStates.java:151)
        at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:958)
        at org.apache.kafka.clients.NetworkClient.access$600(NetworkClient.java:74)
        at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1131)
        at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeUpdate(NetworkClient.java:1019)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:542)
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265)
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:227)
        at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitMetadataUpdate(ConsumerNetworkClient.java:164)
        at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:247)
        at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:485)
        at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1268)
        at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1232)
        at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1165)
        at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.paranoidPoll(DirectKafkaInputDStream.scala:172)
        at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:191)
        at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:228)
        at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$3(DStream.scala:343)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$2(DStream.scala:343)
        at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:417)
        at org.apache.spark.streaming.dstream.DStream.$anonfun$getOrCompute$1(DStream.scala:342)
        at scala.Option.orElse(Option.scala:447)
        at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:335)
        at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
        at org.apache.spark.streaming.DStreamGraph.$anonfun$generateJobs$2(DStreamGraph.scala:123)
        at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
        at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
        at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
        at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
        at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
        at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:122)
        at org.apache.spark.streaming.scheduler.JobGenerator.$anonfun$generateJobs$1(JobGenerator.scala:252)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:250)
        at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:186)
        at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:91)
        at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:90)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
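
The UnknownHostException above means the Kafka broker's DNS name (kafka-1.kafka-headless.nsxi-platform.svc.cluster.local) is not resolving from the driver. As an optional check before restarting the driver, you can inspect the Kafka broker pods behind that name; this is a sketch, with the "kafka" pod name inferred from the hostname in the log rather than taken from official documentation:

# List the Kafka broker pods referenced by the failing DNS name
napp-k get pods | grep kafka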

Resolution

Engineering has implemented a mechanism that detects this error and restarts the app when it occurs. This is a known issue and has been fixed in NSX 3.2.1.1.

Workaround:
To work around this issue, delete the "spark-app-rawflow-driver" pod so that it is restarted:

1.  Log in to the CLI of the active NSX Manager node as root.
2.  Run the following command:

napp-k delete pod spark-app-rawflow-driver
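
Deleting the pod is safe; as described in the note at the end of this article, it restarts automatically. If you want to observe the recreation as it happens, you can optionally watch the pod list using the standard kubectl -w watch flag (press Ctrl+C to stop):

# Watch the rawflow driver pod terminate and get recreated
napp-k get pods -w | grep spark-app-rawflow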

3. Wait a few minutes, then check the pod status by running the command below from the NSX Manager root CLI.

napp-k get pods 

(Look for "spark-app-rawflow-driver" and confirm it returns to the Running state.)
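
Once the pod is back in the Running state, you can verify that the new driver no longer hits the Kafka resolution error by re-running the log command from the Cause section, filtered for the exception (the grep filter is illustrative):

# Confirm the UnknownHostException is no longer being logged
napp-k logs spark-app-rawflow-driver -c spark-kubernetes-driver -n nsxi-platform | grep -i unknownhost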

Note: The pod is stuck in an error state due to Kafka termination issues. Deleting the current pod causes it to restart and establish a new connection. A small number of flow records may be lost during the restart; this is not a concern, because the flow app was already in an error state and not processing flows.