NSX Intelligence flows are underreported or show up as UNCATEGORIZED in the Group View
Article ID: 319066
Products
VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention
Issue/Introduction
Symptoms: Flows in the Group View appear as UNCATEGORIZED, or flow ingestion has paused. In the latter case, a reduction in flows may also be seen in the Compute view.
There are a few ways to determine whether this issue has occurred:
1. Open the Druid console and check whether there is a "RUNNING" ingestion task for each supervisor.
SSH into the NSX Manager as the "root" user, then access the druid-overlord pod from the NSX Manager:
napp-k exec -it svc/druid-overlord -- bash
Inside the overlord pod, call the following API to check how many tasks are running:
curl -X GET 'https://localhost:8290/druid/indexer/v1/runningTasks' -k
In the output, search for the keyword "dataSource". At least one task should be running for each of the following datasources: correlated_flow, correlated_flow_viz, correlated_flow_rec, pace2druid_manager_realization_config, pace2druid_policy_intent_config.
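The per-datasource check above can be scripted. The sketch below runs the same membership test against a hypothetical, truncated sample of the runningTasks response (the SAMPLE variable is illustrative only); on a live system you would pipe the output of the curl command in instead of the sample.

```shell
#!/bin/sh
# SAMPLE is a hypothetical, truncated runningTasks response for illustration.
# On a live overlord pod, replace it with the output of:
#   curl -s -X GET 'https://localhost:8290/druid/indexer/v1/runningTasks' -k
SAMPLE='[{"id":"index_kafka_correlated_flow_abc","dataSource":"correlated_flow"},
{"id":"index_kafka_correlated_flow_viz_def","dataSource":"correlated_flow_viz"}]'

# Report which expected datasources have (or lack) a running task.
for ds in correlated_flow correlated_flow_viz correlated_flow_rec \
          pace2druid_manager_realization_config pace2druid_policy_intent_config; do
  if printf '%s' "$SAMPLE" | grep -q "\"dataSource\":\"$ds\""; then
    echo "OK: $ds"
  else
    echo "MISSING: $ds"
  fi
done
```

Any datasource reported as MISSING points to a supervisor whose ingestion task is not running.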
2. Search for the issue in the logs.
To get the logs, SSH into the NSX Manager as the "root" user. On the NSX Manager, run:
napp-k get pods --selector='app.kubernetes.io/component=druid.overlord'
You should find a pod whose name starts with "druid-overlord-", as in this example:
NAME                              READY   STATUS    RESTARTS   AGE
druid-overlord-7b6849f98b-n97xm   1/1     Running   1          11h
Then run the following command to check the logs:
napp-k logs <name of the druid overlord pod>
Example:
napp-k logs druid-overlord-7b6849f98b-n97xm
2022-08-25T20:31:17,944 INFO [KafkaSupervisor-correlated_flow-Worker-0] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Setting taskGroup sequences to [{0={6=37104640}}] for group [6]
2022-08-25T20:31:18,053 INFO [KafkaSupervisor-correlated_flow] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - [correlated_flow] supervisor is running.
In the logs, search for the string "durationSeconds=600, active=[{id='index_kafka_"; you should see log entries reporting the status of each supervisor.
(1) If you are seeing UNCATEGORIZED flows, search for: "durationSeconds=600, active=[{id='index_kafka_pace2druid_manager_realization_config"
(2) If you are not seeing any flows, search for: "durationSeconds=600, active=[{id='index_kafka_correlated_flow_viz"
There will be a task id inside the "active" field. Search the logs for that task id. A task should not stay in the active list for more than 10 minutes; if it persists for more than 20 minutes, the issue may be caused by the Druid overlord. In the example below, the same task "index_kafka_pace2druid_manager_realization_config_fb79e2e6d49f685_fpfhoaap" is still active after 2.5 hours:
2022-08-01T20:07:16.741076711Z stdout F 2022-08-01T20:07:16,740 INFO [KafkaSupervisor-pace2druid_manager_realization_config] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - {id='pace2druid_manager_realization_config', generationTime=2022-08-01T20:07:16.740Z, payload=KafkaSupervisorReportPayload{dataSource='pace2druid_manager_realization_config', topic='pace2druid_manager_realization_config', partitions=1, replicas=1, durationSeconds=600, active=[{id='index_kafka_pace2druid_manager_realization_config_fb79e2e6d49f685_fpfhoaap', startTime=null, remainingSeconds=null}], publishing=[], suspended=false, healthy=false, state=UNHEALTHY_SUPERVISOR, detailedState=UNABLE_TO_CONNECT_TO_STREAM, recentErrors=[ExceptionEvent{timestamp=2022-08-01T19:58:16.726Z, exceptionClass='org.apache.kafka.common.errors.TimeoutException', message='org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata'}, ExceptionEvent{timestamp=2022-08-01T19:59:16.727Z, exceptionClass='org.apache.kafka.common.errors.TimeoutException', message='org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata'}, 
ExceptionEvent{timestamp=2022-08-01T20:00:03.097Z, exceptionClass='org.apache.druid.java.util.common.ISE', message='org.apache.druid.java.util.common.ISE: No partitions found for stream [pace2druid_manager_realization_config]'}]}} 2022-08-01T22:47:16.743645745Z stdout F 2022-08-01T22:47:16,742 INFO [KafkaSupervisor-pace2druid_manager_realization_config] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - {id='pace2druid_manager_realization_config', generationTime=2022-08-01T22:47:16.742Z, payload=KafkaSupervisorReportPayload{dataSource='pace2druid_manager_realization_config', topic='pace2druid_manager_realization_config', partitions=1, replicas=1, durationSeconds=600, active=[{id='index_kafka_pace2druid_manager_realization_config_fb79e2e6d49f685_fpfhoaap', startTime=null, remainingSeconds=null}], publishing=[], suspended=false, healthy=true, state=RUNNING, detailedState=RUNNING, recentErrors=[ExceptionEvent{timestamp=2022-08-01T19:58:16.726Z, exceptionClass='org.apache.kafka.common.errors.TimeoutException', message='org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata'}, ExceptionEvent{timestamp=2022-08-01T19:59:16.727Z, exceptionClass='org.apache.kafka.common.errors.TimeoutException', message='org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata'}, ExceptionEvent{timestamp=2022-08-01T20:00:03.097Z, exceptionClass='org.apache.druid.java.util.common.ISE', message='org.apache.druid.java.util.common.ISE: No partitions found for stream [pace2druid_manager_realization_config]'}]}}
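Rather than eyeballing timestamps across log entries, the comparison can be sketched in shell. The LOG variable below holds two condensed, hypothetical lines modeled on the example above; on a real system you would grep the overlord pod's log output instead.

```shell
#!/bin/sh
# Sketch: detect a stuck indexing task by comparing the first and last log
# timestamps at which the same active task id appears.
# LOG is condensed sample data; in practice use:
#   LOG=$(napp-k logs <druid-overlord-pod> | grep "durationSeconds=600, active=")
LOG="2022-08-01T20:07:16 ... active=[{id='index_kafka_pace2druid_manager_realization_config_fb79e2e6d49f685_fpfhoaap', startTime=null}] ...
2022-08-01T22:47:16 ... active=[{id='index_kafka_pace2druid_manager_realization_config_fb79e2e6d49f685_fpfhoaap', startTime=null}] ..."

# Pull the task id out of the first "active" field.
task_id=$(printf '%s\n' "$LOG" | grep -o "active=\[{id='[^']*'" | head -n 1 | cut -d"'" -f2)
# First and last timestamps (first 19 characters of each matching line).
first=$(printf '%s\n' "$LOG" | grep "$task_id" | head -n 1 | cut -c1-19)
last=$(printf '%s\n' "$LOG" | grep "$task_id" | tail -n 1 | cut -c1-19)
echo "task: $task_id"
echo "first seen: $first"
echo "last seen:  $last"   # more than 20 minutes apart => overlord likely stuck
```

Here the task is seen from 20:07 to 22:47, about 2.5 hours, which is well past the 20-minute threshold described above.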
This is a known issue in NSX Intelligence 4.0.1.
Cause
This issue can occur after the druid, zookeeper, kafka, and postgres services all go down at the same time, for example after an outage or errors in the Kubernetes cluster.
Resolution
This issue will be resolved in a later release. In the meantime, use the following workaround.
Workaround: Restart the druid-overlord pod in the nsxi-platform namespace by deleting it with the following command:
napp-k delete pod <name of the druid overlord pod>
Example:
napp-k delete pod druid-overlord-7b6849f98b-n97xm
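The pod-name lookup and the delete can be chained so the name is never copied by hand. The sketch below extracts the pod name from sample get-pods output (hypothetical, copied from the example earlier in this article); the same awk expression works on the live command's output.

```shell
#!/bin/sh
# PODS is sample output for illustration; on a live system use:
#   PODS=$(napp-k get pods --selector='app.kubernetes.io/component=druid.overlord')
PODS='NAME                              READY   STATUS    RESTARTS   AGE
druid-overlord-7b6849f98b-n97xm   1/1     Running   1          11h'

# Skip the header row and take the first column (the pod name).
pod=$(printf '%s\n' "$PODS" | awk 'NR>1 {print $1}')
echo "napp-k delete pod $pod"   # the command to run against the platform
```

Because the pod is managed by a Deployment, Kubernetes recreates it automatically after the delete; verify with the same napp-k get pods command.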