Plan and Troubleshoot page is not updated with new flow data

Article ID: 321152

Updated On: 09-18-2024

Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

This article provides basic troubleshooting steps to determine whether the root cause is Kafka consumer lag, along with basic remediation steps to address the issue.

Symptoms:
  • In some cases, the processing pipeline can be slow and, as a result, no network flows are shown in the 1-hour visualization view. This can be caused by lag in consuming messages from the Kafka raw_flow topic.
  • The same root cause may also lead to an out-of-memory condition in the recommendation job, causing it to fail.
  • Recommendation jobs fail due to too many config updates in Druid.


Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 3.x

Cause

This issue occurs due to:
  1. The number of flows in the customer's environment is higher than the limit supported by the system.
  2. Too many config updates from NSX can make DB lookups slow, eventually slowing down the flow pipeline or causing recommendation jobs to run out of memory.

Resolution

This is a known issue affecting VMware NSX Intelligence 1.2.0.

Currently, there is no resolution.

Workaround:
If recommendation jobs are failing, run this command:

/opt/druid/bin/dsql -e "select config_type, count(config_type) from pace2druid_manager_realization_config group by config_type"

If the result shows a significantly high VM count (for example, 10 to 20 times the number of VMs in the system), as in the example below, then proceed to the workaround after the "Restart the nsx-config service, using this command" section below:
 

config_type            EXPR$1
MANAGER_DFW_RULE       17864
MANAGER_IP_SET         4
MANAGER_SERVICE        1424
MANAGER_SERVICE_GROUP  216
NS_GROUP               1752
PHYSICAL_SERVER        4004
TRANSPORT_NODE         4412
VM                     115350  => VM data has been sent more than 20 times in the last 5 days.


After the workaround steps below are completed, wait 5 minutes and run the dsql command again. If the count is still high, repeat the steps with "period": "P8D" in payload.json changed to "period": "P2D".
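
For reference, the adjusted /tmp/payload.json from step 3 below would then look like this (a sketch; only the period value changes, the rest of the structure is the same as in step 3):

[
    {
        "period": "P2D",
        "includeFuture": true,
        "type": "loadByPeriod"
    },
    {
        "type": "dropForever"
    }
]

Re-post it with the same two curl commands from step 3.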

After that, verify the recommendation is working.

Log in to the NSX Intelligence appliance as the root user and execute the following command: 

/opt/kafka_2.12-2.6.0/bin/kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --command-config /opt/kafka_2.12-2.6.0/config/kafka_adminclient.props --group raw_flow_group --describe

One example output:

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG            CONSUMER-ID                                                    HOST            CLIENT-ID
raw_flow_group  raw_flow        0          102068605       102703093       634488          consumer-raw_flow_group-2-########-####-####-####-########5c24 /10.20.0.20     consumer-raw_flow_group-2
raw_flow_group  raw_flow        1          101821273       102451659       630386          consumer-raw_flow_group-2-########-####-####-####-########5c24 /10.20.0.20     consumer-raw_flow_group-2
raw_flow_group  raw_flow        2          102074779       102703920       629141          consumer-raw_flow_group-2-########-####-####-####-########5c24 /10.20.0.20     consumer-raw_flow_group-2


Check the numbers under the "LAG" column in the output of the command. If the lag is large and keeps increasing, next check the input rate to determine whether it exceeds the maximum expected rate.
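
To get a quick total of the lag across all partitions, you can sum the LAG column from a single run of the command; this is only a sketch and assumes the column layout shown in the example output above:

/opt/kafka_2.12-2.6.0/bin/kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --command-config /opt/kafka_2.12-2.6.0/config/kafka_adminclient.props --group raw_flow_group --describe 2>/dev/null | awk '$2 == "raw_flow" { total += $6 } END { print "total raw_flow lag:", total }'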

  1. Run the following command to identify the input rate of network flows: 
    grep "RATE_MONITOR" /var/log/syslog | grep "raw_flow"

    An example line in the output is listed below:

    2020-12-02T18:07:36.240Z sk4-pace NSX 6089 - [nsx@6876 comp="nsx-intelligence" subcomp="python" username="root" level="DEBUG"] RATE_MONITOR Current input rate for raw_flow_group-raw_flow is 70.81710765333615, max expected rate is 333

    If, on average, the input rate is larger than the maximum expected rate, the number of network flows has exceeded the maximum capacity of the system. Operating NSX Intelligence beyond the scale limitation is not supported. (A sketch for computing the average rate from these log lines is included after the final step below.)

    Otherwise, proceed with the workaround below.
     
  2. Restart the nsx-config service, using this command:

    systemctl restart nsx-config

    Wait up to 10 minutes for the NSX config sync to complete. Alternatively, you can proactively check whether the NSX config sync has completed by looking for the following lines in /var/log/pace/nsx-config.log.

    2020-12-02 18:19:55,676 INFO c.v.n.p.n.s.PolicyFullSyncHandleService [UpdLnr-1] INTELLIGENCE [nsx@6876 comp="nsx-intelligence" level="INFO" subcomp="manager"] Setting fullsync handler back to initial stage
    2020-12-02 18:20:20,481 INFO c.v.n.p.n.s.ProtonFullSyncHandleService [UpdLnr-1] INTELLIGENCE [nsx@6876 comp="nsx-intelligence" level="INFO" subcomp="manager"] Setting fullsync handler back to initial stage

     
  3. Change working directory to /tmp, and create file /tmp/payload.json.

    cd /tmp
    touch /tmp/payload.json


    Then add the following content to the /tmp/payload.json file.

    [
        {
            "period": "P8D",
            "includeFuture": true,
            "type": "loadByPeriod"
        },
        {
            "type": "dropForever"
        }
    ]


    Save and exit the file. Then run these two commands:

    curl -i -X POST -H 'Content-type: application/json' -d @payload.json http://localhost:8081/druid/coordinator/v1/rules/pace2druid_policy_intent_config
    curl -i -X POST -H 'Content-type: application/json' -d @payload.json http://localhost:8081/druid/coordinator/v1/rules/pace2druid_manager_realization_config


    If both of them return 200 OK, proceed to the next step.
     
  4. Change the working directory to /etc/cron.d and create the nsx_config_periodic_sync file:

    cd /etc/cron.d
    touch nsx_config_periodic_sync


    Add the following line to the /etc/cron.d/nsx_config_periodic_sync file.

    0 0 * * 0 root systemctl restart nsx-config

    Then execute this command to restart the cron job:

    systemctl restart cron
     
  5. If the NSX Intelligence appliance is the small form factor, skip this step. If the appliance is the large form factor, open the file /opt/vmware/pace/spark/processing-start-rawflow-executor-memory.sh.

    Change

    EXECUTOR_MEMORY_CONF="--conf spark.executor.memory"=2g\ "--conf spark.ui.port"=4040

    to

    EXECUTOR_MEMORY_CONF="--conf spark.executor.memory"=4g\ "--conf spark.ui.port"=4040

    Then save and exit the file.
     
  6. Finally, restart the processing service:

    systemctl restart processing

    After those procedures are done, continue to check the Kafka lag for a while and make sure it is not increasing over time, for example with the sketch below.
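
    The following is only a sketch of that ongoing check: it re-runs the consumer group command every 5 minutes and prints a timestamped lag total, assuming the same column layout as the example output earlier in this article.

    # Print the total raw_flow lag every 5 minutes for about an hour.
    for i in $(seq 1 12); do
        total=$(/opt/kafka_2.12-2.6.0/bin/kafka-consumer-groups.sh \
            --bootstrap-server 127.0.0.1:9092 \
            --command-config /opt/kafka_2.12-2.6.0/config/kafka_adminclient.props \
            --group raw_flow_group --describe 2>/dev/null \
            | awk '$2 == "raw_flow" { sum += $6 } END { print sum }')
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) total raw_flow lag: $total"
        sleep 300
    done

    If the total keeps trending up, revisit the input rate check in step 1.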
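
For the input rate check in step 1, a quick way to compute the average input rate from the RATE_MONITOR lines is to extract the reported value and average it. The sed pattern below is only a sketch and assumes the exact log format shown in the example line in step 1:

grep "RATE_MONITOR" /var/log/syslog | grep "raw_flow" | sed -n 's/.*Current input rate for raw_flow_group-raw_flow is \([0-9.]*\),.*/\1/p' | awk '{ sum += $1; n++ } END { if (n > 0) printf "average input rate over %d samples: %.2f\n", n, sum/n }'

Compare the printed average against the "max expected rate" value reported in the same log lines.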