Plan and Troubleshoot page is not updated with new flow data
Article ID: 321152
Updated On: 09-18-2024
Products
VMware vDefend Firewall with Advanced Threat Prevention
Issue/Introduction
This article offers basic troubleshooting steps to determine whether the root cause is Kafka consumer lag. It also offers basic remediation steps to address this issue.
Symptoms:
In some cases, the processing pipeline can be slow and, as a result, no network flows are shown in the 1-hour visualization view. This can be caused by consumer lag on the Kafka raw_flow topic.
The same root cause may lead to an out-of-memory condition in the recommendation job, causing it to fail.
Recommendation jobs can also fail due to too many config updates in Druid.
Environment
VMware NSX-T Data Center
VMware NSX-T Data Center 3.x
Cause
This issue occurs due to:
The number of flows in the customer's environment is higher than the limit supported by the system.
Too many config updates from NSX can make database lookups slow, eventually slowing down the flow pipeline or causing recommendation jobs to run out of memory.
Resolution
This is a known issue affecting VMware NSX Intelligence 1.2.0.
Currently, there is no resolution.
Workaround: If recommendation jobs are failing, run this command:
/opt/druid/bin/dsql -e "select config_type, count(config_type) from pace2druid_manager_realization_config group by config_type"
If the result shows a significantly high VM count (for example, 10 to 20 times the number of VMs in the system), as in the example below, then proceed to the workaround starting at the "Restart the nsx-config service, using this command" step:
config_type               EXPR$1
MANAGER_DFW_RULE          17864
MANAGER_IP_SET            4
MANAGER_SERVICE           1424
MANAGER_SERVICE_GROUP     216
NS_GROUP                  1752
PHYSICAL_SERVER           4004
TRANSPORT_NODE            4412
VM                        115350   => VM data was sent more than 20 times in the last 5 days.
After those steps are done, wait for 5 minutes and run the command again. If the count is still high, redo the steps with "period": "P8D" in payload.json changed to "period": "P2D".
After that, verify that recommendations are working.
Log in to the NSX Intelligence appliance as the root user and check the consumer lag for the Kafka raw_flow topic (see the sketch below).
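The exact command is not included in this article. As an illustration only, the standard Kafka consumer-groups CLI reports per-partition lag; the script path, broker address, and consumer group name below are assumptions and may differ on the appliance:
# Illustrative sketch (assumed paths and names), not taken from this article:
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group raw_flow_group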
Check the numbers in the "LAG" column of the output. If the numbers are large and keep increasing, next check the input rate to determine whether it is higher than the maximum expected rate.
Run the following command to identify the input rate of network flows:
grep "RATE_MONITOR" /var/log/syslog | grep "raw_flow"
An example line in the output is listed below:
2020-12-02T18:07:36.240Z sk4-pace NSX 6089 - [nsx@6876 comp="nsx-intelligence" subcomp="python" username="root" level="DEBUG"] RATE_MONITOR Current input rate for raw_flow_group-raw_flow is 70.81710765333615, max expected rate is 333
If, on average, the input rate is higher than the maximum expected rate, the number of network flows has exceeded the maximum capacity of the system. Operating NSX Intelligence beyond its scale limits is not supported.
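To estimate the average input rate from the log, a sketch like the following can be used (it assumes the log line format shown in the example above):
# Sketch: average the "Current input rate ... is <value>," samples for raw_flow.
awk '/RATE_MONITOR/ && /raw_flow/ {
  for (i = 1; i < NF; i++) if ($i == "is") { gsub(",", "", $(i + 1)); sum += $(i + 1); n++; break }
} END {
  if (n) printf "average input rate over %d samples: %.2f\n", n, sum / n
}' /var/log/syslog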
Otherwise, proceed with the workaround below.
Restart the nsx-config service, using this command:
systemctl restart nsx-config
Wait up to 10 minutes for the NSX config sync to complete. Alternatively, you can proactively check whether the NSX config sync has completed by checking whether the following lines appear in /var/log/pace/nsx-config.log.
2020-12-02 18:19:55,676 INFO c.v.n.p.n.s.PolicyFullSyncHandleService [UpdLnr-1] INTELLIGENCE [nsx@6876 comp="nsx-intelligence" level="INFO" subcomp="manager"] Setting fullsync handler back to initial stage
2020-12-02 18:20:20,481 INFO c.v.n.p.n.s.ProtonFullSyncHandleService [UpdLnr-1] INTELLIGENCE [nsx@6876 comp="nsx-intelligence" level="INFO" subcomp="manager"] Setting fullsync handler back to initial stage
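As a quick check, a grep like the following (a sketch based on the messages above) shows whether both handlers have reset:
grep "Setting fullsync handler back to initial stage" /var/log/pace/nsx-config.log | tail -n 2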
Change working directory to /tmp, and create file /tmp/payload.json.
cd /tmp
touch /tmp/payload.json
Then add the following content to the /tmp/payload.json file.
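The exact payload is not preserved in this article. Based on the "period": "P8D" reference earlier and the standard format of Druid coordinator retention rules, the payload is likely shaped like the following sketch; the rule types and replicant settings here are assumptions:
[
  {
    "type": "loadByPeriod",
    "period": "P8D",
    "includeFuture": true,
    "tieredReplicants": { "_default_tier": 1 }
  },
  { "type": "dropForever" }
]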
Save and exit the file. Then run these two commands:
curl -i -X POST -H 'Content-type: application/json' -d @payload.json http://localhost:8081/druid/coordinator/v1/rules/pace2druid_policy_intent_config
curl -i -X POST -H 'Content-type: application/json' -d @payload.json http://localhost:8081/druid/coordinator/v1/rules/pace2druid_manager_realization_config
If both of them return 200 OK, proceed to the next step.
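Optionally, the applied retention rules can be read back with a GET against Druid's standard coordinator rules endpoint (a quick sketch):
curl http://localhost:8081/druid/coordinator/v1/rules/pace2druid_policy_intent_config
curl http://localhost:8081/druid/coordinator/v1/rules/pace2druid_manager_realization_config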
Change working directory to /etc/cron.d. Create nsx_config_periodic_sync file:
cd /etc/cron.d
touch nsx_config_periodic_sync
Add the following line to the /etc/cron.d/nsx_config_periodic_sync file (it restarts the nsx-config service at midnight every Sunday).
0 0 * * 0 root systemctl restart nsx-config
Then execute this command to restart the cron service.
systemctl restart cron
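Optionally, confirm the cron entry exists and the cron service is running (a quick check):
cat /etc/cron.d/nsx_config_periodic_sync
systemctl status cron --no-pager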
If the NSX Intelligence appliance is the small form factor, skip this step. If the appliance is the large form factor, open the file /opt/vmware/pace/spark/processing-start-rawflow-executor-memory.sh.