Plan and Troubleshoot page is not updated with new flow data

Article ID: 321152

Updated On: 09-18-2024

Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

This article provides basic troubleshooting steps to determine whether the root cause is Kafka consumer lag, along with basic remediation steps to address the issue.

Symptoms:
  • In some cases, the processing pipeline can be slow and, as a result, no network flows are shown in the 1-hour visualization view. This can be caused by lag in consuming messages from the Kafka raw_flow topic.
  • The same root cause may also lead to an out-of-memory condition in the recommendation job, causing it to fail.
  • Recommendation jobs fail due to too many config updates in Druid.


Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 3.x

Cause

This issue occurs due to:
  1. The number of flows in the customer's environment is higher than the limit supported by the system.
  2. Too many config updates from NSX can make DB lookups slow, eventually slowing down the flow pipeline or causing recommendation jobs to run out of memory.

Resolution

This is a known issue affecting VMware NSX Intelligence 1.2.0.

Currently, there is no resolution.

Workaround:
If recommendation jobs are failing, run this command:

/opt/druid/bin/dsql -e "select config_type, count(config_type) from pace2druid_manager_realization_config group by config_type"

If the result shows a significantly high VM count (for example, 10 to 20 times the number of VMs in the system), as in the example below, then proceed to the workaround after the "Restart the nsx-config service, using this command" section below:
 

config_type            EXPR$1
MANAGER_DFW_RULE       17864
MANAGER_IP_SET         4
MANAGER_SERVICE        1424
MANAGER_SERVICE_GROUP  216
NS_GROUP               1752
PHYSICAL_SERVER        4004
TRANSPORT_NODE         4412
VM                     115350  => VM data has been sent more than 20 times in the last 5 days.


After the workaround steps below are completed, wait 5 minutes and run the dsql command again. If the count is still high, repeat the steps with "period": "P8D" in payload.json changed to "period": "P2D".
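
For reference, the adjusted /tmp/payload.json from step 3 below would then look like this (a sketch; only the period value changes, the rest of the structure is the same as in step 3):

[
    {
        "period": "P2D",
        "includeFuture": true,
        "type": "loadByPeriod"
    },
    {
        "type": "dropForever"
    }
]

Re-post it with the same two curl commands from step 3.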

After that, verify the recommendation is working.

Log in to the NSX Intelligence appliance as the root user and execute the following command: 

/opt/kafka_2.12-2.6.0/bin/kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --command-config /opt/kafka_2.12-2.6.0/config/kafka_adminclient.props --group raw_flow_group --describe

One example output:

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG            CONSUMER-ID                                                    HOST            CLIENT-ID
raw_flow_group  raw_flow        0          102068605       102703093       634488          consumer-raw_flow_group-2-########-####-####-####-########5c24 /10.20.0.20     consumer-raw_flow_group-2
raw_flow_group  raw_flow        1          101821273       102451659       630386          consumer-raw_flow_group-2-########-####-####-####-########5c24 /10.20.0.20     consumer-raw_flow_group-2
raw_flow_group  raw_flow        2          102074779       102703920       629141          consumer-raw_flow_group-2-########-####-####-####-########5c24 /10.20.0.20     consumer-raw_flow_group-2


Check the numbers under the "LAG" column in the output of the command. If the lag is large and keeps increasing, next check the input rate to determine whether it exceeds the maximum expected rate.
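
To get a quick total of the lag across all partitions, you can sum the LAG column from a single run of the command; this is only a sketch and assumes the column layout shown in the example output above:

/opt/kafka_2.12-2.6.0/bin/kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --command-config /opt/kafka_2.12-2.6.0/config/kafka_adminclient.props --group raw_flow_group --describe 2>/dev/null | awk '$2 == "raw_flow" { total += $6 } END { print "total raw_flow lag:", total }'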

  1. Run the following command to identify the input rate of network flows: 
    grep "RATE_MONITOR" /var/log/syslog | grep "raw_flow"

    An example line in the output is listed below:

    2020-12-02T18:07:36.240Z sk4-pace NSX 6089 - [nsx@6876 comp="nsx-intelligence" subcomp="python" username="root" level="DEBUG"] RATE_MONITOR Current input rate for raw_flow_group-raw_flow is 70.81710765333615, max expected rate is 333

    If, on average, the input rate is larger than the maximum expected rate, the number of network flows has exceeded the maximum capacity of the system. Operating NSX Intelligence beyond the scale limitation is not supported. (A sketch for computing the average rate from these log lines is included after the final step below.)

    Otherwise, proceed with the workaround below.
     
  2. Restart the nsx-config service, using this command:

    systemctl restart nsx-config

    Wait up to 10 minutes for the NSX config sync to complete. Alternatively, you can proactively check whether the NSX config sync has completed by looking for the following lines in /var/log/pace/nsx-config.log.

    2020-12-02 18:19:55,676 INFO c.v.n.p.n.s.PolicyFullSyncHandleService [UpdLnr-1] INTELLIGENCE [nsx@6876 comp="nsx-intelligence" level="INFO" subcomp="manager"] Setting fullsync handler back to initial stage
    2020-12-02 18:20:20,481 INFO c.v.n.p.n.s.ProtonFullSyncHandleService [UpdLnr-1] INTELLIGENCE [nsx@6876 comp="nsx-intelligence" level="INFO" subcomp="manager"] Setting fullsync handler back to initial stage

     
  3. Change working directory to /tmp, and create file /tmp/payload.json.

    cd /tmp
    touch /tmp/payload.json


    Then add the following content to the /tmp/payload.json file.

    [
        {
            "period": "P8D",
            "includeFuture": true,
            "type": "loadByPeriod"
        },
        {
            "type": "dropForever"
        }
    ]


    Save and exit the file. Then run these two commands:

    curl -i -X POST -H 'Content-type: application/json' -d @payload.json http://localhost:8081/druid/coordinator/v1/rules/pace2druid_policy_intent_config
    curl -i -X POST -H 'Content-type: application/json' -d @payload.json http://localhost:8081/druid/coordinator/v1/rules/pace2druid_manager_realization_config


    If both of them return 200 OK, proceed to the next step.
     
  4. Change the working directory to /etc/cron.d and create the nsx_config_periodic_sync file:

    cd /etc/cron.d
    touch nsx_config_periodic_sync


    Add the following line to the /etc/cron.d/nsx_config_periodic_sync file.

    0 0 * * 0 root systemctl restart nsx-config

    Then execute this command to restart the cron job:

    systemctl restart cron
     
  5. If the NSX Intelligence appliance is the small form factor, skip this step. If the appliance is the large form factor, open the file /opt/vmware/pace/spark/processing-start-rawflow-executor-memory.sh.

    Change

    EXECUTOR_MEMORY_CONF="--conf spark.executor.memory"=2g\ "--conf spark.ui.port"=4040

    to

    EXECUTOR_MEMORY_CONF="--conf spark.executor.memory"=4g\ "--conf spark.ui.port"=4040

    Then save and exit the file.
     
  6. Finally, restart the processing service:

    systemctl restart processing

    After those procedures are done, continue to check the Kafka lag for a while and make sure it is not increasing over time, for example with the sketch below.
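
    The following is only a sketch of that ongoing check: it re-runs the consumer group command every 5 minutes and prints a timestamped lag total, assuming the same column layout as the example output earlier in this article.

    # Print the total raw_flow lag every 5 minutes for about an hour.
    for i in $(seq 1 12); do
        total=$(/opt/kafka_2.12-2.6.0/bin/kafka-consumer-groups.sh \
            --bootstrap-server 127.0.0.1:9092 \
            --command-config /opt/kafka_2.12-2.6.0/config/kafka_adminclient.props \
            --group raw_flow_group --describe 2>/dev/null \
            | awk '$2 == "raw_flow" { sum += $6 } END { print sum }')
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) total raw_flow lag: $total"
        sleep 300
    done

    If the total keeps trending up, revisit the input rate check in step 1.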
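
For the input rate check in step 1, a quick way to compute the average input rate from the RATE_MONITOR lines is to extract the reported value and average it. The sed pattern below is only a sketch and assumes the exact log format shown in the example line in step 1:

grep "RATE_MONITOR" /var/log/syslog | grep "raw_flow" | sed -n 's/.*Current input rate for raw_flow_group-raw_flow is \([0-9.]*\),.*/\1/p' | awk '{ sum += $1; n++ } END { if (n > 0) printf "average input rate over %d samples: %.2f\n", n, sum/n }'

Compare the printed average against the "max expected rate" value reported in the same log lines.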