NSX Application Platform (NAPP) disk usage increases with generated alarm 'Analytics and Data Storage disk usage is growing faster than expected' and "Lag in Messaging"


Article ID: 319828


Updated On:

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Symptoms:
  • You are running NAPP.
  • The NSX-T UI presents an alarm:
Analytics and Data Storage disk usage is growing faster than expected
  • If NSX Application Platform estimates that the disks won't be able to store flows for 30 days, an alarm will be raised in the NSX-T UI:
Analytics and Data Storage is expected to be full in {predicted_full_period} days, which is lower than the current data retention period of {current_retention_period} days.
  • An Over Flow lag alarm is observed: "The number of pending messages in the messaging topic Over Flow is above the pending message threshold of 100000."
  • A Raw Flow lag alarm is observed: "The number of pending messages in the messaging topic Raw Flow is above the pending message threshold of 100000."
 


Environment

VMware NSX-T Data Center 4.x
VMware NSX-T Data Center
VMware NSX-T

Cause

NAPP assumes that the traffic flows being analyzed will follow some patterns and therefore allow a certain degree of aggregation. When there is too much uniqueness in the flows, for example too many unique IPs or ports, the data is not compacted efficiently. This results in disk usage growing faster than expected.
When NAPP calculates that the disk will not be able to store flows for 30 days, the alarm is raised.

Resolution

This is a known issue impacting NAPP.
 

If the Over Flow lag alarm is observed ("The number of pending messages in the messaging topic Over Flow is above the pending message threshold of 100000.") and NAPP was upgraded to 4.2 with the system scaled out prior to the upgrade, run the following on the NSX Manager (a combined command sketch follows the steps):

  1. Run the following command: napp-k get po | grep rawflowcorrelator | wc -l
  2. Copy the result. It is the number of rawflowcorrelator pods that are running.
  3. Run the following command, replacing the partition count with the result from the previous command:
    napp-k exec -it svc/cluster-api -c cluster-api -- /bin/bash -c '/opt/kafka/bin/kafka-topics.sh --bootstrap-server kafka:9092 --command-config /root/adminclient.props --alter --topic pairable_flow --partitions <number from above command>'
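
For convenience, steps 1 to 3 can be combined into a single shell sequence. This is a minimal sketch, assuming the rawflowcorrelator pod count maps one-to-one to the desired partition count; note that the inner command uses double quotes so the shell variable expands:

  # Count the running rawflowcorrelator pods and reuse the result as the partition count.
  PARTITIONS=$(napp-k get po | grep rawflowcorrelator | wc -l)
  echo "Setting pairable_flow partitions to ${PARTITIONS}"
  napp-k exec -it svc/cluster-api -c cluster-api -- /bin/bash -c "/opt/kafka/bin/kafka-topics.sh --bootstrap-server kafka:9092 --command-config /root/adminclient.props --alter --topic pairable_flow --partitions ${PARTITIONS}"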

If the alarm still persists after 3 hours, then:

  1. Go to the Kafka lag metrics under Intelligence -> Metrics on the System Overview page.
  2. In the chart for Kafka lag metrics, select the topics over_flow and pairable_flow and observe the chart.
  3. If the lag trend is not increasing and is more or less constant, you can raise the alarm threshold to a higher value to avoid spurious alarms. (A CLI alternative for checking the lag is sketched after these steps.)
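
If you prefer to check the lag from the command line instead of the UI, the Kafka consumer-group tooling in the cluster-api container can be used. This is a minimal sketch, assuming the same Kafka tool path and admin client config used in the partition command above:

  # Describe all consumer groups and filter for the topics of interest; the LAG column shows pending messages.
  napp-k exec svc/cluster-api -c cluster-api -- /bin/bash -c "/opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server kafka:9092 --command-config /root/adminclient.props --describe --all-groups" | grep -E 'over_flow|pairable_flow'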
 

Workaround:


There are several options that can be used to help alleviate the issue.
Option 1: Configure Data Collection in NSX Intelligence

If you can identify the ESXi hosts and vSphere clusters with mostly East-West (EW) traffic, for example over 90% of traffic is EW and 10% is North-South (NS), you can enable data collection for those EW hosts and clusters first and gradually enable it for the NS ones. North-south traffic tends to have more unique IPs, which is more likely to adversely affect the data compaction.

This will help alleviate the high storage growth, while other tuning options are explored below.

Procedure:

By default, NSX Intelligence collects network traffic data on all standalone hosts and clusters of hosts. If necessary, you can optionally stop data collection from a standalone host or cluster of hosts.

  1. From your browser, log in with Enterprise Administrator privileges to an NSX Manager at https://<nsx-manager-ip-address>.
  2. In the NSX Manager UI, select System and in the Settings section, select NSX Intelligence.
  3. To manage traffic data collection for one or more hosts, perform one of the following steps.
    The system updates the Collection Status value for each affected host to Deactivated or Activated, depending on the data collection mode you had set.
    1. To stop traffic data collection, select the host or hosts in the Standalone Host section, click Deactivate, and click Confirm when prompted if you are sure.
    2. To start traffic data collection, select the host or hosts, click Activate, and click Confirm when prompted if you are sure.
  4. To manage traffic data collection for one or more clusters of hosts, perform one of the following steps.
    1. To stop data collection for one or more clusters, select the cluster or clusters in the Cluster section, click Deactivate, and click Confirm when prompted if you are sure.
    2. To start traffic data collection, select the cluster or clusters, click Activate, and click Confirm when prompted if you are sure.

For reference please review the following guide: https://techdocs.broadcom.com/us/en/vmware-security-load-balancing/vdefend/security-intelligence/4-0/activating-and-upgrading-vmware-nsx-intelligence.html

Option 2: Filter out broadcast and/or multicast flows.
Note: This option can be used where broadcast and/or multicast flows are not required for security policy or similar guidance. If broadcast and/or multicast flows are important to you, do not enable this option.

You can disable broadcast and/or multicast flows from getting stored in NSX Intelligence to reduce disk usage.

This will only affect new flows which are not yet processed by NSX Intelligence. Existing broadcast/multicast flows will still be visible, until the retention period (30 days) is reached.

The following steps can be used together or by themselves; a verification sketch follows them.

  1. Disable broadcast and multicast flows at hosts
    • Login as "root" on the NSX Manager and run the following commands. Enter the password for NSX Manager when prompted.
      • curl -X PATCH 'https://<nsx-manager-ip-address>/policy/api/v1/infra/sites/default/intelligence/transport-node-profile' -H 'Content-Type: application/json' -H 'Accept: application/json' -d '{"flow_exclusion_filter": [{"type": "BCAST"},{"type": "MCAST"}]}' -k -uadmin
  2. Disable broadcast and multicast flows on NSX Intelligence
    1. For environments of NAPP versions older than 4.2.
      1. Login as "root" on the NSX Manager and perform the following steps.
        • Obtain the configurations for raw flow processing from secret rawflow-override-properties and save to a file called props:
          • napp-k get secret rawflow-override-properties -o jsonpath='{.data.appliance\-override\.properties}' | base64 -d > props
            Note: The above command is reading the property, converting it from base64 and saving the result in the file for later use.
      2. Now edit the file and change the values from false to true:
        • Before:
          flowFilter.excludeMulticast=false
          flowFilter.excludeBroadcast=false
          After:
          flowFilter.excludeMulticast=true
          flowFilter.excludeBroadcast=true
      3. Then convert the file contents back to a base64 value:
        • cat props | base64 -w 0
      4. Use the resulting base64 string from above to replace the original appliance-override.properties value in the secret rawflow-override-properties. The edit command below opens a vim editor, which you can use to edit the content and save.
        1. export KUBE_EDITOR=vim.tiny
        2. napp-k edit secret rawflow-override-properties
      5. Finally, restart the rawflow-driver:
        • napp-k delete pod spark-app-rawflow-driver
    2. For environments of NAPP versions 4.2 and newer.
      1. Login as "root" on the NSX Manager and perform the following steps.
      2. Edit the configuration for raw flow processing in the configmap rawflow-override-properties:
        1. export KUBE_EDITOR=vim.tiny
        2. napp-k edit cm rawflow-override-properties
          • Change the following values:
            • Before:
              flowFilter.excludeMulticast=false
              flowFilter.excludeBroadcast=false
              After:
              flowFilter.excludeMulticast=true
              flowFilter.excludeBroadcast=true
      3. Finally, restart the rawflow-driver:
        • napp-k delete pod spark-app-rawflow-driver
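
To confirm the change took effect, the stored properties and the restarted driver pod can be checked. This is a minimal sketch, assuming the properties are kept under the same appliance-override.properties key shown above:

  # NAPP 4.2 and newer: check the values stored in the configmap.
  napp-k get cm rawflow-override-properties -o yaml | grep flowFilter

  # NAPP versions older than 4.2: decode the secret and check the values.
  napp-k get secret rawflow-override-properties -o jsonpath='{.data.appliance\-override\.properties}' | base64 -d | grep flowFilter

  # Confirm the rawflow driver pod has been recreated and is running again.
  napp-k get pod | grep spark-app-rawflow-driver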


Option 3: Scale out Analytics and Data Storage services

Traffic flows are stored in both Analytics and Data Storage services. Analytics requires a minimum of four nodes to scale out, Data Storage requires a minimum of eight nodes to scale out.

Since Analytics requires a lower node count, you may start by scaling out Analytics first. If the alarm is not resolved, scale out both Analytics and Data Storage once you have eight nodes.

Prerequisites:

  • All existing nodes in your Tanzu Kubernetes Cluster (TKC) or upstream Kubernetes cluster must be in a healthy and ready state before you can scale out the NSX Application Platform.
  • Before proceeding with the scale-out procedures, ensure that your infrastructure administrator has already allocated the minimum number of nodes required for scaling out the NSX Application Platform services.

Procedure

  1. From your browser, log in with Enterprise Admin privileges to an NSX Manager at https://<nsx-manager-ip-address>.
  2. Navigate to System - NSX Application Platform.
  3. In the bottom-left corner of the NSX Application Platform section of the UI page, click Actions and select Scale Out from the drop-down menu.

    Note: The Scale Out action is only supported if you deployed the NSX Application Platform using the Advanced form factor. The action is not supported for Standard form factor deployment.

    If all of the services are scaled out already, the Scale Out button is disabled on the drop-down menu. In this case, it indicates that your cluster nodes have reached the maximum number of nodes allocated. Initially, the Advanced form factor is deployed with three nodes. You must first request that your infrastructure administrator add five more nodes to your current cluster before you can continue with the next steps. To scale out all of the services, you must have a total of eight worker nodes in your cluster.

  4. Select the All checkbox.
  5. In the Advanced Options section, ensure that all of the services available for the scale-out action are selected.

    Unless specifically advised by the VMware support team, ensure that all of the core services are selected so that the system can decide which of the core services must be scaled out. Scaling out one core service arbitrarily can lead to more resources being used without any improvement to system performance. Before proceeding with a single-category service scale-out, consult the VMware support team or confirm that you clearly understand what can happen if you scale out a single-category service.

  6. Click Scale Out.

    The UI displays the progress of the scale out operation.
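
If you also want to follow the scale-out from the command line, pod placement across the worker nodes can be observed with the same napp-k alias used elsewhere in this article. This is a minimal sketch; pod names and node counts vary by deployment:

  # List NAPP pods with the node each one is scheduled on; newly added worker nodes should start receiving pods.
  napp-k get pods -o wide

  # Watch overall pod status until all pods report Running or Completed.
  napp-k get pods -w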

For reference please review the following guide: https://techdocs.broadcom.com/us/en/vmware-security-load-balancing/vdefend/vmware-nsx-application-platform/4-2/deploying-and-managing-the-nsx-application-platform/managing-the-nsx-application-platform/scale-out-the-nsx-application-platform.html

Option 4: Enable External IP aggregation

If you have a large volume of north-south traffic but do not need the details of individual external (public) IPs, you can reduce the amount of data sent to NSX Intelligence by performing External IP aggregation at the host. This will aggregate all external IP addresses into one value: 255.255.255.255.
Note: The external (public) IP addresses that get affected are those outside the private IP ranges. Please refer to the section below, Optimize configuration of Private IP Ranges.

ATTENTION:

This will affect how new external flows are stored and used in NSX Intelligence.

  1. In Discover & Take Action, in compute view, when you right click on Public and select IP Addresses, you will not see the individual IP addresses of the new external flows.
  2. In Discover & Take Action, in group view or compute view, when you right click on Public or an entity connected to Public, and select Flow Details, or when you click on a connection connected to Public, you will not see individual IP addresses of the new external flows.
  3. Recommendations will not use the individual IP addresses of the new external flows.

Procedure
Log in as root via SSH on the NSX Manager, run the following command, and enter the admin password for NSX Manager when prompted.

curl --location --request PATCH 'https://<nsx-manager-ip-address>/policy/api/v1/infra/sites/default/intelligence/transport-node-profile' -H 'Content-Type: application/json' -d '{"enable_external_ip_aggregation": true}' -k -u admin
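
To confirm the setting (and any flow exclusion filter configured in Option 2) was applied, the same transport-node-profile object can be read back. This is a minimal sketch, assuming the endpoint supports a standard GET; the output is JSON:

curl --location --request GET 'https://<nsx-manager-ip-address>/policy/api/v1/infra/sites/default/intelligence/transport-node-profile' -k -u admin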

 

Optimize configuration of Private IP Ranges

  • If you know the private IP ranges used by east-west traffic in your network, it is recommended to set them as granular as possible. It is not recommended to create unnecessarily large IP ranges.
  • To maximize the benefit, use this in conjunction with Option 4: Enable External IP aggregation.

You can manage the private IP Ranges using the Private IP Ranges tab in Security - General Security Settings user interface. These private IP ranges are applicable for use by the NSX Intelligence and the NSX Network Detection and Response features when you activate either feature.

  • To enter an IPv4 IP range, click inside the IPv4 IP Range text box and enter the values, using the IPv4 CIDR notation format shown below the box. Press Enter for each entry, and click Save when finished.

  • To enter an IPv6 IP range, click inside the IPv6 IP Range text box and enter the values, using the IPv6 CIDR notation format shown below the box. Press Enter for each entry and click Save when finished.

The NSX Intelligence feature categorizes an IP address belonging to one of the CIDR notations listed in the dialog box as a private IP address. Any IP address that does not belong to any of these CIDR notations is classified as a public IP address. If the IP address of your VM or physical server does not fall into one of these CIDR notations, consider adding your CIDR notation using this Private IP Ranges UI.

For reference please review the following guide: https://techdocs.broadcom.com/us/en/vmware-security-load-balancing/vdefend/security-intelligence/4-0/using-and-managing-vmware-nsx-intelligence.html

If you are still experiencing issues after the above options, please open a support request with VMware NSX-T GSS and reference this KB.

Additional Information

The "predicted_full_period" and "current_retention_period" were static after the alarm was raised. Please follow the steps to check the latest predicted_full_period and current_retention_period.
1. ssh to the NSX manager

2. Run the command to get monitor pod name:
napp-k get pod -l=app.kubernetes.io/name=monitor,cluster-api-client=true

3 Run the command to check the latest predicted_full_period:
napp-k logs <monitor-pod-name> | grep "expect full days" | tail -1

You can find the predicted_full_period value after "expect full days" in the log

4 Run the command to check the latest current_retention_period
napp-k logs <monitor-pod-name> | grep "Druid Metrics" | tail -1

You can find the current_retention_period value before "retention days for correlated_flow_viz" in the log
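
The same checks can be combined into a short sequence that resolves the monitor pod name automatically. This is a minimal sketch, assuming a single monitor pod matches the label selector:

  # Resolve the monitor pod name once and reuse it for both log checks.
  POD=$(napp-k get pod -l=app.kubernetes.io/name=monitor,cluster-api-client=true -o jsonpath='{.items[0].metadata.name}')
  # Latest predicted_full_period (after "expect full days" in the log line).
  napp-k logs "$POD" | grep "expect full days" | tail -1
  # Latest current_retention_period (before "retention days for correlated_flow_viz" in the log line).
  napp-k logs "$POD" | grep "Druid Metrics" | tail -1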

Note: Only check the predicted_full_period and current_retention_period values this way while the alarm is open.