NAPP assumes that the traffic flows being analyzed will have some patterns and therefore a certain degree of aggregation. When there is too much uniqueness in the flows, for example too many unique IPs or ports, the data is not compacted efficiently. This results in disk usage growing faster than expected.
When NAPP calculates that the disk cannot store flows for 30 days, the alarm is raised.
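The alarm condition can be illustrated with a simple projection. The capacity and growth numbers below are hypothetical placeholders, not values taken from any NSX deployment:

```shell
# Hypothetical illustration of the alarm condition: if the disk is
# projected to fill before the 30-day retention period, the alarm fires.
DISK_CAPACITY_GB=1000      # assumed total capacity available for flow data
DAILY_GROWTH_GB=50         # assumed observed growth per day
RETENTION_DAYS=30

predicted_full_days=$(( DISK_CAPACITY_GB / DAILY_GROWTH_GB ))
if [ "$predicted_full_days" -lt "$RETENTION_DAYS" ]; then
  echo "ALARM: disk predicted full in ${predicted_full_days} days (< ${RETENTION_DAYS})"
else
  echo "OK: disk predicted full in ${predicted_full_days} days"
fi
```

With the assumed numbers, the disk is projected to fill in 20 days, which is below the 30-day retention target, so the alarm would be raised.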
If the Overflow lag alarm is observed ("The number of pending messages in the messaging topic Over Flow is above the pending message threshold of 100000."), then:
If NAPP was upgraded to 4.2 and the system was scaled out prior to the upgrade, run the following on the NSX Manager:
If the alarm still persists after 3 hours, then:
There are several options which can be used to help alleviate the issue.
Option 1: Configure Data Collection in NSX Intelligence
If you can identify the ESXi hosts and vSphere clusters with mostly East-West (EW) traffic, for example over 90% of traffic is EW and 10% is North-South (NS), you can enable data collection for those EW hosts and clusters first and gradually enable it for the NS ones. North-south traffic tends to have more unique IPs, which is more likely to adversely affect the data compaction.
This will help alleviate the high storage growth, while other tuning options are explored below.
Procedure:
By default, NSX Intelligence collects network traffic data on all standalone hosts and clusters of hosts. If necessary, you can optionally stop data collection from a standalone host or cluster of hosts.
For reference, please review the following guide: https://techdocs.broadcom.com/us/en/vmware-security-load-balancing/vdefend/security-intelligence/4-0/activating-and-upgrading-vmware-nsx-intelligence.html
Option 2: Filter out broadcast and/or multicast flows.
Note: This option can be used where broadcast and/or multicast flows are not required for security policy or similar guidance. If broadcast and/or multicast flows are important to you, do not enable this option.
You can disable broadcast and/or multicast flows from getting stored in NSX Intelligence to reduce disk usage.
This will only affect new flows which are not yet processed by NSX Intelligence. Existing broadcast/multicast flows will remain visible until the retention period (30 days) is reached.
The following steps can be used together or by themselves.
Option 3: Scale out Analytics and Data Storage services
Traffic flows are stored in both the Analytics and Data Storage services. Analytics requires a minimum of four nodes to scale out; Data Storage requires a minimum of eight nodes.
Since Analytics requires a lower node count, you may start by scaling out Analytics first. If the alarm is not resolved, scale out both Analytics and Data Storage once you have eight nodes.
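The node thresholds above can be sketched as a small helper. This is illustrative only, not an NSX command; the node counts (four and eight) come from this article:

```shell
# Illustrative sketch (not an NSX tool): which services can be scaled out
# given the current worker node count, per the thresholds in this article.
scale_out_options() {
  nodes=$1
  if [ "$nodes" -ge 8 ]; then
    echo "Analytics and Data Storage can be scaled out"
  elif [ "$nodes" -ge 4 ]; then
    echo "Analytics can be scaled out; Data Storage needs 8 nodes"
  else
    echo "Add nodes: Analytics needs 4, Data Storage needs 8"
  fi
}

scale_out_options 4   # enough for Analytics only
scale_out_options 8   # enough for both services
```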
Prerequisites:
Procedure
Note: The Scale Out action is only supported if you deployed the NSX Application Platform using the Advanced form factor. The action is not supported for Standard form factor deployment.
If all of the services are already scaled out, the Scale Out button is disabled in the drop-down menu, indicating that your cluster has reached the maximum number of allocated nodes. The Advanced form factor is initially deployed with three nodes. You must first ask your infrastructure administrator to add five more nodes to your current cluster before you can continue with the next steps. To scale out all of the services, you must have a total of eight worker nodes in your cluster.
Unless specifically advised by the VMware support team, ensure that all of the core services are selected so that the system can decide which core services must be scaled out. Scaling out a single core service arbitrarily can consume more resources without improving system performance. Before proceeding with a single-category service scale out, consult the VMware support team or confirm that you clearly understand the consequences.
The UI displays the progress of the scale out operation.
For reference, please review the following guide: https://techdocs.broadcom.com/us/en/vmware-security-load-balancing/vdefend/vmware-nsx-application-platform/4-2/deploying-and-managing-the-nsx-application-platform/managing-the-nsx-application-platform/scale-out-the-nsx-application-platform.html
Option 4: Enable External IP aggregation
If you have a large volume of north-south traffic but do not need the details of individual external (public) IPs, you can reduce the amount of data sent to NSX Intelligence by performing external IP aggregation at the host. This aggregates all external IP addresses to one value: 255.255.255.255.
Note: The external (public) IP addresses affected are those outside the configured private IP ranges. Refer to the section below, Optimize configuration of Private IP Ranges.
ATTENTION:
This will affect how new external flows are stored and used in NSX Intelligence.
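The aggregation behavior can be sketched as follows. This is an illustrative model, not NSX code, and it checks only the default RFC 1918 private ranges; an actual deployment uses the configurable private IP ranges described later in this article:

```shell
# Illustrative sketch of external IP aggregation (not NSX code):
# addresses outside the RFC 1918 private ranges collapse to the single
# value 255.255.255.255, which is what reduces flow uniqueness.
aggregate_external_ip() {
  case "$1" in
    10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*)
      echo "$1" ;;               # private: kept as-is
    *)
      echo "255.255.255.255" ;;  # external: aggregated
  esac
}

aggregate_external_ip 10.1.2.3       # private, unchanged
aggregate_external_ip 203.0.113.7    # external, aggregated
```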
Procedure
Log in as root via SSH on the NSX Manager, run the following command, and enter the admin password for the NSX Manager when prompted:
curl --location --request PATCH 'https://<nsx-manager-ip-address>/policy/api/v1/infra/sites/default/intelligence/transport-node-profile' -H 'Content-Type: application/json' -d '{"enable_external_ip_aggregation": true}' -k -u admin
Optimize configuration of Private IP Ranges
You can manage private IP ranges using the Private IP Ranges tab in the Security - General Security Settings user interface. These private IP ranges are used by the NSX Intelligence and NSX Network Detection and Response features when either feature is activated.
To enter an IPv4 IP range, click inside the IPv4 IP Range text box and enter the values, using IPv4 IP CIDR notation format shown below the box. Press Enter for each entry, and click Save when finished.
To enter an IPv6 IP range, click inside the IPv6 IP Range text box and enter the values, using the IPv6 CIDR notation format shown below the box. Press Enter for each entry and click Save when finished.
The NSX Intelligence feature categorizes an IP address belonging to one of the CIDR ranges listed in the dialog box as a private IP address. Any IP address that does not belong to any of these CIDR ranges is classified as a public IP address. If the IP address of your VM or physical server does not fall into one of these ranges, consider adding your CIDR range using this Private IP Ranges UI.
For reference, please review the following guide: https://techdocs.broadcom.com/us/en/vmware-security-load-balancing/vdefend/security-intelligence/4-0/using-and-managing-vmware-nsx-intelligence.html
If you are still experiencing issues after trying the above options, please open a support request with VMware NSX-T GSS and reference this KB.
The "predicted_full_period" and "current_retention_period" values remain static after the alarm is raised. Follow these steps to check the latest predicted_full_period and current_retention_period:
1. SSH to the NSX Manager.
2. Run the following command to get the monitor pod name:
napp-k get pod -l=app.kubernetes.io/name=monitor,cluster-api-client=true
3. Run the following command to check the latest predicted_full_period:
napp-k logs <monitor-pod-name> | grep "expect full days" | tail -1
You can find the predicted_full_period value after "expect full days" in the log output.
4. Run the following command to check the latest current_retention_period:
napp-k logs <monitor-pod-name> | grep "Druid Metrics" | tail -1
You can find the current_retention_period value before "retention days for correlated_flow_viz" in the log output.
Notice: Only check the predicted_full_period and current_retention_period values this way while the alarm is open.
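The value extraction in steps 3 and 4 can be sketched as follows. The sample log lines below are hypothetical; the real monitor pod output may differ, but per this article predicted_full_period follows "expect full days" and current_retention_period precedes "retention days for correlated_flow_viz":

```shell
# Hypothetical sample log lines standing in for the monitor pod output.
log_line1='... expect full days 21 ...'
log_line2='... 30 retention days for correlated_flow_viz ...'

# Pull the number after "expect full days".
predicted=$(echo "$log_line1" | sed -n 's/.*expect full days \([0-9]*\).*/\1/p')
# Pull the number before "retention days for correlated_flow_viz".
retention=$(echo "$log_line2" | sed -n 's/.*[^0-9]\([0-9]*\) retention days for correlated_flow_viz.*/\1/p')

echo "predicted_full_period=${predicted}"
echo "current_retention_period=${retention}"
```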