vDefend SSP Alarm: Analytics and Data Storage disk usage is growing faster than expected.

Article ID: 384117

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

You are running SSP 5.0 or later.
The Analytics flows disk usage alarm "Analytics and Data Storage disk usage is growing faster than expected" is raised when SSP estimates that the disks will not be able to store flows for the current retention period.
- The alarm description reads: "The rate at which disk space used for analytics and data storage is growing is exceeding expectations. The rate is represented by a ratio of currentRetentionDays/predictedFullDays."
- The threshold for the ratio of currentRetentionDays to predictedFullDays is set to 1. A ratio above 1 means the disks are predicted to fill before the retention period elapses, so flow storage usage is growing too fast. A ratio below 1 means flow storage usage is under control.
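
As an illustration of the ratio described above, here is a minimal Python sketch. The function names and the sample numbers are hypothetical and are not taken from SSP; only the ratio and the threshold of 1 come from the alarm description.

```python
ALARM_THRESHOLD = 1.0  # threshold stated in the alarm description


def growth_ratio(current_retention_days: float, predicted_full_days: float) -> float:
    """Return currentRetentionDays / predictedFullDays."""
    if predicted_full_days <= 0:
        raise ValueError("predicted_full_days must be positive")
    return current_retention_days / predicted_full_days


def alarm_expected(current_retention_days: float, predicted_full_days: float) -> bool:
    """True when the ratio exceeds the threshold of 1."""
    return growth_ratio(current_retention_days, predicted_full_days) > ALARM_THRESHOLD


# Hypothetical example: 30-day retention, disk predicted to fill in 20 days.
print(growth_ratio(30, 20))    # 1.5
print(alarm_expected(30, 20))  # True
print(alarm_expected(30, 45))  # False
```

With a 30-day retention period and a disk predicted to fill in 20 days, the ratio is 1.5, so the alarm would be expected under these assumptions.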

Environment

vDefend SSP 5.0

Cause

The primary cause of this issue is that the volume of flows exceeds the system's processing and flow storage capacity. The SSP operates under the assumption that analyzed traffic flows will exhibit some level of repetition, allowing for a certain degree of aggregation. However, when the volume of flows is excessively high and there is significant uniqueness—such as an unusually large number of unique IP addresses or ports—the system is unable to compact the data efficiently. This leads to disk usage increasing at an accelerated rate.

The alarm "Analytics and Data Storage disk usage is growing faster than expected" is triggered when SSP estimates that the disk will be unable to store traffic flow data for the required 30-day retention period.

Resolution

Recommended Workaround: Scale out Analytics and Data Storage services

Traffic flows are stored across both the Analytics and Data Storage services. The Analytics service requires a minimum of five nodes to scale out, whereas the Data Storage service requires a minimum of eight nodes.

To determine the recommended number of worker nodes for the current traffic flow volume, utilize the SSP Sizing Tool. For detailed instructions on using this tool, refer to the relevant KB article: https://knowledge.broadcom.com/external/article/373793/security-intelligence-sizing-tool.html

Prerequisites:

All existing nodes in your Kubernetes cluster must be in a healthy and ready state before you can scale out the Security Service Platform.
Before proceeding with the scale-out procedures, ensure that your infrastructure administrator has already allocated the minimum number of nodes required for scaling out the SSP services.

Procedure

From your browser, log in with Enterprise Admin privileges to SSP at https://<ssp-fqdn>.
Navigate to System - Infrastructure - Platform & Services.
In the bottom-left corner of the Platform & Services section of the UI page, click the Scale Out button.
Note: The Scale Out action is only supported if you deployed the SSP using the Advanced form factor. The action is not supported for Evaluation form factor deployment.

If all of the services are already scaled out, the Scale Out button in the pop-up dialog is disabled. This indicates that your cluster has reached the maximum number of nodes allocated. Initially, the Advanced form factor is deployed with four nodes. You must first ask your infrastructure administrator to add four more nodes to your current cluster before you can continue with the next steps. To scale out all of the services, you must have a total of eight worker nodes in your cluster.
Select the All checkbox.
In the Advanced Options section, ensure that all of the services available for the scale-out action are selected.
Unless specifically advised by the Broadcom support team, ensure that all of the core services are selected so that the system can decide which of the core services must be scaled out. Scaling out one core service arbitrarily can consume more resources without improving system performance. Before proceeding with a single-category service scale-out, consult the Broadcom support team or confirm that you clearly understand the consequences of scaling out a single-category service.
Click Scale Out.
The UI displays the progress of the scale out operation.

For reference, please review the "Scale Out" section (WIP) in the following guide: https://techdocs.broadcom.com/us/en/vmware-security-load-balancing/vdefend/security-services-platform/5-0.html

Other options:

Note: Please try the primary workaround (scale out) first before trying the following options.

If the recommended number of worker nodes exceeds the maximum supported limit, or if scaling out to the recommended size is not currently feasible, consider implementing the following options in the order presented.

Option 1: Change to Dynamic Flow Data Retention

These are the strategies to manage flow data retention if the flow storage has reached its limits:

Pause flow ingestion until storage is available (Default Option)
1. Temporarily suspends flow data ingestion when the analytics and data storage disk is nearing maximum capacity. When the disk usage exceeds a threshold, flow data ingestion is paused across all clusters and standalone hosts.
2. The threshold is calculated as: threshold = flow storage - 3 * daily average usage, where daily average usage = current disk usage / number of days of data in storage.
3. The predicted usage is based on the existing usage and is calculated as: predicted usage = data retention period * daily average usage. When the predicted usage drops below the threshold, flow data ingestion is resumed.
4. There are two ways to resume the flow of data ingestion.
  - Scale-out to increase the data storage disk volume and the threshold.
  - Select the Reduce flow data retention dynamically option to reduce the data retention period and the data size.
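
The formulas in the list above can be worked through with a short Python sketch. The function names, the GB unit, and the sample numbers are assumptions for illustration only; the arithmetic follows the formulas as stated in this article, not SSP's internal implementation.

```python
def daily_average_usage(current_usage_gb: float, days_of_data: float) -> float:
    """daily average usage = current disk usage / days of data in storage."""
    return current_usage_gb / days_of_data


def pause_threshold(flow_storage_gb: float, daily_avg_gb: float) -> float:
    """threshold = flow storage - 3 * daily average usage."""
    return flow_storage_gb - 3 * daily_avg_gb


def predicted_usage(retention_days: float, daily_avg_gb: float) -> float:
    """predicted usage = data retention period * daily average usage."""
    return retention_days * daily_avg_gb


def can_resume(retention_days: float, flow_storage_gb: float, daily_avg_gb: float) -> bool:
    """Ingestion resumes when predicted usage drops below the threshold."""
    threshold = pause_threshold(flow_storage_gb, daily_avg_gb)
    return predicted_usage(retention_days, daily_avg_gb) < threshold


# Hypothetical numbers: 1000 GB of flow storage, 600 GB used by 10 days of data.
avg = daily_average_usage(600, 10)   # 60.0 GB per day
print(pause_threshold(1000, avg))    # 820.0
print(predicted_usage(30, avg))      # 1800.0
print(can_resume(30, 1000, avg))     # False
print(can_resume(10, 1000, avg))     # True
```

In this hypothetical case, a 30-day retention period predicts 1800 GB against a threshold of 820 GB, so ingestion stays paused. Either a larger flow storage (scale out) or a shorter retention period brings the predicted usage below the threshold.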
Reduce flow data retention dynamically
1. Reducing flow data retention decreases the number of days the data is stored in the database. This option prunes old data and saves storage space. The data retention is calculated based on two key factors: the size of the data and the average amount of data received per day.
2. To illustrate, here are some data retention scenarios:
  - Scenario 1: The initial data retention is configured for 30 days, and by day 15 the disk is full. The data retention is reduced to 15 days.
  - Scenario 2: The initial data retention is configured for 30 days, and very little data is received for the first 14 days. On day 15, a data influx causes the disk to become full. The data retention is reduced to 15 days.
  - Scenario 3: The initial data retention is configured for 30 days, and the disk is full on day two. The data retention is reduced to two days.
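
The three scenarios can be reproduced with a simplified Python model. This is only a sketch that mirrors the scenarios above: the function name, the GB unit, and the per-day ingest figures are hypothetical, and SSP's actual retention calculation may differ.

```python
from itertools import accumulate


def effective_retention_days(configured_days, daily_ingest_gb, capacity_gb):
    """Return the day on which cumulative ingest fills the disk,
    capped at the configured retention period."""
    for day, used in enumerate(accumulate(daily_ingest_gb), start=1):
        if day > configured_days:
            break
        if used >= capacity_gb:
            return day  # disk is full on this day
    return configured_days  # disk never filled within the retention period


CAPACITY_GB = 150  # hypothetical disk capacity

# Scenario 1: steady ingest fills the disk by day 15.
print(effective_retention_days(30, [10] * 30, CAPACITY_GB))          # 15
# Scenario 2: little data for 14 days, then an influx on day 15.
print(effective_retention_days(30, [1] * 14 + [200], CAPACITY_GB))   # 15
# Scenario 3: heavy ingest fills the disk on day two.
print(effective_retention_days(30, [100] * 30, CAPACITY_GB))         # 2
```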

Procedures:

From your browser, log in with Enterprise Administrator privileges to an SSP at https://<ssp-fqdn>.
In the SSP UI, on the System tab, in the Settings section, select Data Collection.
Under "When the flow storage has reached its limit, the selected option is applied across all clusters and standalone hosts", there are two options: Reduce flow data retention dynamically and Pause flow ingestion until storage is available.
1. Choose the option "Reduce flow data retention dynamically".

You can view the data retention period and number of existing flows on SSP UI.
Select System > Platform & Services > Metrics and scroll to the Druid Average Retention Days.
Select System > Platform & Services > Metrics and scroll to the Total Flows and Unique Flows.

Option 2: Configure Data Collection in SSP

If you can identify the ESXi hosts and vSphere clusters that carry mostly East-West (EW) traffic, for example over 90% EW and less than 10% North-South (NS), you can enable data collection for those hosts and clusters first and gradually enable it for the ones with more NS traffic. North-South traffic tends to have more unique IPs, which is more likely to adversely affect data compaction.

Procedure:

By default, SSP collects network traffic data on all standalone hosts and clusters of hosts. If necessary, you can optionally stop data collection from a standalone host or cluster of hosts.

From your browser, log in with Enterprise Administrator privileges to an SSP at https://<ssp-fqdn>.
In the SSP UI, on the System tab, in the Settings section, select Data Collection.
To manage traffic data collection for one or more hosts, perform one of the following steps.
The system updates the Collection Status value for each affected host to Deactivated or Activated, depending on the data collection mode you had set.
1. To stop traffic data collection, select the host or hosts in the Standalone Host section, click Deactivate, and click Confirm when prompted if you are sure.
2. To start traffic data collection, select the host or hosts, click Activate, and click Confirm when prompted if you are sure.
To manage traffic data collection for one or more clusters of hosts, perform one of the following steps.
1. To stop data collection for one or more clusters, select the cluster or clusters in the Cluster section, click Deactivate, and click Confirm when prompted if you are sure.
2. To start traffic data collection, select the cluster or clusters, click Activate, and click Confirm when prompted if you are sure.

For reference please review the "Configure SSP Settings" section(WIP) in the following guide: https://techdocs.broadcom.com/us/en/vmware-security-load-balancing/vdefend/security-services-platform/5-0.html

Option 3: Filter out broadcast and/or multicast flows

Note: This option can be used where broadcast and/or multicast flows are not required for security policy or similar guidance. If broadcast and/or multicast flows are important to you, do not enable this option.

You can disable broadcast and/or multicast flows from getting stored in SSP to reduce disk usage.

This will only affect new flows which are not yet processed by SSP. Existing broadcast/multicast flows will still be visible, until the retention period (30 days) is reached.

To achieve this, please contact Broadcom Support for further assistance.

Option 4: Enable External IP aggregation & Optimize configuration of Private IP Ranges

If you have a large volume of north-south traffic but do not need the details of individual external (public) IP addresses, you can reduce the amount of data sent to SSP by performing External IP aggregation at the host. This aggregates all external IP addresses to one value: 255.255.255.255.
Note: The external (public) IP addresses that get affected are those outside the private IP ranges. Please refer to the section below to Optimize configuration of Private IP Ranges.
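
To show why aggregation reduces the number of unique values, here is a minimal Python sketch. It is not the host's implementation: the function name is hypothetical, and the private ranges below are assumed to be the RFC 1918 defaults, whereas the actual ranges depend on your Private IP Ranges configuration.

```python
import ipaddress

AGGREGATED_EXTERNAL_IP = "255.255.255.255"  # value stated in this article

# Assumption: RFC 1918 ranges stand in for the configured private IP ranges.
ASSUMED_PRIVATE_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def aggregate_external_ip(ip: str, private_ranges=ASSUMED_PRIVATE_RANGES) -> str:
    """Keep private addresses as-is; map every other address to one value."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in private_ranges):
        return ip
    return AGGREGATED_EXTERNAL_IP


# Hypothetical flow endpoints: two private and three public addresses.
endpoints = ["10.1.2.3", "192.168.0.5", "8.8.8.8", "1.1.1.1", "9.9.9.9"]
aggregated = [aggregate_external_ip(ip) for ip in endpoints]
print(len(set(endpoints)), "->", len(set(aggregated)))  # 5 -> 3
```

The three public addresses collapse into a single value, so flows that differ only in their external address can be stored as one.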

Impact:

This will affect how new external flows are stored and used in SSP.

In Monitor & Plan → Visibility & Planning, in compute view, when you right click on External and select IP Addresses, you will not see the individual IP addresses of the new external flows.
In Monitor & Plan → Visibility & Planning, in group view or compute view, when you right click on External or an entity connected to Public, and select Flow Details, or when you click on a connection connected to Public, you will not see individual IP addresses of the new external flows.
Recommendation will not use the individual IP addresses of the new external flows.

To achieve this, please contact Broadcom Support for further assistance.
