Sizing and scaling guidance when the Log Management Retention alert is raised

Products

VCF Operations

Issue/Introduction

The Log Management Health Dashboard raises a Retention alert when the platform cannot meet the configured retention period for one or more log partitions with the current cluster size and ingestion rate. The Retention alert can show one or more of the following symptoms:

Additional shards are required: more shards per log store node are needed to keep data for the configured retention period at the current ingestion rate.
Additional storage required: storage capacity per node may be too small to use all of the shards a node could hold.
Storage imbalance > 20%: disk usage across log store nodes is significantly uneven, which can also reduce effective retention.

The Log Management Health Dashboard does not report whether the retention period of a partition is met. Navigate to the Log Management Log Processing page for this information.

Even when a partition has a retention period configured by the administrator, the system may delete older backing indices automatically to keep the log store healthy. As a result, the actual (effective) retention can be lower than the configured retention.

Note: The “Additional storage required” symptom may produce false positives during transient cluster states. Treat this symptom as reliable only when specific conditions are met — see the reliability conditions in the Resolution section below before acting on it.

This article describes when automatic cleanup happens, how to determine which resource (shards or storage) is the bottleneck, and what to change in the cluster to meet the configured retention.

Environment

VMware Cloud Foundation Log Management 9.1
Log Management Health Dashboard

Cause

There are two main reasons why effective retention can fall below the configured retention period:

Per-node shard limit reached

Each deployment size has a maximum number of shards allowed per log store node (configured for the platform; values differ by size).
The platform monitors active shards per node. If a node is at or above the per-node maximum, the system deletes the oldest indices that have shards on overloaded nodes until the node is back under the limit. This frees shards but reduces how far back data can be retained.
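The shard-limit cleanup described above can be sketched as follows. This is a minimal illustration only, not the platform's implementation: the Index class, its fields, and the indices_to_delete function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Index:
    name: str
    created: int          # creation order; a smaller value means an older index
    shards_by_node: dict  # node name -> number of shards this index holds there


def indices_to_delete(indices, max_shards_per_node):
    """Pick the oldest indices to delete until no node is at or above the limit."""
    # Count active shards per node across all indices.
    counts = {}
    for idx in indices:
        for node, n in idx.shards_by_node.items():
            counts[node] = counts.get(node, 0) + n

    deleted = []
    for idx in sorted(indices, key=lambda i: i.created):
        overloaded = {node for node, c in counts.items() if c >= max_shards_per_node}
        if not overloaded:
            break  # every node is back under the limit
        # Only indices with shards on an overloaded node are candidates.
        if any(node in overloaded for node in idx.shards_by_node):
            for node, n in idx.shards_by_node.items():
                counts[node] -= n
            deleted.append(idx.name)
    return deleted
```

Because the oldest indices are removed first, each deletion shortens how far back the affected partition can be searched, which is why effective retention drops.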

Storage capacity reached

Per-node disk usage is monitored continuously, and a cleanup threshold is calculated from a configurable “storage emergency” percentage, with a small reserve held back for archive imports.
When projected per-node usage (current usage plus expected growth before the next cleanup cycle) is at or above the threshold, the system deletes the oldest backing indices that have shards on overloaded nodes until projected usage is below the threshold.
If storage on a node is too small relative to the shards the node can hold, the system reduces the number and size of partition shards as part of normal operation, which also reduces retention.
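The projected-usage check can be illustrated with a short sketch. The exact formula the platform uses is not given in this article, so the subtraction of the reserve from the emergency percentage is an assumption, and all function names, parameter names, and numbers below are hypothetical.

```python
def cleanup_threshold_gb(capacity_gb, emergency_pct, reserve_pct):
    """Assumed form: the emergency percentage minus a reserve for archive imports."""
    return capacity_gb * (emergency_pct - reserve_pct) / 100


def needs_cleanup(used_gb, growth_gb_per_hour, hours_to_next_cycle, threshold_gb):
    """Cleanup is triggered when projected usage is at or above the threshold."""
    projected_gb = used_gb + growth_gb_per_hour * hours_to_next_cycle
    return projected_gb >= threshold_gb


# Made-up values: a 1000 GB disk, 90% emergency level, 5% reserve.
threshold = cleanup_threshold_gb(1000, 90, 5)
print(threshold)
print(needs_cleanup(800, 20, 2, threshold))
print(needs_cleanup(820, 20, 2, threshold))
```

The point of projecting forward is that a node can trigger cleanup before its current usage reaches the threshold, if ingestion is expected to push it over before the next cycle.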

Other contributors

Storage imbalance across nodes can leave some nodes overloaded while others have free capacity. The system rebalances and ages out data over time, but a large imbalance can still trigger early cleanup on the busiest nodes.

Dangling indices and other operational states can consume disk space; the platform also cleans these up as part of its background work.

Resolution

When the Retention alert is raised:

Determine which resource is preventing the configured retention. If the per-node shard count is at the maximum, take one of the following actions: reduce the retention period for one or more partitions, add log store nodes, scale up to a larger deployment size (if you are not already at the largest size), or add log filters to reduce the volume of events that are persisted.
If you reduce retention periods, review the alert again after about an hour to see the effect.
If you prefer to add nodes or scale up, follow the sizing steps below to determine how many nodes or which size you need.
If you introduce filters, use event types to identify event groups that can be eliminated. Start with the partitions that receive the greatest volume.
As a last option, consider deleting a partition that is no longer needed. Deleting a partition will free up shards in the near term, but won’t reduce the rate at which they are consumed as new messages are received.
If the resource is storage, use the build -> lifecycle UI to increase the size of the log store disks for the deployment.

How to identify which resource is the bottleneck

In the Log Management UI, open the Retention alert and read the active symptoms. Map them to a resource:

“Additional shards are required” => shard limit is the bottleneck.
“Additional storage required” => storage capacity may be a bottleneck.
“Storage imbalance > 20%” => imbalance only; allow time for rebalancing, and confirm there is no shard or storage shortage at the same time.
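The mapping above can be written as a small lookup, shown here as a sketch for runbooks or scripts. The symptom strings are taken from this article; the triage function and the ordering (shards first, then storage, then imbalance) are assumptions made for the example.

```python
SYMPTOM_TO_FINDING = {
    "Additional shards are required": "shard limit is the bottleneck",
    "Additional storage required": "storage capacity may be a bottleneck",
    "Storage imbalance > 20%": "imbalance only",
}


def triage(active_symptoms):
    """Return (symptom, finding) pairs in the order they should be investigated."""
    return [
        (symptom, finding)
        for symptom, finding in SYMPTOM_TO_FINDING.items()
        if symptom in active_symptoms
    ]


for symptom, finding in triage({"Storage imbalance > 20%", "Additional shards are required"}):
    print(symptom, "=>", finding)
```

Listing the imbalance symptom last reflects that it is only treated as the primary finding when no shard or storage shortage is active at the same time.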

Reliability conditions for the “Additional storage required” symptom

The “Additional storage required” symptom may generate false positives during transient cluster states. Treat this symptom as reliable only when all of the following are true:

Ingestion has been at a stable steady state for some time, with no recent surges.
There has been no recent node addition or other topology change to the log store cluster.
The log store cluster health is green.
No other Retention symptoms (for example, “Additional shards are required” or “Storage imbalance > 20%”) are also active at the same time.
A few hours have passed since the most recent configuration or topology change, which ensures that at least one full cleanup/sync cycle has run.

If any of those conditions is not met, wait for the cluster to settle, then re-check the alert before treating storage as the bottleneck. As an independent check, use the Log Ingestion section of the Log Management Health Dashboard to view actual per-node disk usage before making capacity decisions.
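The reliability conditions amount to a checklist in which every item must hold. The sketch below expresses that as a single predicate. The ClusterState class and its fields are hypothetical, and the three-hour default for "a few hours" is an assumption, not a documented value.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    ingestion_stable: bool            # steady state, no recent surges
    recent_topology_change: bool      # node addition or other topology change
    health: str                       # log store cluster health, e.g. "green"
    other_retention_symptoms: int     # count of other active Retention symptoms
    hours_since_last_change: float    # since last configuration or topology change


def storage_symptom_is_reliable(state, min_settle_hours=3.0):
    """True only when every reliability condition is met."""
    return (
        state.ingestion_stable
        and not state.recent_topology_change
        and state.health == "green"
        and state.other_retention_symptoms == 0
        and state.hours_since_last_change >= min_settle_hours
    )


settled = ClusterState(True, False, "green", 0, 6.0)
print(storage_symptom_is_reliable(settled))
```

If the function returns False, the guidance is to wait and re-check, not to resize disks.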

Implementing the dashboard recommendations

The Retention alert tells you which resource is short and, where applicable, by how much. Use the dashboard’s recommendation as the source of truth and apply it as follows.

When the dashboard says additional shards are required

When more shards per node are needed, you can either scale out (add nodes at the current size) or scale up (move to a larger deployment size). Increasing the deployment size increases the number of shards supported per instance: a small deployment supports 700 shards per node, medium 800, and large 900. For example, a gap of 200 shards per node in a three-instance small deployment means the cluster needs 3 × (700 + 200) = 2,700 shards in total. Adding one more small instance provides 4 × 700 = 2,800 shards, which closes the gap. Scaling up to three medium instances provides only 3 × 800 = 2,400 shards, which leaves a gap of 100 shards per node.
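The arithmetic behind the scale-out versus scale-up comparison can be reproduced with a few helper functions. This is a sketch that assumes shards spread evenly across nodes; the function names are hypothetical, and the per-node limits are the ones quoted in this article.

```python
import math

SHARDS_PER_NODE = {"small": 700, "medium": 800, "large": 900}


def required_total_shards(size, nodes, gap_per_node):
    """Total shards the cluster needs, given the reported per-node gap."""
    return nodes * (SHARDS_PER_NODE[size] + gap_per_node)


def nodes_needed(size, total_shards):
    """Smallest node count of the given size that can hold total_shards."""
    return math.ceil(total_shards / SHARDS_PER_NODE[size])


def remaining_gap_per_node(size, nodes, total_shards):
    """Per-node shortfall left after moving to the given size and node count."""
    return max(0, math.ceil(total_shards / nodes) - SHARDS_PER_NODE[size])


total = required_total_shards("small", 3, 200)
print(total)                                       # 2700
print(nodes_needed("small", total))                # 4
print(remaining_gap_per_node("medium", 3, total))  # 100
```

Running the same functions with the gap reported by the dashboard shows whether one more node at the current size is enough, or whether a larger size would still fall short.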

Prefer scale out first, because a deployment size increase cannot be undone (you cannot scale down to a smaller size after a t-shirt size increase).

Use these instance-count limits as a guideline for when to consider scaling up instead of continuing to scale out:

Small deployment: scale out up to about 6 instances. If you would need more than that, plan a scale-up to medium.
Medium deployment: scale out up to about 12 instances. If you would need more than that, plan a scale-up to large.
Large deployment: continue to scale out within the maximum supported instance count for the large size. If you are already at that maximum and still short, the remaining options are reducing retention, adding log filters, or removing a partition.
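The instance-count guideline can be expressed as a simple decision helper. This is a sketch only: the function and constant names are hypothetical, the limits of 6 and 12 are the approximate values given above, and the maximum instance count for the large size is not stated in this article, so the caller must supply it.

```python
SCALE_OUT_GUIDELINE = {"small": 6, "medium": 12}
NEXT_SIZE = {"small": "medium", "medium": "large"}


def scaling_advice(size, nodes_needed, large_max_instances):
    """Suggest an action for the node count needed at the current size."""
    # For large, the limit is the supported maximum supplied by the caller.
    limit = SCALE_OUT_GUIDELINE.get(size, large_max_instances)
    if nodes_needed <= limit:
        return "scale out"
    if size in NEXT_SIZE:
        return "scale up to " + NEXT_SIZE[size]
    return "reduce retention, add filters, or remove a partition"


# 16 is a placeholder for the large-size maximum, not a documented value.
print(scaling_advice("small", 4, 16))
print(scaling_advice("small", 7, 16))
print(scaling_advice("large", 17, 16))
```

Because the small and medium limits are guidelines and not hard caps, treat the output as a prompt to plan, not as a rule.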

If a scale-up is required, plan for it carefully:

A size increase is one-way; you cannot return to a smaller size later, so size to your sustained workload, not to a temporary spike.
After scaling up, let the alert re-evaluate before considering further changes — the new size may resolve the symptom on its own.

If you would rather not change the cluster:

Reduce the retention period on one or more partitions. Reassess the alert after about an hour.
Add filters using event types to remove high-volume, low-value events. Start with the partitions that receive the greatest volume.
Delete a partition that is no longer needed.

When the dashboard says additional storage is required

First, confirm the reliability conditions above. Do not change disk size while the cluster is rebalancing, after a recent topology change, or while other Retention symptoms are also active.
If the conditions are met, increase the log store disk size in the Lifecycle Manager UI by the amount the dashboard indicates.
If the conditions are not met, wait for the cluster to reach a steady state, then re-check the alert. If the symptom clears on its own, no disk change is needed.
Reducing retention or adding filters can also relieve storage pressure if you prefer not to add disk capacity.

When the dashboard shows storage imbalance > 20%

No immediate action is required. The system rebalances shards and ages out data automatically.
Re-check after rebalance completes. If imbalance persists alongside an additional-shards or additional-storage symptom, treat the other symptom as the primary action item.

Filtering and event types

Use event types in the UI to identify high-volume event groups that can be filtered out, starting with the partitions that ingest the most. Filtering reduces both shard and disk pressure over time as new (smaller) backing indices roll over.

After you make a change

Adding nodes or scaling up: allow time for the cluster to rebalance shards and for the next cleanup cycle to run before reassessing. Follow the guidance in KB 431162.
Reducing retention or adding filters: reassess the alert after about an hour.

Additional Information

The platform performs background cleanup on a periodic cycle. Until the cycle completes, the alert may persist briefly even after a fix is applied.

Imbalance is reduced over time by automatic shard rebalancing and by the deletion of older data; in a balanced cluster, overall storage utilization remains below 100% because some capacity is reserved for archive imports and log management tasks.
Effective retention can be lower than the configured retention whenever the cluster cannot hold the data for the configured duration with the available shards or disk.