When the Retention alert is raised:
- Determine which resource is preventing the configured retention from being met. If the per-node shard count is at its maximum, choose one of the following: reduce the retention period for one or more partitions, add log store nodes, scale up to a larger deployment size (if you are not already at the largest size), or add log filters to reduce the volume of events that are persisted.
- If you reduce retention periods, review the alert again after about an hour to see the effect.
- If you prefer to add nodes or scale up, follow the sizing steps below to determine how many nodes or which size you need.
- If you introduce filters, use event types to identify event groups that can be eliminated. Start with the partitions that receive the greatest volume.
- As a last option, consider deleting a partition that is no longer needed. Deleting a partition will free up shards in the near term, but won’t reduce the rate at which they are consumed as new messages are received.
- If the resource is storage, use the Lifecycle Manager UI (Build -> Lifecycle) to increase the size of the log store disks for the deployment.
How to identify which resource is the bottleneck
In the Log Management UI, open the Retention alert and read the active symptoms. Map them to a resource:
- “Additional shards are required” => shard limit is the bottleneck.
- “Additional storage required” => storage capacity may be a bottleneck.
- “Storage imbalance > 20%” => imbalance only; allow time for rebalancing, and confirm there is no shard or storage shortage at the same time.
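The mapping above can be sketched as a small lookup. This is an illustrative sketch only: the symptom strings are the ones quoted above, while the resource labels, the function name `primary_bottleneck`, and the priority order are assumptions made for the example. The order reflects the guidance in this article that an imbalance symptom is secondary to any other active symptom, and that the storage symptom is less trustworthy when other symptoms are active.

```python
# Symptom text as quoted in this article; resource labels are illustrative.
SYMPTOM_TO_RESOURCE = {
    "Additional shards are required": "shards",
    "Additional storage required": "storage",
    "Storage imbalance > 20%": "imbalance",
}


def primary_bottleneck(active_symptoms):
    """Return the resource to look at first, or None if no known symptom is active."""
    resources = {
        SYMPTOM_TO_RESOURCE[s] for s in active_symptoms if s in SYMPTOM_TO_RESOURCE
    }
    # Assumed priority: shards, then storage, then imbalance on its own.
    for resource in ("shards", "storage", "imbalance"):
        if resource in resources:
            return resource
    return None


print(primary_bottleneck(["Storage imbalance > 20%", "Additional shards are required"]))
```

A result of "storage" should still be checked against the reliability conditions before any disk change is made.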
Reliability conditions for the “Additional storage required” symptom
The “Additional storage required” symptom may generate false positives during transient cluster states. Treat this symptom as reliable only when all of the following are true:
- Ingestion has been at a stable steady state for some time, with no recent surges.
- There has been no recent node addition or other topology change to the log store cluster.
- The log store cluster health is green.
- No other Retention symptoms (for example, “Additional shards are required” or “Storage imbalance > 20%”) are also active at the same time.
- A few hours have passed since the most recent configuration or topology change, which ensures that at least one full cleanup/sync cycle has run.
If any of these conditions is not met, wait for the cluster to settle, then re-check the alert before treating storage as the bottleneck. As an independent check, use the Log Ingestion section of the Log Management Health Dashboard to view actual per-node disk usage before making capacity decisions.
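The reliability conditions can be expressed as a simple checklist. The sketch below is hypothetical: the `ClusterState` fields and the `unmet_conditions` function are names invented for illustration, the values would have to be filled in by hand from the dashboard, and the default of 3 hours is only an assumed stand-in for "a few hours".

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    # All fields are filled in manually by the operator; nothing is queried here.
    ingestion_stable: bool
    recent_topology_change: bool
    health_green: bool
    other_retention_symptoms_active: bool
    hours_since_last_change: float


def unmet_conditions(state, min_settle_hours=3.0):
    """Return the reliability conditions that are NOT met (empty list = reliable).

    min_settle_hours is an assumption; the article only says "a few hours".
    """
    unmet = []
    if not state.ingestion_stable:
        unmet.append("ingestion is not at a stable steady state")
    if state.recent_topology_change:
        unmet.append("recent node addition or topology change")
    if not state.health_green:
        unmet.append("log store cluster health is not green")
    if state.other_retention_symptoms_active:
        unmet.append("other Retention symptoms are active")
    if state.hours_since_last_change < min_settle_hours:
        unmet.append("not enough time since the last change")
    return unmet


settled = ClusterState(True, False, True, False, hours_since_last_change=6.0)
print(unmet_conditions(settled))  # empty list: the storage symptom can be trusted
```

Returning the list of unmet conditions, rather than a single yes/no, makes it clear what to wait for before re-checking the alert.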
Implementing the dashboard recommendations
The Retention alert tells you which resource is short and, where applicable, by how much. Use the dashboard’s recommendation as the source of truth and apply it as follows.
When the dashboard says additional shards are required
When more shards per node are needed, you can either scale out (add instances at the current size) or scale up (move to a larger deployment size). Increasing the deployment size increases the number of shards supported per instance: small supports 700 shards per instance, medium 800, and large 900. For example, a gap of 200 shards per instance in a three-instance small deployment means that about 2,700 shards are needed against a capacity of 2,100. Adding one more small instance raises capacity to 2,800 and closes the gap. Scaling up to three medium instances would be insufficient: capacity would be 2,400, leaving a gap of 100 shards per instance.
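The arithmetic in this example can be reproduced with a few helper functions. This is a minimal sketch under stated assumptions: the per-size limits are the ones given above, the function names are hypothetical, and the total shard requirement is assumed to stay constant and spread evenly as instances are added.

```python
# Shards supported per instance for each deployment size (from this article).
SHARDS_PER_NODE = {"small": 700, "medium": 800, "large": 900}


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def required_total_shards(size, nodes, gap_per_node):
    """Total shards needed, given the current deployment and the reported gap."""
    return nodes * (SHARDS_PER_NODE[size] + gap_per_node)


def nodes_needed(total_shards, size):
    """Minimum instance count of the given size that can hold total_shards."""
    return ceil_div(total_shards, SHARDS_PER_NODE[size])


def remaining_gap_per_node(total_shards, size, nodes):
    """Shards per instance still missing after a proposed change (0 if none)."""
    return max(0, ceil_div(total_shards, nodes) - SHARDS_PER_NODE[size])


total = required_total_shards("small", nodes=3, gap_per_node=200)
print(total)                                       # 2700
print(nodes_needed(total, "small"))                # 4
print(remaining_gap_per_node(total, "medium", 3))  # 100
```

The dashboard recommendation remains the source of truth; the sketch only shows why four small instances cover the gap while three medium instances do not.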
Prefer scale out first, because a deployment size increase cannot be undone (you cannot scale down to a smaller size after a t-shirt size increase).
Use these instance-count limits as a guideline for when to consider scaling up instead of continuing to scale out:
- Small deployment: scale out up to about 6 instances. If you would need more than that, plan a scale-up to medium.
- Medium deployment: scale out up to about 12 instances. If you would need more than that, plan a scale-up to large.
- Large deployment: continue to scale out within the maximum supported instance count for the large size. If you are already at that maximum and still short, the remaining options are reducing retention, adding log filters, or removing a partition.
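The guideline above can be sketched as a small decision helper. It is illustrative only: `recommend` is a hypothetical function, the limits of 6 and 12 are the approximate guideline values from this list, and the maximum instance count for the large size is not stated here, so it is passed in as a parameter that you must supply for your deployment. The sketch never suggests a smaller size than the current one, because a size increase cannot be undone.

```python
# Shards supported per instance and scale-out guidelines (from this article).
SHARDS_PER_NODE = {"small": 700, "medium": 800, "large": 900}
SIZE_ORDER = ["small", "medium", "large"]


def recommend(current_size, required_total_shards, large_max_instances):
    """Return (size, instance_count), or None if no cluster change is enough.

    large_max_instances is an assumed input: the supported maximum for the
    large size is not given in this article.
    """
    scale_out_limit = {"small": 6, "medium": 12, "large": large_max_instances}
    # Only consider the current size and larger sizes (no scale-down).
    for size in SIZE_ORDER[SIZE_ORDER.index(current_size):]:
        needed = -(-required_total_shards // SHARDS_PER_NODE[size])  # ceiling
        if needed <= scale_out_limit[size]:
            return size, needed
    # Remaining options: reduce retention, add filters, or remove a partition.
    return None


print(recommend("small", 2700, large_max_instances=10))  # ('small', 4)
print(recommend("small", 5000, large_max_instances=10))  # ('medium', 7)
```

The helper ignores the current instance count and treats the limits as hard cutoffs, whereas the guideline describes them as approximate, so use its output as a starting point rather than a final plan.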
If a scale-up is required, plan for it carefully:
- A size increase is one-way; you cannot return to a smaller size later, so size to your sustained workload, not to a temporary spike.
- After scaling up, let the alert re-evaluate before considering further changes; the new size may resolve the symptom on its own.
If you would rather not change the cluster:
- Reduce the retention period on one or more partitions. Reassess the alert after about an hour.
- Add filters using event types to remove high-volume, low-value events. Start with the partitions that receive the greatest volume.
- Delete a partition that is no longer needed.
When the dashboard says additional storage is required
- First, confirm the reliability conditions above. Do not change disk size while the cluster is rebalancing, after a recent topology change, or while other Retention symptoms are also active.
- If the conditions are met, increase the log store disk size in the Lifecycle Manager UI by the amount the dashboard indicates.
- If the conditions are not met, wait for the cluster to reach a steady state, then re-check the alert. If the symptom clears on its own, no disk change is needed.
- Reducing retention or adding filters can also relieve storage pressure if you prefer not to add disk capacity.
When the dashboard shows storage imbalance > 20%
- No immediate action is required. The system rebalances shards and ages out data automatically.
- Re-check after rebalance completes. If imbalance persists alongside an additional-shards or additional-storage symptom, treat the other symptom as the primary action item.
Filtering and event types
- Use event types in the UI to identify high-volume event groups that can be filtered out, starting with the partitions that ingest the most. Filtering reduces both shard and disk pressure over time as new (smaller) backing indices roll over.
After you make a change
- Adding nodes or scaling up: allow time for the cluster to rebalance shards and for the next cleanup cycle to run before reassessing. Follow the guidance in KB 431162.
- Reducing retention or adding filters: reassess the alert after about an hour.