Alarms caught in Alarm Queue that should not be there.

Article ID: 444087

Updated On:

Products

DX Operational Observability

Issue/Introduction

A discrepancy occurs between an Alarm Queue and the Service built upon it.

The Setup: An Alarm Queue is configured with specific criteria (for example, containing specific message strings while excluding others like dpdmzuat). A Service (e.g., ### Services Alarms) is then built using this Alarm Queue as its data source.

The Issue: The Alarm Queue itself functions correctly and shows 0 alarms matching the criteria. However, the associated Service, when selected or viewed, displays a large number of alarms (e.g., over 4,000) that do not match the queue's criteria and should not be there.

Cause

This behavior is caused by making rapid, successive modifications to the Alarm Queue configuration.

When an Alarm Queue is tied directly to a service definition, saving multiple changes within a few seconds causes a processing bottleneck. To protect system stability from an overload of rapid state calculations, the backend may discard or fail to process some of these closely spaced updates. This leaves the Service out of sync with the queue, so it displays outdated or cached alarms that no longer match the queue's active criteria.

Resolution

To avoid this loss of synchronization, do not edit and save an Alarm Queue configuration several times in immediate succession.

The 15-Second Rule: Always wait approximately 15 seconds after saving a change before making and saving another modification to the same Alarm Queue. This ensures the backend has enough time to completely process and apply the service sync logic.
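If queue changes are ever applied from a script rather than by hand, the same pacing can be enforced in code. The following is a minimal sketch only: the article does not describe any API for saving an Alarm Queue, so `SavePacer` and the `save_fn` callable are hypothetical names, and the 15-second value is simply the approximate figure given above.

```python
import time

# Approximate gap recommended by the article; not a documented system limit.
MIN_GAP_SECONDS = 15.0


class SavePacer:
    """Enforce a minimum gap between successive saves to the same Alarm Queue."""

    def __init__(self, min_gap=MIN_GAP_SECONDS, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = min_gap
        self._clock = clock
        self._sleep = sleep
        self._last_save = {}  # queue name -> clock reading at the last save

    def wait_time(self, queue_name):
        """Return how many seconds remain before this queue may be saved again."""
        last = self._last_save.get(queue_name)
        if last is None:
            return 0.0
        return max(0.0, self.min_gap - (self._clock() - last))

    def save(self, queue_name, save_fn):
        """Wait out any remaining gap, call save_fn(), and record the save time.

        save_fn is a hypothetical callable that performs the actual save.
        """
        delay = self.wait_time(queue_name)
        if delay > 0:
            self._sleep(delay)
        result = save_fn()
        self._last_save[queue_name] = self._clock()
        return result
```

The gap is tracked per queue name, so edits to different Alarm Queues do not delay each other; only repeated saves to the same queue are spaced out.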

If your Service is already stuck showing mismatched or unexpected alarms, perform the following steps before contacting support:

Identify a Sample Alert: Note down the specific Alert ID of an alarm that appears in the Service but is missing from the Alarm Queue.

Collect Pod Logs: Retrieve the logs from the doi-servicealarm pod covering the timeframe in which the issue occurred.
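The two steps above can be sketched as a small helper, assuming the platform runs on Kubernetes and `kubectl` has access to the cluster. The full pod name (the article names only the doi-servicealarm pod, and the deployed name usually carries a generated suffix), the namespace, and the time window are all placeholders to be supplied by the operator. Whether the Alert ID appears verbatim in the log lines is also an assumption.

```python
import subprocess


def build_logs_command(pod_name, namespace, since="1h"):
    """Build a `kubectl logs` command for one pod, limited to a recent window."""
    return ["kubectl", "logs", pod_name, "-n", namespace, f"--since={since}"]


def lines_mentioning(log_text, alert_id):
    """Return the log lines that contain the sample Alert ID."""
    return [line for line in log_text.splitlines() if alert_id in line]


def collect_logs(pod_name, namespace, since="1h"):
    """Run kubectl and return the pod's log output as text (needs cluster access)."""
    cmd = build_logs_command(pod_name, namespace, since)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

A typical use would be to call `collect_logs` with the actual pod name and namespace, save the full output for the support case, and pass it to `lines_mentioning` with the Alert ID noted in the first step to find the relevant entries quickly.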