We have quite a few devices whose clear messages (...is OK or ...has returned to Normal) suddenly arrive in the alarm console as major alarms. Instead of clearing the original alarm, each clear message just creates another alarm. We are not sure what causes this, but it seems to have been going on for a while now. How can we correct this?
The customer’s ToT auto-clear interval was set to 60 seconds for the devices that were sending clear messages, e.g., messages ending with “…is OK” and “…has returned to normal.” The alarm_enrichment probe was changing these messages to major severity, and the resulting major alarms were then clearing within 1 minute. The 1-minute clear time frame was the clue as to what was happening.
As the Help documentation for ToT configuration of events states, you cannot set the clear interval to be LESS THAN the ToT monitoring interval. In the customer’s environment, the templates applied to specific devices in PROD and TEST were identical, yet the issue was occurring ONLY in Production. The reason was the ‘flapping’ behavior (rapid alarm generation and clearing) of a specific set of devices in Production.
The Clear Delay Time <TC> value MUST NOT be less than the Time Over Threshold <TOT> interval value for automatically clearing alarms. This applies to every QOS metric that has a ToT setting in ALL of your ACTIVE snmpcollector templates. Once the auto-clear interval value is set to be greater than the ToT interval value, the frequent generation of alarms and unexpected alarm ‘effects’ such as those described above will no longer occur.
This is stated in the Best Practices section for the 'Time Over Threshold' event rule (How-to Articles):
The Clear Delay Time <TC> value MUST NOT be less than the Time Over Threshold <TOT> interval value for automatically clearing alarms.
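Because the rule applies to every QOS metric with a ToT setting in every active template, it may help to check the values mechanically. The following is a minimal Python sketch, not a Broadcom tool: it assumes the TOT and TC values have already been copied out of the templates by hand into a list, and that both values are expressed in the same unit. The `TotSetting` class, the `find_violations` function, and the template and metric names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TotSetting:
    """One ToT-enabled metric, copied by hand from a template (hypothetical structure)."""
    template: str        # illustrative template name
    metric: str          # illustrative QOS metric name
    tot: float           # Time Over Threshold <TOT> value
    clear_delay: float   # Clear Delay Time <TC> value, same unit as tot (assumption)


def find_violations(settings: List[TotSetting]) -> List[TotSetting]:
    """Return the settings whose Clear Delay Time is less than the ToT value."""
    return [s for s in settings if s.clear_delay < s.tot]


if __name__ == "__main__":
    # Illustrative values only; replace with the values from your own templates.
    settings = [
        TotSetting("example_template", "QOS_EXAMPLE_METRIC_A", tot=16, clear_delay=1),
        TotSetting("example_template", "QOS_EXAMPLE_METRIC_B", tot=16, clear_delay=20),
    ]
    for s in find_violations(settings):
        print(f"{s.template}/{s.metric}: TC {s.clear_delay} is less than TOT {s.tot}")
```

With the illustrative values shown, only the first entry is reported, because its clear delay is shorter than its ToT value.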
For this customer, the snmpcollector templates in TEST/DEV were EXACTLY the same as the templates in PROD, and the overall configuration had been the same for a long time. The alarm issues started occurring unexpectedly, ‘out of the blue’, in Production, most likely because of the rapid ‘flapping’ (alarming/clearing) behavior of some of their devices. That is why the issue did not occur, and could not be reproduced, in their Test environment or in Broadcom’s own Lab/Test environment.
WARNING: Setting a smaller Auto-clear window may result in an excessive number of alarms as well as cause other unexpected alarm results.
Once the Auto-clear value was set to a value greater than the ToT interval, the issue no longer occurred, for example:
"Time Over Threshold <TOT>" set to 16 but with "Clear Delay Time <TC>" now set to 20 minutes.