A domain outage occurred but no alarm is seen because the alarm is suppressed


Article ID: 144958


Products

CA Spectrum, CA eHealth

Issue/Introduction

A network domain node experienced an outage. While investigating in Spectrum, we have difficulty understanding the alarms on the devices. We do not see a specific alarm for the outage, but we do see events on devices within the domain indicating that an alarm was suppressed. What happened?

Cause

To determine the root cause of an outage, Spectrum's Fault Isolation depends on good, valid connections between all neighboring devices within the domain. If the connections between neighbors in the node topology are incomplete, Spectrum may be unable to contact all neighbors and therefore unable to determine the root cause. In these cases, Spectrum by default sends events to the Fault Isolation model and suppresses alarms on devices within the domain so that multiple outage alarms are not seen.

You may see event 0x10302 "Device lost Contact". This event is not configured to generate an alarm out of the box. It occurs on devices within the domain that Spectrum cannot contact and whose neighbors it also cannot contact. Typically, the common 0x10D35 "Device Lost Contact" Critical alarm is then suppressed on this device.

Environment

Release : 10.3

 

Resolution

If Fault Isolation cannot contact all neighbors to determine the root cause of the outage, it asserts an alarm either on the Fault Isolation model or on a device within that specific node, depending on the VNM > Fault Isolation Disposition setting.




If the setting is the default, "Fault Isolation Model", check the events on this model during the outage for clues.





If you want alarms on a device model instead of the Fault Isolation model, set Fault Isolation Disposition to "Device in Fault Domain". Then, when Fault Isolation is unable to determine the root cause, it asserts the alarm on a model in that domain. Spectrum chooses the model based on the "Criticality" setting of each device. Out of the box, every device has the same Criticality (1), so Spectrum instead picks the model with the lowest model_handle. So if Fault Isolation Disposition is set to "Device in Fault Domain" and Criticality has not been set on the devices, you have to check the events and alarms on each device in that domain during the outage period to find out which model Spectrum asserted the actual root cause alarm on.

From there, set the Criticality of the core routers and switches to a higher level. The highest is "7". If the most critical router is set to Criticality "7", then the next time Fault Isolation is unable to determine the root cause, it asserts the alarm on this model. If desired, also set other models to "6", "5", and so on. If Spectrum cannot assert the alarm on the "7" model for whatever reason, it falls back to the "6" model, then the "5" model, and so on. Note that the Fault Isolation Disposition setting must also be set to "Device in Fault Domain" for this to take effect.
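The selection behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic as this article describes it, not Spectrum code or a Spectrum API: the `DeviceModel` class, the `can_assert` flag, and the `pick_alarm_target` function are hypothetical names, and using the lowest model_handle as the tie-breaker at every Criticality level (not only the default of 1) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DeviceModel:
    """Hypothetical stand-in for a device model in a fault domain."""
    name: str
    model_handle: int
    criticality: int = 1      # out-of-box default described in the article
    can_assert: bool = True   # hypothetical: False if the alarm cannot be asserted here


def pick_alarm_target(disposition: str,
                      devices: List[DeviceModel]) -> Optional[DeviceModel]:
    """Return the device that would receive the alarm, or None when the
    alarm stays on the Fault Isolation model (or no device is eligible)."""
    if disposition != "Device in Fault Domain":
        return None  # default disposition: alarm goes to the Fault Isolation model

    candidates = [d for d in devices if d.can_assert]
    if not candidates:
        return None

    # Highest Criticality wins; ties fall to the lowest model_handle (assumed).
    return min(candidates, key=lambda d: (-d.criticality, d.model_handle))


domain = [
    DeviceModel("edge-sw-1", model_handle=0x100200),
    DeviceModel("core-rtr-1", model_handle=0x100300, criticality=7),
    DeviceModel("core-sw-1", model_handle=0x100400, criticality=6),
]
target = pick_alarm_target("Device in Fault Domain", domain)
print(target.name if target else "Fault Isolation model")
```

With every device left at the default Criticality, the same function returns the device with the lowest model_handle, which is why the alarm can land on an unexpected model until Criticality is set on the core devices.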




Further, there must not be any Event Customization on events related to Fault Isolation. Customizing these events can interfere with Spectrum's algorithms for determining root cause, and alarms may not be asserted properly. Make sure there are no customizations on the events listed at the link below. Note that even changing the Severity value on some of these events (for example, from a Critical to a Major alarm) can cause Fault Isolation to behave erratically.

https://techdocs.broadcom.com/content/broadcom/techdocs/us/en/ca-enterprise-software/it-operations-management/spectrum/10-3-2/managing-network/event-configuration/event-and-alarm-customization.html