Best Practice to deal with excessive SNMP and ICMP communication lost alarms in DX Spectrum. DEVICE STOPPED RESPONDING and MANAGEMENT AGENT LOST


Article ID: 21760


Updated On:

Products

CA Spectrum DX NetOps

Issue/Introduction

This best practice knowledge base document will focus on Management Agent Lost alarms and Device Has Stopped Responding to Polls alarms.

The following alarms are generated in Spectrum OneClick more often than the device agent actually goes down:

MANAGEMENT AGENT LOST (alarm id 0x10701)
DEVICE HAS STOPPED RESPONDING TO POLLS (alarm id 0x10009)

The following events are seen:

0x10d35
{d "%w- %d %m-, %Y - %T"} - Device {m} of type {t} has stopped responding to polls and/or external requests. An alarm will be generated. (event [{e}])

0x10daa
{d "%w- %d %m-, %Y - %T"} - Device {m} of type {t} is no longer responding to primary management requests (e.g. SNMP), but appears to be responsive to other communication protocol (e.g. ICMP). This condition has persisted for an extended amount of time. An alarm will be generated. (event [{e}])

 

Environment

Release: All Spectrum Versions

Cause

If the agent on the device is not actually going down as often as these alarms are being generated then the communication timeout values in Spectrum may not be high enough.

Resolution

CA Spectrum Modeling Information area: 



Out of the box, Spectrum is configured to send an SNMP GET to the device at each polling interval. The default polling interval is typically 300 seconds. When this interval elapses, Spectrum sends a poll.

If no response is received, Spectrum uses the values of the DCM Timeout and DCM Retry Count attributes to determine how long to wait and how many times to resend the request. By default, the timeout is 3000 ms (3 seconds) with 3 retries.

If no response is received from the device within this timeframe, an ICMP ping request is sent out. If a response is received, a Management Agent Lost alarm is generated. If no response is received, a Device Has Stopped Responding to Polls alarm is generated.
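
The decision flow described above can be summarized in a short sketch. This only illustrates the logic as this article describes it and is not Spectrum's actual implementation; the function name `classify_contact_loss` and its boolean parameters are hypothetical.

```python
from typing import Optional


def classify_contact_loss(snmp_responded: bool, icmp_responded: bool) -> Optional[str]:
    """Return the alarm this article says is raised, or None if no alarm.

    snmp_responded: True if the device answered the SNMP poll within the
                    DCM timeout/retry window.
    icmp_responded: True if the device answered the follow-up ICMP ping.
    """
    if snmp_responded:
        # The SNMP poll succeeded, so no ICMP ping is needed and no alarm is raised.
        return None
    if icmp_responded:
        # The device answers ping but the SNMP agent is silent.
        return "MANAGEMENT AGENT LOST"
    # Neither SNMP nor ICMP got a response.
    return "DEVICE HAS STOPPED RESPONDING TO POLLS"


print(classify_contact_loss(False, True))   # prints "MANAGEMENT AGENT LOST"
print(classify_contact_loss(False, False))  # prints "DEVICE HAS STOPPED RESPONDING TO POLLS"
```

Both alarms start from the same condition (no SNMP response within the timeout/retry window), which is why raising the DCM Timeout affects both of them.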

There are times when the default 3000 ms timeout and 3 retries are not long enough for the agent to respond. In this case, a general recommendation is to increase the timeout to 10000 ms and leave the retries at 3, for a total of 30 seconds. This can be done in the Information view of the device, in the CA Spectrum Modeling Information area.

Increase the DCM Timeout (ms) attribute (0x110c4) to 10000 (10 seconds per attempt, 30 seconds in total with 3 retries)
Leave the DCM Retry Count at 3

Maximum DCM Timeout value = 60000 (60 sec / 1 minute)
Maximum DCM Retry Count = 10
Maximum Poll Interval = an integer of up to 9 digits, but a value that large is not recommended.
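
As a worked example of the arithmetic above, the following sketch multiplies the timeout by the retry count, which is how this article arrives at "a total of 30 seconds". The function name `total_wait_seconds` is hypothetical, and whether Spectrum counts the initial request separately from the retries is not stated here, so treat the result as an approximation.

```python
MAX_DCM_TIMEOUT_MS = 60000  # maximum DCM Timeout stated in this article
MAX_DCM_RETRY_COUNT = 10    # maximum DCM Retry Count stated in this article


def total_wait_seconds(timeout_ms: int, retry_count: int) -> float:
    """Approximate wait before an alarm: timeout multiplied by retry count."""
    # Only the upper limits given in the article are checked; it does not
    # state minimum values.
    if timeout_ms > MAX_DCM_TIMEOUT_MS:
        raise ValueError(f"DCM Timeout {timeout_ms} exceeds {MAX_DCM_TIMEOUT_MS} ms")
    if retry_count > MAX_DCM_RETRY_COUNT:
        raise ValueError(f"DCM Retry Count {retry_count} exceeds {MAX_DCM_RETRY_COUNT}")
    return timeout_ms * retry_count / 1000


print(total_wait_seconds(3000, 3))   # default values: 9.0
print(total_wait_seconds(10000, 3))  # recommended values: 30.0
```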


You can also modify these attributes on multiple models using the Attribute Editor and selecting the DCM attributes under the SNMP Communication folder (see below).

NOTE: These numbers are a general recommendation for Spectrum models generating an unusual amount of communication lost alarms and these changes may not be needed "across the board". For exact response times a sniffer trace needs to be used.

Find the problem models: 
To determine which models are generating an excessive number of these events, run the <SPECROOT>/perfCollector9 script from the command line. It runs a MySQL query against the DDM database to obtain the event counts for 0x10d35 (Device Has Stopped Responding to Polls) and 0x10daa (Management Agent Lost), along with the corresponding model handles.

From $SPECROOT as the spectrum install owner, run:

./perfCollector9 <name of the SS>

For example:

./perfCollector9 spectrum_SS1

This creates the following files: <SPECROOT>/Performance/<machinename>_0x10d35_summary.txt and <SPECROOT>/Performance/<machinename>_0x10daa_summary.txt
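
To locate the output after a run, the two file names can be built from the naming pattern above. This is a minimal sketch: the install root `/opt/SPECTRUM` and the machine name `spectrum_SS1` are placeholder assumptions, and `summary_paths` is a hypothetical helper, not part of Spectrum.

```python
from pathlib import PurePosixPath

# Event codes named in this article.
EVENT_CODES = ("0x10d35", "0x10daa")


def summary_paths(specroot: str, machine_name: str) -> list:
    """Build the expected summary file paths for both event codes."""
    performance_dir = PurePosixPath(specroot) / "Performance"
    return [
        performance_dir / f"{machine_name}_{code}_summary.txt"
        for code in EVENT_CODES
    ]


# Placeholder values; substitute your own $SPECROOT and machine name.
for path in summary_paths("/opt/SPECTRUM", "spectrum_SS1"):
    print(path)
```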

Attribute Editor: 

1. Use Locator Search to build a list of the models whose polling values you want to modify, whether by model handle, name, Global Collection, etc.
2. Right-click your selection and choose Attribute Editor:



3. Expand the SNMP Communication folder and move the DCM and Poll attributes to the right. Edit the values as needed, but do not select "Set as Default" for the new values (setting as default causes an issue with the fault isolation code).

Additional Information

TECH TIP:  How to define the amount time before an alarm is generated on a device? https://knowledge.broadcom.com/external/article?articleId=111749

 

If you obtain a list of models from perfCollector9, here is how you can change those models in bulk.

1. Put the model handles in a file, one per line.
2. Open Locator and create a new search: Attribute, Model Handle (0x129fa), Equal To. Select “Prompt When Launched” (on the right).
3. Click Launch, click the “List…” button, click the Import button, and select the “models_to_increase_DCM_timeout.txt” file.
4. Click OK, then click OK again in the window that shows “Using list of values…”. That launches the search, so click the “Cancel” button in the create search window.
5. Select all models (Ctrl+A), right-click, and choose Utilities, then Attribute Editor.
6. Expand SNMP Communication, move DCM Timeout to the right, and set it to 10000 (10 seconds). Do NOT select “Set As Default”!
7. Click Apply, then OK.

This tells Spectrum to give the devices 10 seconds to respond before generating contact lost alarms (and triggering fault isolation code).
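
The handle file used in the bulk procedure above can be produced with a short script. This is a sketch under stated assumptions: the model handles in the example are made up, the helper `write_handle_file` is hypothetical, and it assumes handles are written in hexadecimal with a `0x` prefix, as the identifiers in this article are.

```python
import re
from pathlib import Path

# Assumption: a model handle is "0x" followed by hexadecimal digits.
HANDLE_PATTERN = re.compile(r"^0x[0-9a-fA-F]+$")


def write_handle_file(handles, path):
    """Write unique model handles to a file, one per line; return the count."""
    cleaned = []
    for handle in handles:
        handle = handle.strip()
        if not HANDLE_PATTERN.match(handle):
            raise ValueError(f"not a hexadecimal model handle: {handle!r}")
        if handle not in cleaned:  # drop duplicates, keep original order
            cleaned.append(handle)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return len(cleaned)


# Made-up handles for illustration; use the ones from your summary files.
count = write_handle_file(
    ["0x100a3f", "0x100b12", "0x100a3f"],
    "models_to_increase_DCM_timeout.txt",
)
print(count)  # prints 2
```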