False "DEVICE HAS STOPPED RESPONDING TO POLLS" seen on Cisco ASR 9K device models in Spectrum
Article ID: 270721
DX NetOps, CA Spectrum
On the Cisco ASR 9K platform, the SNMP server service can occasionally crash. It restarts automatically after a crash, so this should not be an issue as long as the SNMP poller retries the request.
Because of these crashes, however, we see a large number of false positives for "DEVICE HAS STOPPED RESPONDING TO POLLS".
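The retry behaviour described above can be illustrated with a minimal Python sketch. It is not Spectrum's polling code: `poll_with_retries`, `poll_once`, and the retry and delay values are hypothetical names and numbers chosen for illustration. The idea is that a device is only treated as unresponsive after every attempt has failed, so a brief SNMP agent restart does not count as an outage.

```python
import time
from typing import Callable, Optional


def poll_with_retries(
    poll_once: Callable[[], Optional[str]],
    retries: int = 3,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if any attempt gets a response, False if all attempts fail.

    poll_once is a hypothetical callable that performs one SNMP request and
    returns the response, or None on timeout.
    """
    for attempt in range(1 + retries):
        if poll_once() is not None:
            return True
        # Wait before the next attempt so a restarting agent has time to recover.
        if attempt < retries:
            sleep(delay_s)
    return False
```

With a stub that times out once and then answers, `poll_with_retries` returns True on the second attempt, so no alarm would be raised for that polling cycle.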
Previously we also saw a large number of "CHASSIS DOWN" and "BLADE STATUS UNKNOWN" but they disappeared after I disabled EnableEntityModuleModeling on these devices.
In every single case, if you poll the device manually from Spectrum, the poll succeeds and the event clears. If we do not poll it manually, the event normally clears within 1 or 2 automatic polls.
I have tried to mitigate this by setting a high timeout value and a polling interval of 600. However, I am aware that, in addition to the scheduled polling cycle, Spectrum also polls the device in response to user clicks in OneClick and when it updates other information collected from the device.
We currently work around this with a 4-minute delay filter on all alarms from ASR 9K nodes, but this is an operational risk and we would like to eliminate these false positives entirely.
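The logic of such a delay filter can be sketched in Python as follows. This is only an illustration of the hold-down idea, not the filter actually configured in Spectrum: the `Alarm` class, the `should_forward` function, and the use of plain timestamps in seconds are assumptions made for the example. Only the 240-second window comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional

HOLD_DOWN_S = 240  # the 4-minute delay described above


@dataclass
class Alarm:
    device: str
    raised_at: float                    # seconds, any monotonic time base
    cleared_at: Optional[float] = None  # None while the alarm is still active


def should_forward(alarm: Alarm, now: float, hold_down_s: float = HOLD_DOWN_S) -> bool:
    """Forward an alarm only if it outlived the hold-down window."""
    # Cleared inside the window: treat it as a transient false positive.
    if alarm.cleared_at is not None and alarm.cleared_at - alarm.raised_at < hold_down_s:
        return False
    # Otherwise forward once the window has elapsed.
    return now - alarm.raised_at >= hold_down_s
```

In this sketch an alarm that clears after 120 seconds is never forwarded, while one that is still active at 240 seconds is. A genuine outage is held back for the same 240 seconds, which appears to be the operational risk mentioned above.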