Majority of Performance Management device polls suddenly time out

Article ID: 252539

Products

CA Performance Management - Usage and Administration

DX NetOps

Issue/Introduction

We believe the Data Collector crashed, or something else happened, that caused Spectrum to be flooded with "device polling statistics threshold violation" alarms.

When we checked the "calculated metrics per second" graph in the DX NetOps Portal, we did indeed see a serious drop.
At this point, we don't think the cause is an issue on the customer's network, because polling returned to a normal state only after the dcmd/activemq service was stopped and started.

 

Environment

Release : Any supported release

Resolution

The logs indicate that SNMP requests are sent and then time out:

2022-10-12T12:20:00,002 | WARN  |  300000-thread-1 | ThrottleCounterManager           | or.common.ThrottleCounterManager  279 | 39 - com.ca.im.data-collection-manager.core.common - 21.2.12.RELEASE-457 |  | Polls not sent due to TIMEOUT: /10.x.x.x=3760
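To see which devices are affected most, the timeout counts can be pulled out of these warnings. The following is a minimal sketch that assumes the message format shown above (`/<address>=<count>` after `TIMEOUT:`). The function name `timeout_counts` is hypothetical, and the log file path is passed in as an argument because this article does not state where the log is located.

```shell
# Hypothetical helper: print "<device> <count>" for every timeout entry
# found in ThrottleCounterManager warnings in the given log file.
# Assumes entries look like "/10.x.x.x=3760", as in the sample line above.
timeout_counts() {
    # Keep only the text after the "TIMEOUT:" marker on matching lines.
    sed -n 's/.*Polls not sent due to TIMEOUT: *//p' "$1" \
        | grep -oE '/[^ =,]+=[0-9]+' \
        | sed -E 's|^/||; s|=| |'
}

# Usage (the path is a placeholder, not a documented location):
#   timeout_counts /path/to/data-collector.log | sort -k2,2nr | head
```

Sorting by the second column, as in the usage comment, puts the devices with the highest timeout counts first, which helps when choosing a device for a packet capture.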

The next step would be to collect some traffic with tcpdump to investigate what is actually happening on the wire, including whether the SNMP requests leave the Data Collector (DC) at all.

If you see this again, choose an affected device to focus on and collect traffic for at least a couple of poll cycles for further analysis:

tcpdump -envi any -w /tmp/device_poll_timing_out.pcap host <IP>
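Once the capture exists, a first check is whether requests to the device appear in it and whether any responses come back. The sketch below counts both directions from the text output of `tcpdump -nr`. It assumes the device is polled on the default SNMP port 161, which may not match your SNMP profile, and the function name `snmp_direction_counts` is hypothetical.

```shell
# Hypothetical helper: read `tcpdump -nr <pcap>` text output on stdin and
# count SNMP packets sent to, and received from, the device IP given as $1.
# Assumes UDP port 161; adjust the port if the device is polled elsewhere.
snmp_direction_counts() {
    awk -v ip="$1" '
        index($0, " > " ip ".161:") { sent++ }      # packet toward the device
        index($0, " " ip ".161 > ") { received++ }  # packet from the device
        END { printf "sent=%d received=%d\n", sent, received }
    '
}

# Usage (replace <IP> with the address of the affected device):
#   tcpdump -nr /tmp/device_poll_timing_out.pcap 'udp port 161' \
#       | snmp_direction_counts <IP>
```

If `sent` is zero, the requests are not leaving the Data Collector. If `sent` is non-zero and `received` is zero, the requests leave but no replies arrive. Because the capture uses `-i any`, the same packet can be recorded on more than one interface, so treat the numbers as relative rather than exact.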