Polling does not resume after short hiccup

Products

CA Performance Management - Usage and Administration DX NetOps

Issue/Introduction

We have a few (9) devices that had their polling dropped for less than 1 minute because of a network hiccup. Spectrum noticed this as well.

As a result, Device Polling Statistics alarms (Rule Name: Data Collector Dropped Poll Request) are generated for those 9 devices.

Although the issue lasted only a few seconds, polling does not resume in PM (there are no problems in Spectrum). The devices have "management agent lost" status in PM. This is incorrect, as the devices can be polled without problems.

We've seen this issue multiple times in the past. The only resolution is to stop and start polling manually for the devices. Sometimes these events go unnoticed for days, which causes huge data gaps.

Running snmpwalk from the command line shows that the devices respond normally.
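
The manual snmpwalk check mentioned above could be scripted so that affected devices are noticed sooner. The following is only a minimal sketch: it assumes the Net-SNMP `snmpwalk` binary is installed on the host running the check and that the devices use SNMPv2c. The function names, the host address and the community string are hypothetical placeholders, not part of the product.

```python
import subprocess

# Standard MIB-2 sysUpTime OID; any OID the device is known to answer would do.
SYS_UPTIME_OID = "1.3.6.1.2.1.1.3"


def build_snmpwalk_cmd(host, community, timeout_s=2, retries=1):
    """Build the argument list for a short Net-SNMP snmpwalk (SNMPv2c assumed)."""
    return [
        "snmpwalk",
        "-v", "2c",
        "-c", community,
        "-t", str(timeout_s),   # per-request timeout in seconds
        "-r", str(retries),     # number of retries
        host,
        SYS_UPTIME_OID,
    ]


def device_answers(host, community):
    """Return True if snmpwalk exits cleanly and prints at least one line."""
    cmd = build_snmpwalk_cmd(host, community)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # snmpwalk is not installed, or the call hung.
        return False
    return result.returncode == 0 and bool(result.stdout.strip())


if __name__ == "__main__":
    # Hypothetical device address and community string.
    print(device_answers("192.0.2.10", "public"))
```

A device that answers this check while PM still reports it as lost is a candidate for the stop/start polling workaround described above.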

Environment

All NetOps Performance Management releases

Cause

The firewall blocks the responses.

Resolution

The connections were already present in the connection table when a policy push was performed for that VSX FireWall.

The connection persistency for that VSX FireWall is set to “Rematch connections”.

Because the DC server did not re-initiate the connections, they were never rematched.

Additional Information

For each new connection, the FireWall evaluates the flow against our policy. If the flow is allowed, it is stored in the connection table.

Connections that are listed in the connection table do not need to be rematched against the policy.

By default, without keepalive, a connection has a TTL of 3600 seconds (1 hour) in the connection table.

The SpectroServers poll every 5 minutes, if I remember correctly, which resets the TTL for the connection indefinitely (the new polls keep the connection alive, so no keepalives need to be sent to maintain it).

The FireWall that handles that traffic is set to rematch every connection against the policy whenever a new policy is pushed to the FireWall (persistency policy).
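
The TTL arithmetic above can be checked with a short simulation. This is a simplified sketch, not the firewall's actual logic: it assumes every poll resets the entry's TTL to the full 3600 seconds, and the function name and parameters are made up for illustration.

```python
def entry_survives(poll_interval_s, ttl_s=3600, duration_s=86400):
    """Return True if the connection-table entry never expires.

    Assumes the first packet at t=0 creates the entry and every later
    poll that arrives before expiry resets the TTL to ttl_s.
    """
    expires_at = ttl_s          # entry created at t=0
    t = poll_interval_s         # time of the next poll
    while t <= duration_s:
        if t > expires_at:
            # The entry timed out before this poll arrived.
            return False
        expires_at = t + ttl_s  # the poll refreshes the entry
        t += poll_interval_s
    return True


if __name__ == "__main__":
    print(entry_survives(300))    # 5-minute polling
    print(entry_survives(7200))   # polling slower than the TTL
```

With a 300-second poll interval the entry is always refreshed long before the 3600-second TTL runs out, so the same connection-table entry can persist indefinitely.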

Only connections that go through the policy decision making may be allowed through the FireWall.

Because the SNMP polling connection was still present in the connection table, it was still “accepted”, but it was dropped further down the line because it had not been rematched.

To allow the flow to be rematched against the policy, I had to kill the sessions that were present in the connection table.

To prevent such incidents from recurring, I cloned the services for the flow and used an option to override the persistency setting.

SNMP polls no longer need to be rematched against the policy when a new policy is pushed to that FireWall, even if the persistency setting would otherwise require a rematch.
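
The sequence described above (policy push, stale entry, killed session, persistency override) can be illustrated with a toy model. This is purely a sketch of the behaviour as described in this article and not the firewall vendor's implementation; the class, method and service names are all hypothetical.

```python
class ToyFirewall:
    """Toy model of a connection table with rematch-on-policy-push."""

    def __init__(self, allowed_services, keep_on_push=()):
        self.allowed = set(allowed_services)
        # Services whose entries are exempt from rematching (the override).
        self.keep_on_push = set(keep_on_push)
        self.table = {}  # flow -> "matched" or "stale"

    def push_policy(self):
        """Mark existing entries as needing a rematch, unless exempt."""
        for flow in self.table:
            service = flow[2]
            if service not in self.keep_on_push:
                self.table[flow] = "stale"

    def kill_session(self, flow):
        """Remove an entry so the next packet is treated as a new connection."""
        self.table.pop(flow, None)

    def send(self, flow):
        """Return True if a packet on this flow gets through."""
        state = self.table.get(flow)
        if state is None:
            # New connection: evaluated against the policy.
            if flow[2] in self.allowed:
                self.table[flow] = "matched"
                return True
            return False
        # Existing entry: only passes if it does not await a rematch.
        return state == "matched"


if __name__ == "__main__":
    flow = ("dc-server", "device", "snmp")  # hypothetical flow tuple

    fw = ToyFirewall({"snmp"})
    print(fw.send(flow))   # first poll creates the entry
    fw.push_policy()
    print(fw.send(flow))   # same flow reused after the push
    fw.kill_session(flow)
    print(fw.send(flow))   # next poll is matched as a new connection

    fw_override = ToyFirewall({"snmp"}, keep_on_push={"snmp"})
    fw_override.send(flow)
    fw_override.push_policy()
    print(fw_override.send(flow))  # entry kept across the push
```

In this model, the poller keeps reusing the same flow, so the stale entry is never replaced until the session is killed, which mirrors why polling did not recover by itself.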