This problem shows up through a variety of symptoms. Most involve communication problems between the Data Aggregator and the Data Collectors.
Some customers report that the Data Aggregator and Data Collectors repeatedly show as failed and then as active and up again, fluctuating throughout the day.
Why are the Data Collectors so often seen in a Catching Up state on the Data Collector Status pages of the DX NetOps Performance Management Portal web UI?
Some customers report Data Aggregator systems that appear to start and run, but never allow Data Collectors to reconnect and never synchronize with the Portal web UI.
Excessive Retired items, caused by excess Response Path Test items that were generated incorrectly because of a known defect. The defect is referenced in the 21.2.8 Fixed Issues documentation.
To confirm this is the cause of your problem, first run the following Vsql query. Is there an excessive number of Retired items (hundreds of thousands, often 2+ million)?
How many of those Retired items are problematic Response Path Test items? Run this query and check whether there is an entry where the name value is blank but the count is high.
                         name                         |  count
------------------------------------------------------+---------
                                                      | 3423322
 Cisco Rttmon ICMP: 0.0.0.0-220.127.116.11 : 500      |       3
 Cisco Rttmon Jitter Precision: - : 500               |       2
 Cisco Rttmon PathEcho: 0.0.0.0-10.220.14.246 : 12789 |       2
 Cisco Rttmon Jitter Precision: - : 502               |       2
 Cisco Rttmon PathEcho: 0.0.0.0-10.220.14.246 : 18064 |       1
 Cisco Rttmon PathEcho: 0.0.0.0-10.220.14.246 : 11889 |       1
 Cisco Rttmon PathEcho: 0.0.0.0-10.220.14.246 : 2420  |       1
 Cisco Rttmon PathEcho: 0.0.0.0-10.220.14.246 : 2815  |       1
 Cisco Rttmon PathEcho: 0.0.0.0-10.220.14.246 : 12619 |       1
(10 rows)
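If the query output is saved to a file or captured from a terminal, the blank-name check can be automated. The following is a minimal Python sketch, not part of the product. It assumes the default aligned Vsql output format with a name column and a count column, as in the result above. The function names and the abbreviated sample are illustrative only.

```python
def parse_vsql_counts(output):
    """Parse 'name | count' rows from aligned Vsql output.

    Assumes the count is the last column. The separator line and the
    '(N rows)' footer contain no '|' and are skipped; the header row is
    skipped because its count column is not numeric.
    """
    rows = []
    for line in output.splitlines():
        if "|" not in line:
            continue
        name, _, count = line.rpartition("|")
        count = count.strip()
        if not count.isdigit():
            continue
        rows.append((name.strip(), int(count)))
    return rows


def blank_name_count(rows):
    """Return the total count of rows whose name value is blank."""
    return sum(count for name, count in rows if not name)


# Abbreviated sample based on the result shown above.
sample = """\
                         name                         |  count
------------------------------------------------------+---------
                                                      | 3423322
 Cisco Rttmon ICMP: 0.0.0.0-220.127.116.11 : 500      |       3
 Cisco Rttmon Jitter Precision: - : 500               |       2
(3 rows)
"""

if __name__ == "__main__":
    rows = parse_vsql_counts(sample)
    print("Retired items with a blank name:", blank_name_count(rows))
```

A large blank-name total relative to the named rows matches the pattern described in this article.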
For those Vsql queries:
All DX NetOps Performance Management releases r21.2.7 and earlier
The best resolution is to upgrade to 21.2.8 or newer. The newer code resolves the known defect that causes this problem. It also adds automated nightly removal of not present items.
The steps to remove the excess Response Path Items are the same steps to manually remove excessive Retired items.
To remove the Retired items, follow the steps in the Knowledge Base article:
Do you have excess Retired items but don't see the problematic excess Response Path Test items? The steps in the following Knowledge Base article can be used to clean up the excess Retired items.
As a long-term recommendation, until you upgrade to a release that includes the fix, schedule the remove_not_present_items.sh script to run nightly to maintain a healthy system. See the Delete Components That Are Not Present documentation topic for more information on running the script.
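One common way to schedule a nightly run on Linux is a cron entry. The Python sketch below only builds the crontab line so it can be reviewed before it is installed. Everything specific in it is an assumption: the script path `/path/to/remove_not_present_items.sh` is a placeholder, the 02:00 run time and the log file location are arbitrary, and any arguments the script requires are not shown here and must be taken from the documentation topic.

```python
import shlex


def nightly_cron_line(script_path, script_args=(), hour=2, minute=0,
                      log_path="/tmp/remove_not_present_items.log"):
    """Build a crontab line that runs a script once a day.

    script_path and script_args are supplied by the caller; this helper
    does not know which arguments remove_not_present_items.sh expects.
    Note that cron treats '%' specially, so avoid it in paths and arguments.
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Quote each part so spaces in paths or arguments survive the shell.
    command = " ".join(shlex.quote(part) for part in (script_path, *script_args))
    # Append both stdout and stderr to the log file.
    return f"{minute} {hour} * * * {command} >> {shlex.quote(log_path)} 2>&1"


if __name__ == "__main__":
    # Placeholder path; replace it with the real location of the script.
    print(nightly_cron_line("/path/to/remove_not_present_items.sh"))
```

The resulting line can be added with `crontab -e` for the user that normally runs the script. Running the script once by hand first is a reasonable way to confirm that the arguments are correct.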