Data Collector and Data Aggregator connection problems in DX NetOps

Products

CA Performance Management Network Observability

Issue/Introduction

This problem shows up through a variety of symptoms. Most of them involve communication problems between the Data Aggregator and the Data Collectors.

Some customers report that the Data Aggregator and Data Collectors continuously show as failed and then as active and up again, fluctuating over and over throughout the day.

Why are the Data Collectors so often seen in a Catching Up state on the Data Collector Status pages of the DX NetOps Performance Management Portal web UI?

Some customers report Data Aggregator systems that appear to start and run, but never allow Data Collectors to reconnect and never synchronize with the Portal web UI.

Environment

All DX NetOps Performance Management releases r21.2.7 and earlier

Cause

An excessive number of Retired items, caused by excess Response Path Test items that are generated incorrectly due to a known defect. The fix for the defect is documented as follows:

Symptom: On Cisco devices, some of the IPSLA metric families are marked as supported and the components are created with empty name and index.
Resolution: With this fix, a key attribute in the Cisco IPSLA vendor certs has been added. You can now mark these metric families as Not Supported, and no components are created.
(21.2.8, DE519177, 32948384,32972973)

To confirm that this is the cause of your problem, first run the following vsql query. Is there an excessive number of Retired items (hundreds of thousands, often 2+ million)?

/opt/vertica/bin/vsql -Udradmin -W -c "select facet_qname, count(*) from <schemaName>.v_item_facet group by 1 order by 2 desc limit 30;"
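
As a rough aid for reading the result of that query, the following sketch sums the count of every facet whose name ends in "}Retired" and compares it with a threshold. The function names and the 100000 threshold are assumptions made for illustration (the threshold is taken from the "hundreds of thousands" guidance above). They are not part of the product.

```shell
# Hypothetical helper: reads the facet query output on stdin and prints the
# total count for facets whose name ends in "}Retired" (0 if none are found).
retired_count() {
  awk -F'|' '
    {
      name = $1; count = $2
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim the facet name
      gsub(/[ \t]/, "", count)            # strip padding around the count
      if (name ~ /[}]Retired$/ && count ~ /^[0-9]+$/) total += count
    }
    END { print total + 0 }
  '
}

# Assumed threshold; adjust to what is normal for your environment.
RETIRED_THRESHOLD=100000

# Succeeds (exit status 0) when the Retired count exceeds the threshold.
retired_is_excessive() {
  [ "$(retired_count)" -gt "$RETIRED_THRESHOLD" ]
}
```

To use it, pipe the output of the query above into retired_count or retired_is_excessive. Header, separator, and row-count lines are ignored because they do not match the facet name pattern.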

How many of those Retired items are problematic Response Path Items? Run this query. Is there an entry where the name value is blank but the count is high?

/opt/vertica/bin/vsql -Udradmin -W -c "select name,count(*) from <schemaName>.v_item i where exists (select NULL from <schemaName>.v_item_facet if1 where i.item_id=if1.item_id and if1.facet_qname like '%}Retired') and exists (select NULL from <schemaName>.v_item_facet if2 where i.item_id=if2.item_id and if2.facet_qname like '%}ResponsePath%') group by 1 order by 2 desc limit 10;"

Example output is as follows. Note the empty name value with a high count in the first row.

                         name                         |  count  
------------------------------------------------------+---------
                                                      | 3423322
 Cisco Rttmon ICMP: 0.0.0.0-X.X.X.X : 500             |       3
 Cisco Rttmon Jitter Precision: - : 500               |       2
 Cisco Rttmon PathEcho: 0.0.0.0-X.X.X.X : 12789       |       2
 Cisco Rttmon Jitter Precision: - : 502               |       2
 Cisco Rttmon PathEcho: 0.0.0.0-X.X.X.X: 18064        |       1
 Cisco Rttmon PathEcho: 0.0.0.0-X.X.X.X : 11889       |       1
 Cisco Rttmon PathEcho: 0.0.0.0-X.X.X.X : 2420        |       1
 Cisco Rttmon PathEcho: 0.0.0.0-X.X.X.X : 2815        |       1
 Cisco Rttmon PathEcho: 0.0.0.0-X.X.X.X : 12619       |       1
(10 rows)
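
For long result sets, the blank-name row can be picked out with a small filter. This is a minimal sketch that assumes the default aligned vsql output shown above, with "|" separating the name and count columns. The function name is hypothetical.

```shell
# Hypothetical helper: reads the Response Path query output on stdin and
# prints the count from the row whose name column is blank (0 if none).
blank_name_count() {
  awk -F'|' '
    NF == 2 {
      name = $1; count = $2
      gsub(/[ \t]/, "", name)    # a blank name is empty once spaces are removed
      gsub(/[ \t]/, "", count)
      if (name == "" && count ~ /^[0-9]+$/) total += count
    }
    END { print total + 0 }
  '
}
```

Pipe the output of the second query into blank_name_count. A result in the hundreds of thousands or millions suggests the defect described in the Cause section.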

For those vsql queries:

Run them as the dradmin user or an equivalent OS user.
Run them from the default location, /opt/vertica/bin.
Enter the password when prompted. It is the same password used to stop and start the DR DB through the adminTools UI.
Replace <schemaName> with the DB schema name. If the schema name is unknown, run this command and note the value in the Schema column (the first column).
- /opt/vertica/bin/vsql -Udradmin -W -c "\d"
- Replace every instance of <schemaName> in the above commands with the value identified.
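
The placeholder replacement can also be scripted so that the schema name is typed only once. The following is a sketch only: the fill_schema helper and the "myschema" value are hypothetical, and the query text is copied from the first query above.

```shell
# Query template with the <schemaName> placeholder left intact.
FACET_QUERY='select facet_qname, count(*) from <schemaName>.v_item_facet group by 1 order by 2 desc limit 30;'

# Hypothetical helper: prints the template given as $1 with every
# <schemaName> placeholder replaced by the schema name given as $2.
fill_schema() {
  printf '%s\n' "$1" | sed "s/<schemaName>/$2/g"
}

# Example invocation (not run here); "myschema" stands in for the value
# found in the Schema column of the \d output:
# /opt/vertica/bin/vsql -Udradmin -W -c "$(fill_schema "$FACET_QUERY" myschema)"
```

The same helper works for the second query, since it uses the same placeholder.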

Resolution

The best resolution is to upgrade to 21.2.8 or newer. The upgrade resolves the known defect that causes this problem. It also adds automated nightly removal of not present items.

The steps to remove the excess Response Path items are the same as the steps to manually remove excessive Retired items.

Stop the DA.
Take an iRep backup of the DR DB in case anything goes wrong.
Remove the Retired items using the vsql commands shown in the Knowledge Base article referenced below.
Run the cleanupDeletedItems.sh script on the DR DB. The script is attached to the KB article.
Start the DA.
If needed, run a Full Sync of the DA Data Source. To determine whether it is needed:
1. Do you synchronize not present items to the NetOps Portal inventory?
  1. Check by going to Administration->Data Sources->Data Sources.
  2. Open the Edit UI for the Data Aggregator Data Source.
  3. Is the "Synchronize component items that are not currently present on the monitored device" option selected and enabled?
    1. If it is enabled, launch a Full Sync of the DA Data Source after the DA is restarted to clean up the PC (NetOps Portal) inventory. Note that this might slow down PC for a while as it churns through the cleanup of that many items.
    2. If it is not enabled, no further action is needed.

To remove the Retired items, follow the steps in this Knowledge Base article:

Data Aggregator (DA), Data Collector (DC) and Data Repository (DR) intermittently losing connection with CA Performance Management (NetOps Portal)

Additional Information

Do you have excess Retired items but don't see the problematic excess Response Path Test items? The steps in the following Knowledge Base article can be used to clean up the excess Retired items.

Data Aggregator (DA), Data Collector (DC) and Data Repository (DR) intermittently losing connection with CA Performance Management (NetOps Portal)

Until you can upgrade to a release that includes the fix, the long-term recommendation is to schedule the remove_not_present_items.sh script. Run it on a nightly basis to maintain a healthy system.
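
One way to schedule a nightly run is a cron entry. The sketch below only builds and prints such an entry. The script path, the 02:00 run time, and the log file location are all assumptions, and any arguments the script requires are not shown; check the script's own usage before scheduling it.

```shell
# Assumed location of the script; adjust to where it is installed.
SCRIPT_PATH="/path/to/remove_not_present_items.sh"

# Assumed log file for the nightly run.
LOG_FILE="/var/tmp/remove_not_present_items.log"

# Cron fields: minute hour day-of-month month day-of-week.
# "0 2 * * *" means every day at 02:00 (an arbitrary choice).
CRON_LINE="0 2 * * * ${SCRIPT_PATH} >> ${LOG_FILE} 2>&1"

printf '%s\n' "$CRON_LINE"

# To append the entry to the current user's crontab (not run here):
# ( crontab -l 2>/dev/null; printf '%s\n' "$CRON_LINE" ) | crontab -
```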