The DA, DC and DR intermittently lose connection with NetOps Portal, after which the DA must be restarted to restore the connection. However, connectivity is lost again after a short while.
DX NetOps Performance Management 20.2 or later
Check whether CPU usage is very high on both the DA and the DR.
For example, when running the top utility on the DA, if the java process for the dadaemon is running at 100% or greater (it can exceed 100% on multi-core systems), the DA is operating under high load:
top - 11:40:14 up 39 days, 14:42, 4 users, load average: 3.35, 3.37, 3.96
Tasks: 297 total, 1 running, 296 sleeping, 0 stopped, 0 zombie
%Cpu(s): 69.7 us, 0.9 sy, 0.0 ni, 29.0 id, 0.2 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32782064 total, 341364 free, 27948844 used, 4491856 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 4106832 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24715 root 20 0 29.249g 0.023t 17100 S 420.5 76.9 133:38.37 java
18334 root 20 0 10.739g 1.077g 14932 S 2.3 3.4 47:14.08 java
937 root 20 0 4368 588 496 S 0.7 0.0 603:01.14 rngd
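To confirm that the java process consuming the CPU is the one for the dadaemon, check its full command line. The PID below (24715) is taken from the sample output above; substitute the PID from your own system:
ps -fp 24715
The command line should reference the Data Aggregator install directory (for example /opt/CA/IMDataAggregator).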
Further, on the DR, the Vertica process is running at high CPU utilisation:
Tasks: 304 total, 1 running, 303 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 2.9 sy, 5.5 ni, 91.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 14837955+total, 16357184 free, 17397312 used, 11462504+buff/cache
KiB Swap: 3907580 total, 3907320 free, 260 used. 12707118+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
233736 dradmin 20 0 0.172t 0.015t 51512 S 100.3 10.7 48:02.54 vertica
1 root 20 0 191584 4376 2164 S 0.0 0.0 1:03.53 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.48 kthreadd
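If this needs to be captured for a support case, a single batch-mode snapshot of the processes owned by the database user (dradmin in this example) is usually enough:
top -b -n 1 -u dradmin | head -20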
Even though both are up and running, with no related errors in the logs, they are clearly under stress. Next, check the number of items in Vertica:
dauser=> select count(*) from v_item_facet;
count
----------
36343408
(1 row)
dauser=> select count(*) from v_item_facet where facet_qname LIKE '%}Retired';
count
---------
5840506
(1 row)
In the above example, there are over 36 million items, of which more than 5.8 million are retired. Such a large number places Vertica (the DR) under heavy load, which in turn cascades to the DA when it attempts to load these items into memory. As a result, both the DA and the DR (and, by extension, the DCs, which communicate through the DA) are too busy processing the large number of items to respond when NetOps Portal attempts to connect to the DA.
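If you want to see which facets account for the bulk of the items before cleaning anything up, a breakdown such as the following can help; it only uses the facet_qname column already referenced in the counts above:
dauser=> select facet_qname, count(*) from v_item_facet group by facet_qname order by 2 desc limit 10;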
To resolve this, stop the services on the DA:
service dadaemon stop
service activemq stop
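Before removing anything, confirm that both services have actually stopped, for example:
service dadaemon status
service activemq status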
Then delete the data directory on the DA under the apache-karaf-<version> directory. The path in a default install is:
/opt/CA/IMDataAggregator/apache-karaf-4.2.6/data
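If you prefer to keep a copy rather than deleting the directory outright, it can be moved aside instead (path shown for the sample default install above; adjust the karaf version to match your environment):
mv /opt/CA/IMDataAggregator/apache-karaf-4.2.6/data /opt/CA/IMDataAggregator/apache-karaf-4.2.6/data.old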
Then, on the DR, delete all the retired items as follows:
cd /opt/CA/IMDataRepository_vertica10
./caVerticaUtility.sh -u dauser -w <dapass> -s dauser -d /tmp/iExport -e
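Assuming the -e option writes an export of the item data to the directory given with -d (as the command above suggests), you can optionally confirm the export completed before continuing:
ls -lh /tmp/iExport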
cd /opt/vertica/bin/
./vsql -U dauser -w <dapass>
create table dauser.items_to_delete as select item_id from dauser.v_item_facet where facet_qname like '%}Retired';
delete from dauser.item where item_id in (select item_id from dauser.items_to_delete);
commit;
drop table dauser.items_to_delete;
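To verify the cleanup, re-run the earlier count from within vsql; the number of retired items should now have dropped to zero or near zero:
dauser=> select count(*) from v_item_facet where facet_qname LIKE '%}Retired';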
Finally, start the services on the DA again:
service activemq start
service dadaemon start
This clears out the retired items so that the DA no longer has to process them, which should allow it to come back up, reconnect to NetOps Portal and successfully synchronise.
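To watch the DA come back up, you can follow the karaf log, which is recreated under the data directory on startup (path assumes the sample default install above):
tail -f /opt/CA/IMDataAggregator/apache-karaf-4.2.6/data/log/karaf.log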