The DA, DC and DR intermittently lose connection with NetOps Portal, after which the DA must be restarted to restore the connection. However, connectivity is lost again after a short while.
DX NetOps Performance Management 20.2 or later
Check whether CPU usage is very high on both the DA and the DR.
For example, when running the top utility on the DA, if the java process for the dadaemon is running at 100% or greater (it can exceed 100% on multi-core systems), the DA is operating under high load:
top - 11:40:14 up 39 days, 14:42, 4 users, load average: 3.35, 3.37, 3.96
Tasks: 297 total, 1 running, 296 sleeping, 0 stopped, 0 zombie
%Cpu(s): 69.7 us, 0.9 sy, 0.0 ni, 29.0 id, 0.2 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32782064 total, 341364 free, 27948844 used, 4491856 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 4106832 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24715 root 20 0 29.249g 0.023t 17100 S 420.5 76.9 133:38.37 java
18334 root 20 0 10.739g 1.077g 14932 S 2.3 3.4 47:14.08 java
937 root 20 0 4368 588 496 S 0.7 0.0 603:01.14 rngd
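To confirm that the java process consuming the CPU is the one for the dadaemon, check its full command line. The PID below (24715) is taken from the sample output above; substitute the PID from your own system:
ps -fp 24715
The command line should reference the Data Aggregator install directory (for example /opt/CA/IMDataAggregator).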
Further, on the DR, the Vertica process is running at high CPU utilisation:
Tasks: 304 total, 1 running, 303 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 2.9 sy, 5.5 ni, 91.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 14837955+total, 16357184 free, 17397312 used, 11462504+buff/cache
KiB Swap: 3907580 total, 3907320 free, 260 used. 12707118+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
233736 dradmin 20 0 0.172t 0.015t 51512 S 100.3 10.7 48:02.54 vertica
1 root 20 0 191584 4376 2164 S 0.0 0.0 1:03.53 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.48 kthreadd
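If this needs to be captured for a support case, a single batch-mode snapshot of the processes owned by the database user (dradmin in this example) is usually enough:
top -b -n 1 -u dradmin | head -20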
Even though both are up and running, with no related errors in the logs, they are clearly under stress. Next, check the number of items in Vertica:
dauser=> select count(*) from v_item_facet;
count
----------
36343408
(1 row)
dauser=> select count(*) from v_item_facet where facet_qname LIKE '%}Retired';
count
---------
5840506
(1 row)
In the above example, there are over 36 million items, of which more than 5.8 million are retired. Such a large number places Vertica (the DR) under heavy load, which in turn cascades to the DA when it attempts to load these items into memory. As a result, both the DA and the DR (and, by extension, the DCs, which communicate through the DA) are too busy processing the large number of items to respond when NetOps Portal attempts to connect to the DA.
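If you want to see which facets account for the bulk of the items before cleaning anything up, a breakdown such as the following can help; it only uses the facet_qname column already referenced in the counts above:
dauser=> select facet_qname, count(*) from v_item_facet group by facet_qname order by 2 desc limit 10;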
To resolve this, stop the services on the DA:
service dadaemon stop
service activemq stop
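Before removing anything, confirm that both services have actually stopped, for example:
service dadaemon status
service activemq status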
Then delete the data directory on the DA under the apache-karaf-<version> directory. The path in a default install is:
/opt/CA/IMDataAggregator/apache-karaf-4.2.6/data
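If you prefer to keep a copy rather than deleting the directory outright, it can be moved aside instead (path shown for the sample default install above; adjust the karaf version to match your environment):
mv /opt/CA/IMDataAggregator/apache-karaf-4.2.6/data /opt/CA/IMDataAggregator/apache-karaf-4.2.6/data.old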
Then, on the DR, delete all the retired items as follows:
cd /opt/CA/IMDataRepository_vertica10
./caVerticaUtility.sh -u dauser -w <dapass> -s dauser -d /tmp/iExport -e
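Assuming the -e option writes an export of the item data to the directory given with -d (as the command above suggests), you can optionally confirm the export completed before continuing:
ls -lh /tmp/iExport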
cd /opt/vertica/bin/
./vsql -U dauser -w <dapass>
create table dauser.items_to_delete as select item_id from dauser.v_item_facet where facet_qname like '%}Retired';
delete from dauser.item where item_id in (select item_id from dauser.items_to_delete);
commit;
drop table dauser.items_to_delete;
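To verify the cleanup, re-run the earlier count from within vsql; the number of retired items should now have dropped to zero or near zero:
dauser=> select count(*) from v_item_facet where facet_qname LIKE '%}Retired';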
Finally, start the services on the DA again:
service activemq start
service dadaemon start
This clears out the retired items so that the DA no longer has to process them, which should allow it to come back up, reconnect to NetOps Portal and successfully synchronise.
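To watch the DA come back up, you can follow the karaf log, which is recreated under the data directory on startup (path assumes the sample default install above):
tail -f /opt/CA/IMDataAggregator/apache-karaf-4.2.6/data/log/karaf.log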