Data aggregator doesn't start anymore
The software keeps printing "karaf attribute loading is still in progress" in the log.
karaf.log contains many instances of the following errors:
ERROR | p1233910105-1525 | 2020-09-10 21:39:58,794 | ODataQueryAuthenticationFilter | odata.filters.AOpenAPIBaseFilter 69 | data-services.odataquery | | ODataQueryAuthenticationFilter: Authentication Service is not available
And
INFO | p1233910105-1646 | 2020-09-10 21:56:01,354 | WebServiceExceptionMapper | t.impl.WebServiceExceptionMapper 55 | .ca.im.web-services.impl | | A web service error has occurred: Unable to find the specified URL[metricFamilies]
This essentially means the issue is on the Data Repository (DR): Vertica is under excessive load and cannot respond to the Data Aggregator (DA) in a timely manner.
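To gauge how often these messages occur, a minimal Python sketch such as the following could count them in karaf.log. The function name `count_signatures` and the signature list are illustrative assumptions, not part of the product; the substrings are copied from the log excerpts above.

```python
from collections import Counter

# Substrings copied from the log excerpts in this article (assumed to be
# stable enough to match on); extend the tuple as needed.
SIGNATURES = (
    "attribute loading is still in progress",
    "Authentication Service is not available",
    "Unable to find the specified URL",
)


def count_signatures(lines):
    """Count how many log lines contain each signature (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for signature in SIGNATURES:
            if signature.lower() in lowered:
                counts[signature] += 1
    return counts
```

For example, `count_signatures(open(path, errors="replace"))` would scan a whole file, where `path` is wherever karaf.log lives on the DA host (not stated in this article).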
Restart the DR (Vertica) via adminTools:
As dradmin user:
/opt/vertica/bin/adminTools
Then stop the database.
Then stop the DA (including activemq). After that, delete the karaf data directory:
rm -rf /opt/IMDataAggregator/apache-<VERSION>/data
Then restart the services: first activemq, then dadaemon.
The DA logs show "Attribute loading is still in progress - Load time: 1:22:19.911".
After about 3 hours, CPU usage becomes heavy and the load fails with the exception below:
ERROR | xtenderThread-97 | 2020-09-11 00:07:33,379 | ExceptionLog | .ca.im.core.util.ExceptionLogger 99 | m.ca.im.common.core.util | | A NEW application exception occurred (Key=d55cc72aad3b152ebb3002de3d4415b6ff3959d8)
: loadItems failed : Failed to load attribute data.
com.ca.im.item.db.DbException: Failed to load attribute data.
The outputs of the DR diagnostics vcpuperf, vnetperf and vioperf were requested.
Two issues were found:
1. Check the latency between the DR nodes and also the DA.
The maximum recommended RTT latency is 2 milliseconds,
but the output shows about 82.9 and 83.9 milliseconds (82868 and 83884 us) for two of the nodes.
Clock skew should also be less than 1 second;
the output shows about 4.5 and 4.2 milliseconds (4494 and 4200 us) for the same nodes, which is within that limit.
test | date | node | index | rtt latency (us) | clock skew (us)
-------------------------------------------------------------------------------------------------------------------------
latency | 2020-09-11_04:30:47,249 | xx.xx.xx.66 | 0 | 82868 | 4494
latency | 2020-09-11_04:30:47,249 | xx.xx.xx.67 | 1 | 83884 | 4200
latency | 2020-09-11_04:30:47,249 | xx.xx.xx.68 | 2 | 40 | 2
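The unit conversion is easy to get wrong by eye, so a short Python sketch can do the comparison instead. It assumes the pipe-separated layout shown in the table above; `check_latency` and the constant names are hypothetical, and the limits are the ones quoted in this article.

```python
RTT_LIMIT_US = 2_000        # 2 milliseconds, as quoted above
SKEW_LIMIT_US = 1_000_000   # 1 second, as quoted above


def check_latency(report):
    """Return (node, rtt_us, skew_us, rtt_ok, skew_ok) for each latency row."""
    results = []
    for line in report.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 6 or fields[0] != "latency":
            continue  # skip the header, separator, and rows of other tests
        node = fields[2]
        rtt_us = int(fields[4])
        skew_us = abs(int(fields[5]))
        results.append(
            (node, rtt_us, skew_us, rtt_us <= RTT_LIMIT_US, skew_us <= SKEW_LIMIT_US)
        )
    return results


SAMPLE = """\
test | date | node | index | rtt latency (us) | clock skew (us)
latency | 2020-09-11_04:30:47,249 | xx.xx.xx.66 | 0 | 82868 | 4494
latency | 2020-09-11_04:30:47,249 | xx.xx.xx.67 | 1 | 83884 | 4200
latency | 2020-09-11_04:30:47,249 | xx.xx.xx.68 | 2 | 40 | 2
"""

for node, rtt, skew, rtt_ok, skew_ok in check_latency(SAMPLE):
    # Values are reported in microseconds; print them in milliseconds.
    print(
        f"{node}: rtt={rtt / 1000:.3f} ms ({'ok' if rtt_ok else 'too high'}), "
        f"skew={skew / 1000:.3f} ms ({'ok' if skew_ok else 'too high'})"
    )
```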
2. TCP throughput should be at least 100 MB/s, but the highest value in the vnetperf output is 56 MB/s.
2020-09-11_04:31:07,816 | tcp-throughput | 256 | average | 56.0467 | 55.8549 | 77332480 | 77277866 | 1.3228
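The same kind of check can be sketched for the throughput row. The column meanings used here (date, test, rate limit, node, MB/s sent, MB/s received, bytes sent, bytes received, duration) are an assumption, since the excerpt above has no header row; `parse_throughput` and `meets_minimum` are hypothetical helper names.

```python
MIN_THROUGHPUT_MBPS = 100.0  # minimum quoted in this article


def parse_throughput(line):
    """Parse one pipe-separated throughput row; return None if it does not match.

    Assumed column order (not shown in the excerpt): date, test, rate limit,
    node, MB/s sent, MB/s received, bytes sent, bytes received, duration.
    """
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 9 or not fields[1].endswith("throughput"):
        return None
    return {
        "test": fields[1],
        "node": fields[3],
        "sent_mbps": float(fields[4]),
        "recv_mbps": float(fields[5]),
    }


def meets_minimum(row, minimum=MIN_THROUGHPUT_MBPS):
    """True only if both directions reach the minimum throughput."""
    return row["sent_mbps"] >= minimum and row["recv_mbps"] >= minimum


ROW = ("2020-09-11_04:31:07,816 | tcp-throughput | 256 | average | "
       "56.0467 | 55.8549 | 77332480 | 77277866 | 1.3228")

parsed = parse_throughput(ROW)
print(parsed, "meets minimum:", meets_minimum(parsed))
```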
Release : Any CAPM version
Component : IM Data Aggregator
Communication is affected and connections to the DR servers are failing.
The latency and throughput need to be fixed on the network side and brought to the recommended levels.