Data aggregator Down - Not able to start - karaf attribute loading is still in progress - A web service error has occurred: Unable to find the specified URL[metricFamilies]


Article ID: 199496



Products

CA Infrastructure Management, CA Performance Management - Usage and Administration, DX NetOps

Issue/Introduction

The Data Aggregator (DA) no longer starts.

The software keeps logging "karaf attribute loading is still in progress".

karaf.log contains many instances of the following errors:

ERROR | p1233910105-1525 | 2020-09-10 21:39:58,794 | ODataQueryAuthenticationFilter | odata.filters.AOpenAPIBaseFilter   69 | data-services.odataquery |       | ODataQueryAuthenticationFilter: Authentication Service is not available

And 

INFO  | p1233910105-1646 | 2020-09-10 21:56:01,354 | WebServiceExceptionMapper | t.impl.WebServiceExceptionMapper   55 | .ca.im.web-services.impl |       | A web service error has occurred: Unable to find the specified URL[metricFamilies]
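To gauge how often these messages occur, a short script can count them in karaf.log. The following is a minimal sketch, not part of the product: the function name `count_signatures` and the signature labels are illustrative assumptions, and the message fragments are copied from the log excerpts above.

```python
# Message fragments quoted in this article's karaf.log excerpts.
# The dictionary keys are illustrative labels, not product terminology.
SIGNATURES = {
    "auth_unavailable": "Authentication Service is not available",
    "url_not_found": "Unable to find the specified URL[metricFamilies]",
    "still_loading": "attribute loading is still in progress",
}


def count_signatures(lines):
    """Count how many lines contain each fragment (case-insensitive)."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        lowered = line.lower()
        for name, fragment in SIGNATURES.items():
            if fragment.lower() in lowered:
                counts[name] += 1
    return counts


# Small in-memory demo; for a real log, pass an open file object instead.
sample_lines = [
    "ERROR | ... | ODataQueryAuthenticationFilter: Authentication Service is not available",
    "INFO  | ... | A web service error has occurred: Unable to find the specified URL[metricFamilies]",
    "INFO  | ... | some unrelated message",
]
for name, total in count_signatures(sample_lines).items():
    print(f"{name}: {total}")
```

For an actual log, something like `count_signatures(open(path, errors="replace"))` would work, where `path` is the location of karaf.log on the DA host.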

These errors indicate that the problem is on the Data Repository (DR): Vertica is under excessive load and cannot respond to the DA in a timely manner.

Restart the DR (Vertica) via adminTools.

As the dradmin user, run:

/opt/vertica/bin/adminTools

Then stop the database.

Next, stop the DA (including activemq). After that, delete the karaf data directory:

rm -rf /opt/IMDataAggregator/apache-<VERSION>/data

Then restart the services: first activemq, then dadaemon.

The DA logs show: Attribute loading is still in progress - Load time: 1:22:19.911

After about 3 hours, CPU usage became heavy and loading failed with the following exception:
ERROR | xtenderThread-97 | 2020-09-11 00:07:33,379 | ExceptionLog | .ca.im.core.util.ExceptionLogger   99 | m.ca.im.common.core.util |       | A NEW application exception occurred (Key=d55cc72aad3b152ebb3002de3d4415b6ff3959d8)

: loadItems failed : Failed to load attribute data.

com.ca.im.item.db.DbException: Failed to load attribute data.

 

The output of the DR diagnostic tools vcpuperf, vnetperf, and vioperf was requested.

Two issues were found:

 

1. Check the latency between the DR nodes and the DA.

The maximum recommended RTT latency is 2 milliseconds, but the output shows about 82.9 ms and 83.9 ms (82868 and 83884 microseconds) on two of the three nodes.

 

Clock skew should be less than 1 second. The output shows at most about 4.5 ms (4494 microseconds), which is within that limit.

 

test    | date                    | node          | index | rtt latency (us) | clock skew (us)
-----------------------------------------------------------------------------------------------
latency | 2020-09-11_04:30:47,249 | 10.121.206.66 | 0     | 82868            | 4494
latency | 2020-09-11_04:30:47,249 | 10.121.206.67 | 1     | 83884            | 4200
latency | 2020-09-11_04:30:47,249 | 10.121.206.68 | 2     | 40               | 2
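Because the latency columns are reported in microseconds, it is easy to misread them. The sketch below converts the rows above to milliseconds and flags nodes that exceed the thresholds quoted in this article. The function names and the threshold constants are illustrative assumptions; they are not part of vnetperf.

```python
# Thresholds as quoted in this article (assumed here, not read from Vertica).
MAX_RTT_MS = 2.0       # maximum recommended RTT latency
MAX_SKEW_MS = 1000.0   # maximum recommended clock skew (1 second)

SAMPLE = """\
test | date | node | index | rtt latency (us) | clock skew (us)
latency | 2020-09-11_04:30:47,249 | 10.121.206.66 | 0 | 82868 | 4494
latency | 2020-09-11_04:30:47,249 | 10.121.206.67 | 1 | 83884 | 4200
latency | 2020-09-11_04:30:47,249 | 10.121.206.68 | 2 | 40 | 2
"""


def parse_latency_rows(text):
    """Return one dict per 'latency' row, with values converted to ms."""
    rows = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Skip the header, separator lines, and anything malformed.
        if len(fields) != 6 or fields[0] != "latency":
            continue
        rows.append({
            "node": fields[2],
            "rtt_ms": int(fields[4]) / 1000.0,
            "skew_ms": int(fields[5]) / 1000.0,
        })
    return rows


def nodes_over_limit(rows):
    """List nodes whose RTT or clock skew exceeds the thresholds."""
    return [
        row["node"]
        for row in rows
        if row["rtt_ms"] > MAX_RTT_MS or row["skew_ms"] > MAX_SKEW_MS
    ]


rows = parse_latency_rows(SAMPLE)
for row in rows:
    print(f'{row["node"]}: rtt={row["rtt_ms"]:.3f} ms, skew={row["skew_ms"]:.3f} ms')
print("Over limit:", nodes_over_limit(rows))
```

With the sample data, only the first two nodes are flagged, and they are flagged for RTT latency rather than clock skew.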

 

 

2. TCP throughput should be at least 100 MB/s, but in the vnetperf output it reaches only about 56 MB/s.

 

2020-09-11_04:31:07,816 | tcp-throughput    | 256               | average          | 56.0467     | 55.8549     | 77332480            | 77277866            | 1.3228 
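The same kind of check can be applied to the throughput row. This is a minimal sketch that assumes the fifth and sixth columns are the sent and received rates in MB/s (the header row is not shown in this article, so the column positions are an assumption), and that 100 MB/s is the minimum as stated above. The function names are illustrative.

```python
MIN_THROUGHPUT_MBPS = 100.0  # minimum quoted in this article

LINE = (
    "2020-09-11_04:31:07,816 | tcp-throughput    | 256               | "
    "average          | 56.0467     | 55.8549     | 77332480            | "
    "77277866            | 1.3228 "
)


def parse_throughput_row(line):
    """Split one pipe-delimited row and pick out the assumed MB/s columns."""
    fields = [field.strip() for field in line.split("|")]
    return {
        "test": fields[1],
        "sent_mbps": float(fields[4]),  # assumed: MB/s sent
        "recv_mbps": float(fields[5]),  # assumed: MB/s received
    }


def meets_minimum(row):
    """True only if both directions reach the minimum throughput."""
    return min(row["sent_mbps"], row["recv_mbps"]) >= MIN_THROUGHPUT_MBPS


row = parse_throughput_row(LINE)
print(row["test"], row["sent_mbps"], row["recv_mbps"], "OK" if meets_minimum(row) else "BELOW MINIMUM")
```

For the row above, both rates are well under the 100 MB/s minimum, so the check reports the link as below the recommended level.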

 

 

 

Environment

Release : Any CAPM version

Component : IM Data Aggregator

Resolution

Network problems are disrupting communication, and connections to the DR servers are failing.

The network team needs to bring latency and throughput to the recommended levels.