
Unable to start APM after server crash (full disk)


Article ID: 242920



Products

CA Application Performance Management (APM / Wily / Introscope)

Issue/Introduction

We had a server crash on our PROD on-prem MOM last night.

The server was restarted, but I am unable to restart the MOM via EMCtl. Startup hangs at the following log entry:

\(ms\) OR <all>   ((.*)\|(.*)\|(.*))   By Business Service\|[^|]+\|[^|]+\|.*:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   ASP.NET\|[^|]+:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   WebServices\|Server\|[^|]+\|[^|]+:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   Business Segment\|[^|]+\|[^|]+\|.*:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   Backends\|WebService at [^|]+\|Paths\|[^|]+:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   Backends\|System [^|]+ on port [^|]+:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   Frontends\|Messaging Services \(onMessage\)\|[^|]+\|[^|]+:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   Frontends\|Messaging Services \(receive\)\|[^|]+\|[^|]+:Estimated Message Processing Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   Backends\|Messaging Services \(outgoing\)\|[^|]+\|[^|]+:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   Backends\|Messaging Services \(receive\)\|[^|]+\|[^|]+:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   Backends\|[^|]+\.+[^|]+\.+[^|]+:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   Automatic Entry Points\|[^|]+\|[^|]+:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   Backends\|[^|]*:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   MVC\|Controllers\|[^|]+:Average Response Time \(ms\) OR <all>   ((.*)\|(.*)\|(.*))   WebAPI\|Controllers\|[^|]+:Average Response Time \(ms\) AND [email protected]7), cautionThresholdLevel=5, dangerThresholdLevel=4, weRules=[true, true, true, true], windowCellCount=20, windowCellDecay=20, ruleEngineClass=com.ca.apm.baseline.alert.rules.wer.CustomWERuleEngine, cautionThreshold=1650, dangerThreshold=3960, extremeDangerThreshold=10000, weRuleWeights=[300, 300, 200, 100]]] on new collector [email protected]
[INFO] [ClusterManager Async Executor] [Manager.AppMap.RemoteHttp] Register RemoteHttpCall Message Service for Collector-8
[INFO] [ClusterManager Async Executor] [Manager.AppMap.TokenValidator] Register TokenValidator Message Service for Collector-8
[INFO] [ClusterManager Async Executor] [Manager.AppMap.SecureStore] Register SecureStore Message Service for Collector-8
[INFO] [ClusterManager Async Executor] [Manager.AppMap] Registered with [email protected] for event consumption
[INFO] [ClusterManager Async Executor] [Manager.AppMap.GeoLocation] Register GeoLocation Message Service for Collector-8
[INFO] [ClusterManager Async Executor] [Manager.com.wily.apm.tess.isengard.TransactionTraceReverseProxyBean] TransactionTraceReverseProxyBean::collectorsAdded() Register Message Service for Collector-8
[INFO] [ClusterManager Async Executor] [Manager.com.wily.apm.tess.isengard.messaging.TessMessagePublisherReverseProxyBean] Register Message Service for Collector-8
[WARN] [pool-24-thread-5] [Manager.EemRealm] Absolute file path for logger configuration not set in "eiam.config"
[INFO] [pool-24-thread-5] [Manager.EemRealm] EEM SDK initialized in non-FIPS mode
[INFO] [pool-24-thread-5] [Manager.EemRealm] "EEM-US" realm attached to application "APM-Prod" in EEM server at "ussalxapps171p.cotyww.com" using external directory
[INFO] [TimerBean] [Manager] Successfully distributed transaction trace error filter rules to all collectors
[INFO] [TimerBean] [Manager] Successfully distributed transaction trace error filter rules to all collectors
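When startup appears to hang, it can help to see which component logged last and whether any warnings or errors preceded it. The following is a minimal Python sketch, not part of the product, that assumes log lines have the shape `[LEVEL] [thread] [logger] message` seen in the excerpt above (optionally preceded by a timestamp). The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above:
#   [LEVEL] [thread] [logger] message
# search() is used so that an optional timestamp prefix is tolerated.
LOG_LINE = re.compile(
    r"\[(?P<level>[A-Z]+)\]\s+"
    r"\[(?P<thread>[^\]]+)\]\s+"
    r"\[(?P<logger>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict with level, thread, logger, message, or None if no match."""
    match = LOG_LINE.search(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count entries per log level and return the last parsable entry."""
    entries = [entry for entry in map(parse_line, lines) if entry]
    levels = Counter(entry["level"] for entry in entries)
    last = entries[-1] if entries else None
    return levels, last
```

Calling `summarize(open(path))` on the EM log would return the per-level counts and the final entry, which indicates roughly where startup stopped making progress.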

 

 

Cause

The SmartStor or transaction trace files were most likely corrupted when the disk filled up and the server crashed.

Environment

Release : 10.7.0

Component :

Resolution

Cleared the /data and /traces directories, then restarted the EM. The EM started successfully.

Additional Information

The crash happened around 6 P.M. last night. Disk space was cleared around midnight, and a large HPROF (heap dump) file was removed.
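To illustrate the disk check, here is a minimal Python sketch that reports how full a filesystem is and lists large heap dump files beneath a directory. It is an illustration only: the function names, the `.hprof` suffix filter, and the 100 MB default threshold are assumptions, and the directory to scan must be supplied by the operator.

```python
import shutil
from pathlib import Path


def disk_usage_percent(path):
    """Return the percentage of used space on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def find_large_files(root, suffix=".hprof", min_bytes=100 * 1024 * 1024):
    """Return (size, path) pairs for files under root ending in suffix
    that are at least min_bytes in size, largest first."""
    hits = []
    for candidate in Path(root).rglob(f"*{suffix}"):
        if candidate.is_file():
            size = candidate.stat().st_size
            if size >= min_bytes:
                hits.append((size, candidate))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)
```

For example, `find_large_files(em_home)` with `em_home` set to the EM installation directory would list candidate heap dumps. The sketch only reports files; deciding whether to delete them is left to the operator.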

During a Webex session, the following steps were taken:
1) Cleared the OSGi cache. Startup got a little further.
2) Rebooted the server.
3) Cleared the data and traces directories.
4) Checked for a GUID lock file. None was present.
5) Restarted the EM. It came up.
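Step 3 above removes the stored data outright. As a more cautious variant, the directories can be renamed and recreated empty so the old contents remain available for inspection. The Python sketch below illustrates this under stated assumptions: the directories are named `data` and `traces` directly under a hypothetical EM home directory, the EM is stopped before it runs, and the function name `set_aside` is invented for this example.

```python
import time
from pathlib import Path


def set_aside(em_home, names=("data", "traces"), stamp=None):
    """Rename each existing directory to <name>.bak-<stamp> and recreate
    it empty. Returns the list of backup paths that were created."""
    em_home = Path(em_home)
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    moved = []
    for name in names:
        src = em_home / name
        if not src.is_dir():
            continue  # nothing to set aside for this name
        dst = em_home / f"{name}.bak-{stamp}"
        src.rename(dst)  # keep the old contents under a new name
        src.mkdir()      # leave an empty directory in the original place
        moved.append(dst)
    return moved
```

Because the renamed copies still occupy disk space, this approach only fits once enough space has been freed. The backups can be deleted after the EM is confirmed to be running.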