Issue:
Both the primary and the secondary MOM Enterprise Managers (EMs) are accessing the APM database at the same time, which causes deadlock errors in the MOM EM logs such as:
[ERROR] [QuartzScheduler_Worker-2] [org.quartz.core.JobRunShell] Job DEFAULT.jobDetailBean threw an unhandled Exception:
org.springframework.scheduling.quartz.JobMethodInvocationFailedException: Invocation of method 'pruneData' on target class [class com.wily.apm.model.pruning.DataPruner] failed; nested exception is org.springframework.jdbc.UncategorizedSQLException: CallableStatementCallback; uncategorized SQLException for SQL [{? = call PRUNE_APM_DATA(?, ?)}]; SQL state [72000]; error code [20999]; ORA-20999: An error - -60 Error Msg - ORA-00060: deadlock detected while waiting for resource
ORA-06512: at "APM.PRUNE_APM_DATA", line 71
Environment:
This issue occurred with APM 9.7 primary and secondary MOM EMs. It may also occur in APM 10.0, 10.1, and 10.2.
Cause:
Under certain conditions, network issues affecting access to the EM shared resources (e.g. EM_HOME/config/internal/server/primary_em.lck and/or EM_HOME/config/internal/server/secondary_em.lck) can allow the secondary MOM to acquire the primary lock (primary_em.lck) and start fully while the primary MOM is still up and running.
Both MOMs are then active against the same APM database, which leads to the deadlock errors described above.
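The failover scheme depends on primary_em.lck being held exclusively by one process at a time. The following Python sketch is a hypothetical illustration of that rule using POSIX advisory `flock` locks on a local file; it is not the Enterprise Manager's actual implementation, and the function names `try_acquire` and `wait_for_lock` are made up for this example. On a shared NAS, whether a second host is kept out depends on the file system honoring the lock; if the lock is lost during a network interruption, a second process can acquire it, which is the situation described above.

```python
import fcntl
import time
from typing import IO, Optional


def try_acquire(path: str) -> Optional[IO[str]]:
    """Try once to take an exclusive lock on `path` without blocking.

    Returns the open file that holds the lock (closing it releases the lock),
    or None if another open file, in this or another process, holds it.
    """
    f = open(path, "a")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None


def wait_for_lock(path: str, poll_seconds: float = 5.0,
                  max_attempts: Optional[int] = None) -> Optional[IO[str]]:
    """Poll for the lock, sleeping between attempts, as a standby process would.

    Hypothetical helper: returns the lock holder, or None after max_attempts.
    """
    attempts = 0
    while True:
        holder = try_acquire(path)
        if holder is not None:
            return holder
        attempts += 1
        if max_attempts is not None and attempts >= max_attempts:
            return None
        time.sleep(poll_seconds)
```

While the first holder keeps its file open, every `wait_for_lock` attempt fails; only after the holder closes the file (or its process exits) can the waiting process take over.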
Resolution:
After any network issue affecting access to the EM shared resources, check the output when you start the secondary MOM. The following messages are displayed:
4/20/16 09:00:07.447 AM EDT [INFO] [main] [Manager] Starting Introscope Enterprise Manager...
4/20/16 09:00:07.914 AM EDT [INFO] [main] [Manager.HotFailover] The Introscope Enterprise Manager is configured as a Secondary EM
4/20/16 09:00:07.916 AM EDT [INFO] [main] [Manager.HotFailover] Acquiring primary lock...
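At this point the secondary MOM goes into wait mode, regularly checking whether it can access primary_em.lck. The secondary MOM startup should not proceed until it acquires primary_em.lck, that is, until the primary MOM no longer holds it. If the secondary MOM starts fully while the primary MOM is still running, both MOMs are active, as described under Cause.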
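To check this from a saved startup log rather than by watching the console, a small Python heuristic can look at the [Manager.HotFailover] messages. It is only a sketch: it relies solely on the two HotFailover messages shown above, treats "Acquiring primary lock..." as the latest HotFailover message to mean the secondary MOM is still waiting, and the function names are hypothetical. Any messages the EM logs after it acquires the lock are not documented here, so the check does not try to match them.

```python
import re
from typing import Iterable, List

# Captures the text after the [Manager.HotFailover] tag in an EM log line.
HOT_FAILOVER = re.compile(r"\[Manager\.HotFailover\]\s+(?P<msg>.*)$")


def hot_failover_messages(lines: Iterable[str]) -> List[str]:
    """Return the HotFailover messages in log order."""
    msgs = []
    for line in lines:
        m = HOT_FAILOVER.search(line)
        if m:
            msgs.append(m.group("msg").strip())
    return msgs


def secondary_waiting_for_lock(lines: Iterable[str]) -> bool:
    """Heuristic: the EM reports itself as Secondary and its latest
    HotFailover message is still 'Acquiring primary lock...'."""
    msgs = hot_failover_messages(lines)
    is_secondary = any("configured as a Secondary EM" in m for m in msgs)
    return is_secondary and msgs[-1] == "Acquiring primary lock..."
```

For example, pass it the lines of the secondary MOM's log file. A False result while the primary MOM is still running is a reason to check whether both MOMs are active.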
Note: A typical implementation of MOM failover across multiple hosts shares a single, complete EM installation on a highly available (HA) Network Attached Storage (NAS) device, accessed for example over NFS or SMB.
It is highly recommended that the EM shared resources (e.g. the shared disk and the EM installation location) be highly available (HA).