Clarity - Hanging threads and high CPU utilization in process engine due to java.util.HashMap infinite loop problem

Article ID: 42239

Updated On:

Products

Clarity PPM SaaS, Clarity PPM On Premise

Issue/Introduction

The bg services that run the process engine can intermittently show high CPU utilization, which persists until the service is restarted.
 

Thread dumps captured with the JDK's jstack tool show multiple threads stuck at the same line in java.util.HashMap.getEntry(), with no synchronization locks involved.

 
For example, searching a thread dump for this frame returns one hit per affected thread (a process action, pipeline, or custom script execution thread):
 
C:\HeapDumps\bg_threaddump_201602032203.txt (9 hits)
Line 79: at java.util.HashMap.getEntry(HashMap.java:446)
Line 392: at java.util.HashMap.getEntry(HashMap.java:446)
Line 408: at java.util.HashMap.getEntry(HashMap.java:446)
Line 424: at java.util.HashMap.getEntry(HashMap.java:446)
Line 440: at java.util.HashMap.getEntry(HashMap.java:446)
Line 456: at java.util.HashMap.getEntry(HashMap.java:446)
Line 472: at java.util.HashMap.getEntry(HashMap.java:446)
Line 497: at java.util.HashMap.getEntry(HashMap.java:446)
Line 535: at java.util.HashMap.getEntry(HashMap.java:446)


Specific thread stack trace showing an example from the thread dump file:

"Event Handler pool-3-thread-13" prio=10 tid=0x00000000088b1800 nid=0x657a runnable [0x0000000040266000]
 java.lang.Thread.State: RUNNABLE
at java.util.HashMap.getEntry(HashMap.java:446)
at java.util.HashMap.get(HashMap.java:405)
at com.niku.bpm.utilities.BpmUtils.getLoggedinSecurityIdentifier(BpmUtils.java:110)
at com.niku.bpm.eventmgr.ObjectEventHandler.processEventToAutoStartProcesses(ObjectEventHandler.java:137)
at com.niku.bpm.eventmgr.ObjectEventHandler.fireEvent(ObjectEventHandler.java:63)
at com.niku.bpm.eventmgr.messageserver.BaseEventHandler.fireEvent(BaseEventHandler.java:27)
at com.niku.bpm.eventmgr.messageserver.BaseEventHandler.run(BaseEventHandler.java:77)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
 Locked ownable synchronizers:
- <0x00000007825e2770> (a java.util.concurrent.ThreadPoolExecutor$Worker)
 
 

Steps to Reproduce:

  1. Create a large number of process instances for many users (e.g. 100+) in a short time frame.
  2. Monitor the threads of the bg service (via thread dumps) and look for an accumulation of threads stuck at the same line in java.util.HashMap.getEntry().
  3. Monitor the CPU utilization of the machine where the bg service runs.
  4. Capture a thread dump with the JDK's jstack command at the command prompt: jstack -l <pid of hanging service>
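
Step 4 can also be scripted so that dumps are captured repeatedly while reproducing the problem. The following is a minimal sketch only: it assumes jstack is on the PATH of the user that owns the bg process, and the class name, method names, and output file name are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of Clarity or the JDK.
class JstackCapture {

    // Builds the command from step 4: jstack -l <pid>
    static List<String> jstackCommand(long pid) {
        return Arrays.asList("jstack", "-l", Long.toString(pid));
    }

    // Runs jstack and writes its output (stdout and stderr) to the given file.
    // Returns the exit code of the jstack process.
    static int capture(long pid, File out) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(jstackCommand(pid))
                .redirectErrorStream(true)
                .redirectOutput(out)
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        long pid = Long.parseLong(args[0]); // pid of the hanging bg service
        // Assumed naming scheme for the dump file.
        File out = new File("bg_threaddump_" + System.currentTimeMillis() + ".txt");
        System.exit(capture(pid, out));
    }
}
```

Taking several dumps a few seconds apart makes it easier to confirm that the same threads remain at the same frame rather than passing through it.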

 

Expected Result: CPU utilization does not spike to, or stay at, 95% or higher, and threads do not appear stuck at the same line in java.util.HashMap.getEntry().

Actual Result: CPU utilization intermittently spikes and threads hang. The condition typically does not resolve without a service restart.

Environment

Release: All 

Cause

Caused by CLRT-79908
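
The internals of CLRT-79908 are not described in this article. As general background, the symptom matches a known behavior of unsynchronized java.util.HashMap use: if several threads modify the same map at once, its internal bucket chains can become corrupted, and a later get() can then spin without returning. The sketch below is a generic illustration, not Clarity code; the class and method names are hypothetical. It shows the usual safe pattern for a map shared between worker threads, using java.util.concurrent.ConcurrentHashMap.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not taken from the Clarity code base.
class SharedMapSketch {

    // Fills one shared map from several threads and returns its final size.
    // ConcurrentHashMap is designed for concurrent writers; a plain HashMap
    // shared like this without external locking is not thread-safe.
    static int fillConcurrently(int threads, int keysPerThread) {
        final Map<String, Integer> shared = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                for (int i = 0; i < keysPerThread; i++) {
                    shared.put(id + "-" + i, i); // keys are distinct per thread
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(fillConcurrently(8, 10000)); // prints 80000
    }
}
```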

Resolution

Resolution:

This defect will not be fixed at this time. If the problem occurs on any currently supported version of Clarity, raise a support ticket referencing this defect (CLRT-79908) or this knowledge article so that the defect can be reviewed again.

Workaround:

Restart the bg services when possible. Little else is likely to run before the restart in any case, because most CPU cycles are consumed by threads spinning in an infinite loop.

Additional Information:

The indicator of this problem is that, in multiple threads, the top two stack frames are:

at java.util.HashMap.getEntry(HashMap.java:446)
at java.util.HashMap.get(HashMap.java:405)

The line numbers shown for java.util.HashMap correspond to Oracle JDK 1.7.0_21. Other JDK versions, and possibly other operating systems on which the bg service runs, may show different line numbers, but the problem and cause are the same.
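
Because the line numbers can vary, a check for this indicator is more robust if it matches on method names only. The following is a minimal sketch under that assumption: it reads a jstack dump file and lists the threads whose top two frames are HashMap.getEntry and HashMap.get. The class name, method names, and the simple line-based parsing are hypothetical and not part of any Clarity tooling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Clarity or the JDK.
class HashMapSpinScan {

    // Frame prefixes without line numbers, so any JDK build matches.
    static final String TOP = "at java.util.HashMap.getEntry(";
    static final String NEXT = "at java.util.HashMap.get(";

    // Returns the names of threads whose first two frames match the indicator.
    static List<String> stuckThreads(List<String> dumpLines) {
        List<String> stuck = new ArrayList<>();
        String thread = null;   // name of the thread currently being read
        int frame = 0;          // index of the next frame within that thread
        boolean topMatched = false;
        for (String raw : dumpLines) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // A thread header looks like: "name" prio=... tid=... nid=...
                int end = line.indexOf('"', 1);
                thread = end > 0 ? line.substring(1, end) : line;
                frame = 0;
                topMatched = false;
            } else if (thread != null && line.startsWith("at ")) {
                if (frame == 0) {
                    topMatched = line.startsWith(TOP);
                } else if (frame == 1 && topMatched && line.startsWith(NEXT)) {
                    stuck.add(thread);
                }
                frame++;
            }
        }
        return stuck;
    }

    public static void main(String[] args) throws IOException {
        List<String> hits = stuckThreads(Files.readAllLines(Paths.get(args[0])));
        System.out.println(hits.size() + " thread(s) match the indicator");
        for (String name : hits) {
            System.out.println("  " + name);
        }
    }
}
```

A single matching thread in one dump is not conclusive on its own. Several threads matching across consecutive dumps, together with sustained high CPU, is the pattern described above.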