GemFire: What to do if you see "thread is stuck" and "has been stuck" log messages



Article ID: 294441


Products

VMware Tanzu GemFire

Issue/Introduction

This article discusses what to do if you see messages related to "stuck" threads similar to the following:
[warn 2020/01/08 10:53:39.999 EST <ThreadsMonitor> tid=0x32] Thread 2625 (0xa41) is stuck

[warn 2020/01/08 10:53:40.005 EST <ThreadsMonitor> tid=0x32] Thread <2625> (0xa41) that was executed at <08 Jan 2020 10:53:02 EST> has been stuck for <37.14 seconds> and number of thread monitor iteration <1>  Thread Name <Pooled Waiting Message Processor 329> state <TIMED_WAITING> Waiting on <java.util.concurrent.CountDownLatch$Sync@200baf67> Executor Group <PooledExecutorWithDMStats> Monitored metric <ResourceManagerStats.numThreadsStuck> Thread stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.awaitWithCheck(StoppableCountDownLatch.java:120)
org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:93)
org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:692)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
org.apache.geode.internal.cache.BucketAdvisor.sendProfileUpdate(BucketAdvisor.java:1618)
org.apache.geode.internal.cache.BucketAdvisor.acquiredPrimaryLock(BucketAdvisor.java:1189)
org.apache.geode.internal.cache.BucketAdvisor.acquirePrimaryRecursivelyForColocated(BucketAdvisor.java:1305)
org.apache.geode.internal.cache.BucketAdvisor.access$700(BucketAdvisor.java:84)
org.apache.geode.internal.cache.BucketAdvisor$VolunteeringDelegate.doVolunteerForPrimary(BucketAdvisor.java:2530)
org.apache.geode.internal.cache.BucketAdvisor$VolunteeringDelegate$$Lambda$682/831346815.run(Unknown Source)
org.apache.geode.internal.cache.BucketAdvisor$VolunteeringDelegate.lambda$consumeQueue$0(BucketAdvisor.java:2728)
org.apache.geode.internal.cache.BucketAdvisor$VolunteeringDelegate$$Lambda$684/491870748.run(Unknown Source)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
org.apache.geode.distributed.internal.ClusterOperationExecutors.runUntilShutdown(ClusterOperationExecutors.java:442)
org.apache.geode.distributed.internal.ClusterOperationExecutors.doWaitingThread(ClusterOperationExecutors.java:411)
org.apache.geode.distributed.internal.ClusterOperationExecutors$$Lambda$172/1606698192.invoke(Unknown Source)
org.apache.geode.logging.internal.executors.LoggingThreadFactory.lambda$newThread$0(LoggingThreadFactory.java:119)
org.apache.geode.logging.internal.executors.LoggingThreadFactory$$Lambda$168/1351560056.run(Unknown Source)
java.lang.Thread.run(Thread.java:748)


If you are experiencing issues in your environment and you also see "stuck" log messages like the ones above, they warrant further investigation. That said, these messages are often nothing to be concerned about: if you are not seeing other negative symptoms, such as members being kicked out of the distributed system, the messages alone are very likely benign.

The "stuck" message could be related to some longer repetitive task that is going through the same piece of code over and over for each bucket and for this reason, seems to be "stuck" when, in fact, good progress is being made.

Key things to look for to determine whether you have a real issue:
 

  • If members have been kicked out of the cluster and you see these messages, it likely warrants a new ticket for us to assess.
  • If the same thread remains stuck for a long time, evidenced by many <iteration> counts for that thread, it warrants deeper analysis.
  • The full stack is logged for each iteration of a given stuck thread. If the stack is identical, stuck at the same place in the code, for every iteration, that should probably drive a new ticket even if there is no visible impact. If the stack changes between iterations, the thread is not really "stuck", and you can likely disregard the message unless you see other negative symptoms. A short log-scanning sketch follows this list.
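
If you want to scan a member log for these warnings programmatically, the following minimal Python sketch counts the "has been stuck" warnings per thread and reports the highest iteration number seen. It assumes the log line format shown in the example above; the regular expression and the "server.log" path are illustrative and may need adjusting for your environment.

import re
from collections import defaultdict

# Matches the "has been stuck" header line from the example above (assumed format).
STUCK_RE = re.compile(
    r"Thread <(?P<tid>\d+)> .*has been stuck for <(?P<secs>[\d.]+) seconds> "
    r"and number of thread monitor iteration <(?P<iteration>\d+)>\s+"
    r"Thread Name <(?P<name>[^>]+)>"
)

def summarize_stuck_threads(log_path):
    """Return {thread id: [thread name, highest iteration seen, longest stuck seconds]}."""
    summary = defaultdict(lambda: ["", 0, 0.0])
    with open(log_path, errors="replace") as log:
        for line in log:
            match = STUCK_RE.search(line)
            if not match:
                continue
            entry = summary[match["tid"]]
            entry[0] = match["name"]
            entry[1] = max(entry[1], int(match["iteration"]))
            entry[2] = max(entry[2], float(match["secs"]))
    return summary

# "server.log" is a placeholder path; point it at the member log you are checking.
for tid, (name, iterations, secs) in summarize_stuck_threads("server.log").items():
    verdict = "worth a closer look" if iterations > 1 else "likely benign"
    print(f"thread {tid} ({name}): {iterations} iteration(s), up to {secs}s stuck -> {verdict}")

A thread that keeps reappearing with a growing iteration count, and the same stack every time, is the case worth raising with Support.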
The purpose of this article is to avoid the need to open tickets when there is no cause for concern. If you have no negative symptoms and are simply wondering what these messages mean, there is most likely nothing to worry about.

You can analyze the logs more deeply, as described above, to assess whether a new ticket is warranted.

There are many possible causes for these log messages, and many of them are not a cause for concern.

These messages are also helpful when a customer has a real issue but did not gather thread dumps when it would have been prudent to do so. In such cases, any "has been stuck" messages found in the logs can serve as a partial substitute for full thread dumps.

They are not nearly as useful as a thread dump, but they are better than nothing for pointing an investigation in the right direction.
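
If you need to collect those recorded stacks in one place for review (for example, to share on a ticket), a small script can do it. The Python sketch below assumes that, as in the example above, each stack frame appears on its own line after the "Thread stack:" marker and that a new log entry starts with a bracketed log level; the file names are placeholders.

import re

# Start of a new GemFire log entry, e.g. "[warn 2020/01/08 ...]" (assumed layout).
LOG_ENTRY_RE = re.compile(r"^\[(warning|warn|info|error|debug|fine|finest)\b")

def extract_stuck_stacks(log_path, out_path):
    """Copy every 'has been stuck' warning and the stack lines that follow it."""
    in_stack = False
    with open(log_path, errors="replace") as log, open(out_path, "w") as out:
        for line in log:
            if "has been stuck for" in line:
                out.write("\n" + line)                 # keep the warning header as context
                in_stack = "Thread stack:" in line     # frames follow on their own lines
            elif in_stack:
                if LOG_ENTRY_RE.match(line) or not line.strip():
                    in_stack = False                   # next entry or blank line ends the stack
                else:
                    out.write("    at " + line.strip() + "\n")

# Placeholder file names; point these at a real member log.
extract_stuck_stacks("server.log", "stuck-stacks.txt")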

Environment

Product Version: 9.1
OS: Any

Resolution

The resolution is straightforward:

Take no action if you see no other symptoms and the stuck-thread messages show the thread making progress over the iterations (the stack changes from one iteration to the next).

If you do not see many iterations and the message appears only once or twice and then goes away, there is almost certainly no cause for concern.

If you do lose a member at, or close to, the time of a "Thread ... is stuck" or "has been stuck" message, open a Support ticket and we can evaluate the health of your environment with a focus on this specific issue.

Checklist:
Key things to look for to determine whether you have a real issue:
  • If members have been kicked out of the cluster and you see these messages, it likely warrants a new ticket for us to assess.
  • If the same thread remains stuck for a long time, evidenced by many <iteration> counts for that thread, it warrants deeper analysis.
  • The full stack is logged for each iteration of a given stuck thread. If the stack is identical, stuck at the same place in the code, for every iteration, that should probably drive a new ticket even if there is no visible impact. If the stack changes between iterations, the thread is not really "stuck", and you can likely disregard the message unless you see other negative symptoms.