The goal of this article is to promote proactive monitoring rather than reactive monitoring (taking action only after experiencing business impact). This article shares keywords that assist in the monitoring of GemFire logs.
This article can be used in conjunction with the Log Messages and Solutions troubleshooting guide provided in our online documentation.
GemFire keywords found in logs can be helpful in proactively assessing health in order to determine whether action is needed to keep the cluster stable. By examining the logs regularly, or by setting up scripts that analyze all generated GemFire logs and search for these keywords, you can often catch issues before they negatively impact the distributed system.
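As an illustration, here is a minimal sketch of such a script. It assumes the logs live under a hypothetical directory such as /var/gemfire/logs, end in .log, and uses the keywords described in the remainder of this article; adjust the path, file pattern, and keyword list for your own environment.

```python
#!/usr/bin/env python3
"""Minimal sketch: scan GemFire logs for keywords worth proactive attention.

Assumptions (adjust for your environment):
  - logs live under LOG_DIR and end in .log (hypothetical path/pattern)
  - the keyword list mirrors the keywords described in this article
"""
from pathlib import Path

LOG_DIR = Path("/var/gemfire/logs")   # hypothetical location of GemFire logs
KEYWORDS = [
    "wakeup", "elapsed", "above", "suspect", "crashed", "quorum",
    "tenured", "severe", "fatal", "exhaust", "exception", "warning",
    "heartbeat",
]

def scan(path: Path) -> None:
    """Print every line in `path` containing one of the keywords."""
    with path.open(errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            lowered = line.lower()
            for kw in KEYWORDS:
                if kw in lowered:
                    print(f"{path.name}:{lineno}: [{kw}] {line.rstrip()}")
                    break

if __name__ == "__main__":
    for log_file in sorted(LOG_DIR.glob("*.log")):
        scan(log_file)
```

A scan like this can be run from cron or a monitoring agent so that keyword hits surface before any client-visible impact.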
GemFire logs capture many informative messages at various severity levels such as severe, error, warning, config, and info. Independent of the severity, a number of log messages contain very specific keywords that can help you prevent a business impact if they are recognized and acted upon.
Listed below is a set of words, most of them unique enough to produce few or no false positives, that you can search for in GemFire logs to protect your systems even if you are not yet observing any negative impact.
- wakeup
- This keyword is highly specific, with no false positives. If you see this message, it indicates that you are experiencing JVM pauses, which means that GemFire is not running in your system for the amount of time shown in the log message. This requires action and tuning. Such pauses, when sufficiently long, can cause GemFire members to be kicked out of the distributed system.
- elapsed
- This keyword is highly specific, with no false positives. If you see this message, it indicates that peer-to-peer communication between members of the distributed system is impacted to some degree, causing delays in replies from one member to another. This can snowball, so it warrants a full understanding; otherwise, members can be kicked out.
- above
- This keyword is highly specific, with no false positives. If you see this message, it indicates that heap consumption is above the eviction threshold and potentially above the critical threshold. This is not optimal and means tuning is needed. Your data could be evicted, and if the system goes above the critical threshold, members can be kicked out. Recognizing this early could prevent a system or business impact.
- suspect
- Suspect messages often precede members getting kicked out of the cluster. Sometimes the system determines that a member is "no longer suspect" and normal processing continues. However, a member may get kicked out if it continues to be suspect and unresponsive for a long enough period of time. If you monitor the system for such suspect messages and understand the cause, you can tune the system, find the root cause, and prevent a negative impact.
- crashed
- Most of the time, customers realize when a member has crashed or has been kicked out. However, given the GemFire auto-reconnect feature, this is not always the case; sometimes customers are completely unaware until there is an observable client or business impact. It helps to understand why a member crashed in order to prevent a recurrence, so that future or repeated crashes do not impact the cluster.
- quorum
- This is another completely unique word. It means the distributed system may be severely impacted, may be experiencing a major outage, or may even be experiencing a complete outage, driven by network issues and/or the failure of a majority of the GemFire membership. It is important to understand how best to recover from this, and it generally warrants a deep root cause analysis.
- tenured
- This is a newer, unique word in the logs that communicates how much tenured heap is in use (LIVE) immediately after a tenured collection. This message is the most accurate way to assess real tenured heap consumption. By monitoring such messages, you can catch unexplained heap growth very early. For example, if the number of GemFire entries remains mostly constant but the heaps continue to grow over time, it warrants deeper analysis to determine whether the system is experiencing an early memory leak. Recognizing this early can prevent a serious system impact.
- severe
- Any occurrence of this keyword is important and needs to be investigated. However, the word "severe" is not unique in GemFire logs; there are configuration properties with "severe" in the name. More advanced scripts can filter out such occurrences (for example, ack-severe-alert-threshold, as in the sketch following this list) and capture only real issues that need to be investigated.
- fatal
- As the name implies, urgent attention should be given to any occurrence of a fatal issue. Unfortunately, this keyword does not really give advance notice to protect the system, but recognizing fatal issues in the logs as soon as possible can prevent a snowball effect or even worse incidents.
- exhaust
- This catches both "exhaustion", a GemFire-specific message related to hitting thread limits, and "exhausted", which is found in many GC logs when there are heap space issues. Both require tuning, configuration changes, and deeper root cause analysis to determine what is driving the behavior.
- exception
- This will catch many things, some of which may not be worthy of concern, but as you find them you can make any scripts more specific for your environment. For example, you could add ForcedDisconnectException or LowMemoryException. These two are relatively common and can help you become aware of issues brewing that demand analysis. If you discover other exceptions occurring in your environment, you can add more specific terms as needed to any scripts being used to scour the logs.
- warning
- This is similar to severe, fatal, and error. These terms may not be unique unless you add a bracket before the word, for example "[severe", "[error", "[warning". The bracket is part of the GemFire log formatting and helps filter out other uses of those terms that are not of concern (see the sketch following this list).
- heartbeat
- This can be found in multiple log messages, all of which are worthy of investigation. One relates to unresponsive members, tying it to the wakeup keyword described above. Another occurs when clients exceed timeouts, which also warrants proactive work if it begins to occur in your environment.
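Putting the refinements above together, here is a refined version of the earlier sketch. It matches severity keywords only with the leading bracket, skips lines that merely echo the ack-severe-alert-threshold property, and calls out specific exceptions by name; all of the patterns, exclusions, and the default log file name are assumptions to tune for your own environment.

```python
#!/usr/bin/env python3
"""Sketch of a refined GemFire log scan (patterns are assumptions to tune).

Refinements illustrated:
  - severity keywords are matched with a leading bracket ("[severe", "[error",
    "[warning", "[fatal") to match the GemFire log format and avoid false hits
  - lines mentioning the ack-severe-alert-threshold property are excluded
  - specific exceptions of interest are called out by name
"""
import re
import sys
from collections import Counter
from pathlib import Path

# Keywords matched anywhere in the line.
PLAIN_KEYWORDS = [
    "wakeup", "elapsed", "above", "suspect", "crashed", "quorum",
    "tenured", "exhaust", "heartbeat",
    "ForcedDisconnectException", "LowMemoryException",
]
# Severity keywords matched only with the leading bracket from the log format.
BRACKETED_KEYWORDS = ["[severe", "[error", "[warning", "[fatal"]
# Lines to ignore even if they contain a keyword (e.g., configuration echoes).
EXCLUDE = re.compile(r"ack-severe-alert-threshold", re.IGNORECASE)

def scan(path: Path, counts: Counter) -> None:
    """Print matching lines from `path` and tally hits per keyword."""
    with path.open(errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            if EXCLUDE.search(line):
                continue
            lowered = line.lower()
            for kw in PLAIN_KEYWORDS + BRACKETED_KEYWORDS:
                if kw.lower() in lowered:
                    counts[kw] += 1
                    print(f"{path}:{lineno}: {line.rstrip()}")
                    break

if __name__ == "__main__":
    counts: Counter = Counter()
    # Pass log paths as arguments; "server.log" is a hypothetical default.
    for log_file in sys.argv[1:] or ["server.log"]:
        scan(Path(log_file), counts)
    print("\nKeyword hit counts:", dict(counts))
```

The per-keyword counts at the end make it easy to spot trends over time, for example a steadily rising number of "suspect" or "elapsed" hits from one day to the next.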
Symptoms:
The symptom is seeing any occurrence of the keywords listed above in your logs; each such instance should be investigated and addressed accordingly.