Troubleshooting using Visual Statistics Display (VSD) in VMware GemFire


Article ID: 293975


Updated On: 03-31-2025

Products

VMware Tanzu GemFire

Issue/Introduction

This article introduces the Visual Statistics Display (VSD) tool and explains how to use it to troubleshoot VMware GemFire issues. It applies to GemFire 7 and later.

Resolution

Visual Statistics Display (VSD) is a visual tool for analyzing GemFire statistics. It is probably the most important GemFire tool to understand because it is used when either tuning GemFire or troubleshooting most GemFire issues.
 

VSD works by reading GemFire statistics from *.gfs archive files, which are created by GemFire, and rendering them as graphs for analysis. Unlike Pulse, it is not a real-time online monitoring tool and has no real-time monitoring or alerting capabilities. On the other hand, it is the most powerful tool for examining the state of a GemFire system. It provides access to a very large number of statistics collected by GemFire, covering GemFire, Java, and OS parameters. No real-time monitoring tool can do that, because the number of statistics that GemFire collects is prohibitive for real-time collection in a distributed system.

Having a complete view into the state of a GemFire process is what makes VSD an indispensable forensic tool: it supports performance analysis and problem tracking through offline analysis of the statistics gathered by the cluster. It is also helpful any time you need to verify the runtime state of a distributed system. For example, after startup or data loading, you can make sure that all the nodes are present, that they see one another, that all the entries are loaded and balanced across all the nodes, and that JVM heaps have enough headroom.

The number of statistics available for viewing in VSD can be overwhelming. This article will point out some of the most important statistics that are useful in verifying the state of a distributed system, including its configuration, resource usage, and throughput for different operations.


Getting Started with VSD

For some earlier GemFire versions, such as 7.x, VSD is included with GemFire and is located in the tools subdirectory of the product directory tree. For more recent versions, VSD is a separate download from the Pivotal Network, offered alongside the specific GemFire version. A brief user guide is included in the GemFire User's Guide.

An important prerequisite for VSD is that the collection of GemFire statistics must be enabled at runtime. That is accomplished by setting the configuration properties as follows:

statistic-sampling-enabled=true
statistic-archive-file=myStats.gfs


Because collecting statistics at the default sampling rate of one second does not affect performance, it should always be enabled in development, testing, and production.
 

Note: It is also possible to enable statistics without the need to bring down the GemFire cluster. This can be done with the gfsh "alter runtime" command.


There is a special category of statistics called time-based statistics that can be very useful in troubleshooting and assessing the performance of some GemFire operations, but they should be used with caution because their collection can affect performance. They can be enabled using the following property:

enable-time-statistics=true


Limit disk space usage

As with log files, it is important to configure statistics rolling to manage disk space usage. To set up rolling of statistics files, use the following parameters:

archive-disk-space-limit=1000
archive-file-size-limit=100


This will cause gfs files to roll when they reach 100MB and keep the last 10 files, reaching a maximum of 1GB of used disk space. File sizes may differ from environment to environment in order to strike the right balance between disk space usage, archiving, and easy handling of files.
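The arithmetic behind these two limits can be checked with a short sketch. The function name is hypothetical, and the sketch assumes both limits are expressed in megabytes, as in the example above; the real on-disk total can differ slightly because of the file currently being written.

```python
def max_archived_files(disk_space_limit_mb: int, file_size_limit_mb: int) -> int:
    """Approximate number of rolled .gfs files kept before the oldest are removed.

    Illustrative helper only: it divides the total disk budget by the
    per-file size limit, mirroring the 1000 MB / 100 MB example.
    """
    if file_size_limit_mb <= 0:
        raise ValueError("file size limit must be positive for rolling to apply")
    return disk_space_limit_mb // file_size_limit_mb


files = max_archived_files(1000, 100)
print(files)  # 10
```

Trying a few combinations this way can help when choosing limits that balance retention against disk usage.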
 

Analyzing the Data

Once a distributed system is up and running, every GemFire instance will create its own statistics files. The best way of loading these files into VSD is to copy all the statistics files into one directory and then add them as parameters when launching VSD. To do this, it is important to name each server's statistics files differently. Using the host plus member name is good practice.

When comparing statistics with events from the GemFire logs, note that VSD shows times in the time zone of the machine running VSD, not the time zone in which the statistics and logs were created. Setting the time zone before launching VSD helps in interpreting data and correlating events with log entries. See the article listed under Additional Information for details on getting the time zone from the gfs files and on creating a script that launches VSD with the correct time zone.
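Collecting the statistics files into one directory under unique names can be scripted. The following is a minimal sketch, not part of VSD or GemFire: the function name, the (host, member, path) input layout, and the host_member_ file prefix are all assumptions chosen for illustration.

```python
import shutil
from pathlib import Path


def collect_stats(sources, dest_dir):
    """Copy each member's .gfs file into dest_dir under a unique name.

    `sources` is a list of (host, member, path) tuples. This layout is an
    assumption for illustration; adapt it to however your files are gathered.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for host, member, path in sources:
        src = Path(path)
        # Prefix with host and member so identically named archives do not collide.
        target = dest / f"{host}_{member}_{src.name}"
        shutil.copy2(src, target)
        copied.append(target)
    return copied
```

For example, calling collect_stats with two entries that both point to a file named myStats.gfs would produce two distinct files in the destination directory, which can then be passed to VSD together.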

Once you have VSD running and the statistics archives loaded, it will be populated with a very large number of metrics.

Make sure that the statistics from all members cover the time at which the incident happened. You can check this by opening a graph for any of the metrics and then selecting:
 

  • Chart - Time Format - Month/day

The number of types and parameters in each section is quite overwhelming. Setting "Main - No Flatlines" helps by showing only those parameters that changed value during the time span of the statistics file.


Overview of principal statistics

Begin by taking a look at the Quick Guide to Useful Statistics in the GemFire User's Guide. The following are additional checks to make:
 

Basic health check

Open the type statSampler and the parameter "delayDuration". This should be roughly a straight line showing the sampling rate configured. If there are many deviations from the flat line and these are over 100%, the system is having trouble.

Another important statistic shown in statSampler is jvmPauses. These are not necessarily stop-the-world full garbage collection pauses; they can also reflect a lack of resources that prevents the statistics sampler from collecting data on schedule.

These events are also logged in the member logs with the following message:

[warning 2015/01/21 13:39:17.935 CET <Thread-6 StatSampler> tid=0x2e] Statistics sampling thread detected a wakeup delay of 3,173 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
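When scanning member logs for this warning, a small script can extract the delay values so they can be compared across members. The sketch below is an assumption-laden example: the function name is hypothetical, and the pattern relies on the message wording and the comma thousands separator shown in the sample line above.

```python
import re

# Matches the delay value in the sampler warning, e.g. "3,173 ms".
WAKEUP_DELAY = re.compile(r"detected a wakeup delay of ([\d,]+) ms")


def wakeup_delays_ms(log_lines):
    """Return the wakeup delay in milliseconds from each matching log line."""
    delays = []
    for line in log_lines:
        match = WAKEUP_DELAY.search(line)
        if match:
            # Drop the thousands separator before converting to an integer.
            delays.append(int(match.group(1).replace(",", "")))
    return delays


sample = (
    "[warning 2015/01/21 13:39:17.935 CET <Thread-6 StatSampler> tid=0x2e] "
    "Statistics sampling thread detected a wakeup delay of 3,173 ms, "
    "indicating a possible resource issue. Check the GC, memory, and CPU statistics."
)
print(wakeup_delays_ms([sample]))  # [3173]
```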



LinuxSystemStats - ioWait

ioWait is another useful health indicator if you are using persistence. It shows the percentage of time spent waiting on I/O operations. It should be below 10% if the system is healthy.

Recommendation: Use local disks for persistence instead of network storage. If using network storage, SAN is recommended over NFS.



distributionStats - nodes

This statistic shows the number of known nodes in the distributed system. Use it to check whether any nodes go down or come up after system startup. A flat line means that this node was the last to come up. Otherwise, you will see a staircase-shaped graph.


distributionStats - replyWaitsInProgress

This can go up and down. It is a problem if it doesn't come down to zero. In this case, the member is waiting for acknowledgment from another member, so you should look for the member that it is waiting for. If there are nodes stuck at a non-zero value, you will need thread dumps from these members to figure out what is deadlocked.

 

ParNew

As a guideline, ParNew collections should occur somewhere between once per second and once every 15 seconds. More than one ParNew collection per second is bad. Collection time should be a low percentage of total time.
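The guideline can be expressed as a simple rate check. This is only a sketch: the function name and the status labels are hypothetical, and the inputs are assumed to be a count of ParNew collections read from the graph over a chosen time span.

```python
def parnew_rate_status(collections: int, elapsed_seconds: float) -> str:
    """Classify a ParNew collection rate against the guideline range.

    Guideline from the text: between one collection per second and one
    collection every 15 seconds.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    rate = collections / elapsed_seconds  # collections per second
    if rate > 1.0:
        return "too frequent"
    if rate < 1.0 / 15.0:
        return "below guideline range"
    return "within guideline"


print(parnew_rate_status(120, 60))  # too frequent
print(parnew_rate_status(30, 60))   # within guideline
```

A rate below the range is not necessarily a problem; the label only marks that it falls outside the stated guideline.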

 

CMSOldGen - Heapmemory

Check the metrics currentMaxMemory and currentUsedMemory under CMSOldGen-Heapmemory. If currentUsedMemory just climbs continually, either no garbage is being created or something is wrong with GC.



CPU usage

For LinuxSystemStats - cpuActive, check against cpuUser and cpuSystem to determine whether the CPU is being used by GemFire or by a third-party process.

 

LinuxSystemStats - contextSwitches

It is a bad sign if CPU usage is high when contextSwitches is high.

  • diskTime
  • diskTimeInProgress

If diskTime is at 750 milliseconds on a host with 4 CPUs, roughly 20% of CPU time is being spent on disk I/O. To find out how many CPUs the host has, check vmStats - cpus.
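One way to reproduce this estimate is shown below. The sketch assumes that the disk time value is measured in milliseconds per one-second sample interval and that it is divided across all CPUs; both are assumptions about how the figure was derived, and the function name is hypothetical. Under those assumptions, 750 ms on 4 CPUs gives 0.1875, which the text rounds to 20%.

```python
def disk_io_cpu_share(disk_time_ms: float, cpus: int, interval_ms: float = 1000.0) -> float:
    """Fraction of total CPU time in one sample interval attributed to disk I/O.

    Assumes disk_time_ms is accumulated over one interval of interval_ms and
    that total available CPU time is cpus * interval_ms.
    """
    if cpus <= 0 or interval_ms <= 0:
        raise ValueError("cpus and interval_ms must be positive")
    return disk_time_ms / (cpus * interval_ms)


share = disk_io_cpu_share(750, 4)
print(share)  # 0.1875
```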



LinuxSystemStats - loadAverage

This shows how many threads are running concurrently.


LinuxSystemStats - freeMemory-linuxStats

Use this to get an idea of how much you can increase the heap for GemFire. The value should start at physicalMemory and go down. If it starts below physicalMemory, other processes besides the OS and GemFire are using memory. How many members are running on the same host or VM? There may be several members competing for memory.


DiskRegionStatistics - xxxxxCache

  • entriesInVM
  • entriesOnlyOnDisk


Show all of them, with the legend turned off (Chart - Show Legend).

Note that when an entry is evicted to disk, there is still the key and a map to the value stored in memory so GemFire can retrieve the value from the disk. If the keys are relatively big in comparison to the values, eviction will not free up much space.

 

CacheServerStats - currentClientConnections

Check this across all members to see whether the load is evenly spread across the cluster.

 

Additional Information

https://community.pivotal.io/s/article/How-to-Open-Statistics-Archive-Files-using-a-Specific-Time-Zone-in-Visual-Statistics-Display