GemFire: Node forced out of cluster running on vSphere

Article ID: 293973

Updated On:

Products

VMware Tanzu GemFire

Issue/Introduction

A server node is forced out of a GemFire cluster running on vSphere. The GemFire logs show no wakeup delays, and the statistics and GC logs show no long GC pauses that would explain why the node became unresponsive and was forced out.

Environment

All supported GemFire Versions.

Cause

In certain configurations on vSphere, a GemFire node may be forced out of the cluster when vMotion or vSphere Snapshots temporarily make the virtual machine unresponsive.

Several conditions may result in this behavior:

  • During periods of elevated workload, if the member targeted for vMotion is hosting the primary bucket for write operations, the server node can exceed the configured 15‑second member-timeout. After two failed health checks, the cluster determines the member is unresponsive and removes it from cluster membership.  
  • Another possible cause is that vSphere Snapshots are enabled. Snapshots and other third‑party backup mechanisms are not supported with GemFire, as these operations can temporarily freeze the virtual machine and block I/O activity. When this occurs, the GemFire member becomes temporarily unavailable to the cluster, leaving no diagnostic traces in GemFire logs or statistics.
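
The timing relationship in the first condition can be sketched in Java. This is a simplified illustration, not GemFire code: the class `MemberTimeoutCheck` and its methods are hypothetical, the pause durations are made-up inputs, and it assumes `member-timeout` is set in milliseconds in a gemfire.properties-style file. Actual failure detection involves additional health checks, so treat the comparison as a rough guide only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; it is not part of GemFire.
final class MemberTimeoutCheck {

    // Reads member-timeout (assumed to be in milliseconds) from
    // gemfire.properties-style text.
    static long readMemberTimeoutMillis(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("member-timeout");
        if (value == null) {
            throw new IllegalArgumentException("member-timeout is not set");
        }
        return Long.parseLong(value.trim());
    }

    // Simplified model: a VM pause puts the member at risk of removal
    // when it lasts longer than the configured member-timeout.
    static boolean pauseExceedsTimeout(long pauseMillis, long memberTimeoutMillis) {
        return pauseMillis > memberTimeoutMillis;
    }

    public static void main(String[] args) {
        long timeout = readMemberTimeoutMillis("member-timeout=15000\n");
        System.out.println(pauseExceedsTimeout(20_000, timeout)); // hypothetical 20 s pause
        System.out.println(pauseExceedsTimeout(2_000, timeout));  // hypothetical 2 s pause
    }
}
```

With the 15‑second timeout described above, the first call reports that a 20-second pause exceeds the timeout and the second reports that a 2-second pause does not.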

Resolution

As documented in Improving Performance on vSphere, vMotion must be disabled when running GemFire. vMotion may cause nodes to appear unresponsive during migration, resulting in the member being forced out of the cluster.

Disabling vSphere Snapshots is also recommended.



Additional Information

 

  • A future GemFire release will include an enhancement to support a graceful server shutdown when a vMotion signal is detected.  
  • This enhancement will prevent unnecessary member loss by allowing the GemFire process on the affected ESXi host to stop gracefully before migration.  
  • After the vMotion operation completes and the host stabilizes, the node can be safely restarted and rejoin the cluster.  
  • Subscribers can monitor this article for updates on the availability of the enhancement.