GemFire: How to avoid pitfalls of split brain: Do NOT use fewer than 3 Cache Servers!!

Article ID: 294462


Products

VMware Tanzu GemFire

Issue/Introduction

A GemFire Distributed System using only 2 Locators and 2 Cache Servers is not recommended. This is especially true when the Locators run on the same hosts as the Cache Servers, or are vulnerable to the same infrastructure issues that could negatively impact the Cache Servers.

All GemFire members of the Distributed System have a "weight". Locators have a weight of 3 by default. All Cache Servers have a weight of 10, EXCEPT the oldest surviving Cache Server, designated as the Lead, which has a weight of 15.

For the 2-Locator/2-Cache Server layout, that is a total weight of 15+10+3+3=31. If the GemFire Distributed System (DS) ever loses more than 50% of that weight in one instant (one membership view change), GemFire initiates its network partition (split-brain) handling, which is enabled by default (enable-network-partition-detection=true).

This can result in the ENTIRE DS shutting down, which is something customers do not always realize when creating these clusters. This happens, for example, in the following situation:

L1 and CS1 (Lead) on host X  (3 + 15 = 18)
L2 and CS2 on host Y  (3 + 10 = 13)

If host X fails for any reason (for example, it gets unplugged), it becomes unresponsive from the perspective of the members on host Y. One might hope that the host Y processes, JVMs L2 and CS2, would continue to run unimpacted and serve the applications/clients. Unfortunately, that is not the case by default.

L2 and CS2 determine that quorum is lost AND that they are on the losing side of that equation, holding only 13/31 of the total weight. Given that more than 50% of the weight (18/31) is now unresponsive or lost for whatever reason, L2 and CS2 shut themselves down, as they are on the losing side of the split brain. These processes have no way of knowing whether L1 and CS1 are up and running or have crashed. To protect the system from data divergence, where the two sides' data becomes irreconcilably out of sync, the losing side shuts itself down.

IF L1 and CS1 are in fact still running, they are on the winning side of the split (18/31), so they continue to function and serve clients, and clients ultimately find themselves connected to this surviving server. When the lost/disconnected/crashed/unreachable processes rejoin the fully healthy cluster, clients are balanced over time across all members.
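
To make the arithmetic concrete, here is a minimal, illustrative Java sketch of the quorum check for this two-host layout. It is not GemFire code; it simply applies the default weights and the "more than 50% lost" rule described above.

// Illustrative only: GemFire performs this check internally when
// enable-network-partition-detection=true. Default weights: Lead Cache
// Server = 15, other Cache Servers = 10, Locators = 3.
public class TwoHostQuorumExample {
    public static void main(String[] args) {
        int hostX = 3 + 15;                    // L1 (Locator) + CS1 (Lead) = 18
        int hostY = 3 + 10;                    // L2 (Locator) + CS2        = 13
        int total = hostX + hostY;             // 31

        // Host X fails: more than 50% of the weight is lost in one view.
        boolean quorumLost = hostX > total / 2.0;   // 18 > 15.5 -> true
        System.out.printf("Lost %d of %d; surviving members shut down: %b%n",
                hostX, total, quorumLost);
    }
}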

Environment

Product Version: 9.15
OS: ALL

Resolution

We do not recommend disabling network partition detection (enable-network-partition-detection=true). If you set it to false, you will experience different behavior than described above: if these 2 hosts lose connectivity with each other, they will each form their own distributed system (DS A and DS B), and the data in these two systems can diverge to an unrecoverable state, requiring manual intervention and resulting in data loss.
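
For reference, the following is a minimal sketch of leaving partition detection enabled when a peer cache is created programmatically, assuming the GemFire 9.x (org.apache.geode) API; connection properties such as locators are omitted for brevity. In a gfsh-managed cluster you would normally leave this property at its default or set it in gemfire.properties.

import java.util.Properties;
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;

public class PartitionDetectionExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Keep network partition detection enabled (the default); do NOT set it to false.
        props.setProperty("enable-network-partition-detection", "true");
        // Other required properties (locators, member name, etc.) are omitted for brevity.

        Cache cache = new CacheFactory(props).create();
        System.out.println("Member started with network partition detection enabled");
        cache.close();
    }
}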

The resolution to this entire issue is to have more members. We recommend at least 3 Cache Servers, so that the loss of a single host cannot remove more than 50% of the weight in one view.

Some have also chosen to place the Locators on separate hosts, so that losing the host running the Lead Cache Server (weight 15) does not result in a loss of more than 50% of the weight (15/31), and the other Cache Server and 2 Locators continue to function and serve clients.
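
As a rough illustration of why these layouts help, the sketch below compares the worst single-host loss in each case, using the default weights and assuming one Cache Server per host with Locators on their own hosts where noted.

// Illustrative arithmetic only, using the default member weights
// (Lead Cache Server = 15, other Cache Servers = 10, Locators = 3).
public class LayoutComparison {
    public static void main(String[] args) {
        // Layout A: 2 Cache Servers + 2 Locators on separate hosts.
        // Worst single-host loss is the Lead Cache Server host (weight 15).
        report("2 CS + 2 separate Locators", 15, 15 + 10 + 3 + 3);        // 15 of 31

        // Layout B: 3 Cache Servers + 2 Locators on separate hosts.
        report("3 CS + 2 separate Locators", 15, 15 + 10 + 10 + 3 + 3);   // 15 of 41

        // Layout C (not recommended): 2 Cache Servers with co-located Locators.
        // Worst single-host loss is the Lead Cache Server plus its Locator.
        report("2 CS + co-located Locators", 15 + 3, 15 + 10 + 3 + 3);    // 18 of 31
    }

    static void report(String layout, int lostWeight, int totalWeight) {
        // Quorum is lost when more than 50% of the total weight disappears at once.
        boolean quorumLost = lostWeight > totalWeight / 2.0;
        System.out.printf("%s: lose %d of %d -> quorum lost = %b%n",
                layout, lostWeight, totalWeight, quorumLost);
    }
}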

Even with larger systems, say 12 members, it is possible to experience a major network split. It is important to understand your weak points, where a single failure of some piece of hardware, for example, could result in a loss of quorum. You want to keep enable-network-partition-detection=true, but you also want to greatly reduce the chance that a single failure removes more than 50% of the weight.

Example:

Suppose Rack 1:
Host 1   15 + 3   (Lead Cache Server + Locator)
Host 2   10 + 3   (Cache Server + Locator)
Host 3   10       (Cache Server)
Host 4   10       (Cache Server)
Host 5   10       (Cache Server)
Host 6   10       (Cache Server)

Suppose Rack 2:
Host 7   10 + 3   (Cache Server + Locator)
Host 8   10 + 3   (Cache Server + Locator)
Host 9   10       (Cache Server)
Host 10  10       (Cache Server)
Host 11  10       (Cache Server)
Host 12  10       (Cache Server)

==========
Total Rack 1 weight = 71
Total Rack 2 weight = 66
Total weight = (total Cache Server weight + total Locator weight) = 125 + 12 = 137

50% of 137 = 68.5  =>  losing a weight of 69 or more is a loss of quorum.

In this example, a full crash of Rack 1 causes ENTIRE cluster failure, because more than 50% of the weight is lost (71 > 68.5). With a full crash of Rack 2, Rack 1 survives, since it is on the winning side of the split (71/137).

If you want to avoid losing the whole cluster to a full Rack 1 crash, you can increase protection by adding 2 Locators on another host that is not in either of those racks.

Change:
Add Host 13   3 + 3 = 6   (two Locators)
New total weight = 137 + 6 = 143
50% of 143 = 71.5

Therefore, a full crash of either rack no longer results in complete failure, because the remaining weight is greater than 50%. If Rack 1 crashes, the remaining weight is Rack 2 + Host 13 = 66 + 6 = 72, and the lost weight of 71 is less than 50% of 143 (71.5), so the Rack 2 JVMs do not shut down.
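
The same arithmetic can be checked quickly with a small, illustrative Java sketch (not GemFire code) that reproduces the rack example above, with and without the extra Locator host:

// Illustrative check of the rack example above, using the default weights.
public class RackQuorumExample {
    public static void main(String[] args) {
        int rack1  = (15 + 3) + (10 + 3) + 10 + 10 + 10 + 10;   // Hosts 1-6  = 71
        int rack2  = (10 + 3) + (10 + 3) + 10 + 10 + 10 + 10;   // Hosts 7-12 = 66
        int host13 = 3 + 3;                                      // two extra Locators = 6

        check("Without Host 13, Rack 1 crashes", rack1, rack1 + rack2);            // 71 of 137
        check("Without Host 13, Rack 2 crashes", rack2, rack1 + rack2);            // 66 of 137
        check("With Host 13, Rack 1 crashes",    rack1, rack1 + rack2 + host13);   // 71 of 143
        check("With Host 13, Rack 2 crashes",    rack2, rack1 + rack2 + host13);   // 66 of 143
    }

    static void check(String scenario, int lostWeight, int totalWeight) {
        // Quorum is lost when more than 50% of the total weight disappears at once.
        boolean survivorsShutDown = lostWeight > totalWeight / 2.0;
        System.out.printf("%s: lost %d of %d -> surviving side shuts down = %b%n",
                scenario, lostWeight, totalWeight, survivorsShutDown);
    }
}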

There is much more detail on network partition detection in the GemFire documentation. At the end of the day, the goal is to protect the system as much as possible, and having only 2 Cache Servers with co-located Locators does not achieve this protection.