First, a related KB article is recommended reading before this one:
GemFire: How to avoid pitfalls of split brain: Do NOT use less than 3 Cache Servers!!
Second, here we describe how the weight settings work. By default, as described in the article above, every locator has a base weight of 3. Every cache server (CS) has a base weight of 10, and the oldest running CS, the lead, gets an additional 5, so it weighs 15.
Note that this additional 5 "moves around" between CS's: if the current Lead leaves the distributed system (DS) for any reason, the CS that becomes the new Lead takes on the extra weight.
If you set member-weight to N for any or all locators, each affected locator will weigh (base weight + configured additional weight) = (3 + N). So if you set member-weight to 5 in a locator's gemfire.properties file, that locator will weigh 8. If the locator fails and is restarted or auto-reconnects, it will still weigh 8. It may or may not become the coordinator, but a locator's weight does not change based on whether it is the coordinator.
For cache servers, it's only slightly more complicated. If you set member-weight to M in a CS's gemfire.properties file, that CS will weigh:
(base weight + configured additional weight + additional LEAD weight if LEAD) =
(10 + M + 0 if not lead) OR
(10 + M + 5 if Lead)
The LEAD cache server is the oldest running CS. Assume member-weight is set to 6. When starting a brand new DS, the first CS started will weigh (10+6+5)=21. All other cache servers configured with the same member-weight of 6 will initially weigh (10+6+0)=16.
Later, if your lead member gets kicked out, even briefly, and auto-reconnects, it will no longer be the LEAD. In this case, your second oldest member becomes the lead, so its weight goes from 16 to 21 in the example above, and the old lead, now reconnecting as the youngest CS, will simply weigh (10+6+0)=16.
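The weight rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in this article, not GemFire code; the function names and the "oldest first" list used to model lead election are assumptions made for the example.

```python
LOCATOR_BASE = 3   # base weight of a locator (from this article)
SERVER_BASE = 10   # base weight of a cache server (from this article)
LEAD_BONUS = 5     # extra weight carried by the lead (oldest running) CS


def locator_weight(member_weight=0):
    """Weight of one locator: base weight plus configured member-weight."""
    return LOCATOR_BASE + member_weight


def server_weights(join_order, member_weight=0):
    """Weights of cache servers, given their names ordered oldest first.

    Hypothetical model: the first (oldest) entry is treated as the lead.
    """
    return {
        name: SERVER_BASE + member_weight + (LEAD_BONUS if i == 0 else 0)
        for i, name in enumerate(join_order)
    }


servers = ["cs1", "cs2", "cs3"]
before = server_weights(servers, member_weight=6)

# cs1 is kicked out and auto-reconnects, so it rejoins as the youngest member.
servers = servers[1:] + ["cs1"]
after = server_weights(servers, member_weight=6)

print(locator_weight(5))  # locator with member-weight 5
print(before)
print(after)
```

With member-weight 6, cs1 starts at 21 and the others at 16; after cs1 reconnects, cs2 carries the lead weight of 21 and cs1 drops to 16.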
More importantly, how do you use all of the above to serve your purposes? Here we assume that the goal is better cluster stability, resiliency, and recovery of your system back to full health.
It would be difficult to find a scenario where altering weights helps a customer achieve better membership-related stability. Much study went into determining our default weights for locators, cache servers, and leads. That said, a couple of customers use an interesting configuration which seems to offer some value.
Unfortunately, it is limited to very small clusters.
Scenario: 4 Locators, 3 Cache Servers
By default, the total weight is (3+3+3+3 + 15+10+10)=(12+35)=47.
Half of that total weight is 23.5. Suppose CS(lead)+CS(non-lead) crash simultaneously. That is a loss of 25 in one view, which is more than half. Our Network Partition Detection algorithm will drive a shutdown of the entire cluster in this scenario, even though a 3rd CS was still running and able to service your business. It does this to avoid other issues related to data divergence, which are beyond the scope of this article.
To avoid this failure, it is possible to set member-weight to 6 in each locator's properties file. As a result, each locator will weigh (3+6)=9. Now the total weight of the DS is (9+9+9+9 + 15+10+10) = (36+35)=71. With this configuration, it is actually possible to experience a simultaneous failure of all cache servers, a loss of 35, and yet allow the locators to remain up and running (35 is less than 50% of 71).
There is definitely some stability to be gained in this specific case, even if all CS's are temporarily down. If we lose only CS(lead)+CS(non-lead), we would still have (9+9+9+9 + 10)=46 remaining, so the system would stay up and running to serve clients. If you have full redundancy on each of your 3 CS's, it's a win.
Note: (9+9+9+9 + 10) would quickly change to (9+9+9+9 + 15) as the surviving non-lead CS becomes LEAD.
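As a quick check of the numbers in this scenario, the following Python sketch compares the default configuration with locators set to member-weight 6. It assumes a simplified rule, taken from the wording of this article, that quorum is lost when more than half of the total weight disappears in a single view; the product's exact threshold may differ, and the function names are made up for the example.

```python
LOCATOR_BASE = 3
SERVER_BASE = 10
LEAD_BONUS = 5


def weights(n_locators, n_servers, locator_member_weight=0):
    """Return (locator_total, server_total); one server is assumed to be lead."""
    locator_total = n_locators * (LOCATOR_BASE + locator_member_weight)
    server_total = n_servers * SERVER_BASE + (LEAD_BONUS if n_servers else 0)
    return locator_total, server_total


def quorum_lost(total, lost):
    """Assumed rule: quorum is lost if MORE than half the weight is lost in one view."""
    return 2 * lost > total


# Weight of the lead CS plus one non-lead CS crashing together.
lead_plus_one = (SERVER_BASE + LEAD_BONUS) + SERVER_BASE

for extra in (0, 6):
    loc, srv = weights(4, 3, locator_member_weight=extra)
    total = loc + srv
    print(
        f"locator member-weight={extra}: total={total}, "
        f"lose lead+1 CS -> quorum lost: {quorum_lost(total, lead_plus_one)}, "
        f"lose all CS -> quorum lost: {quorum_lost(total, srv)}"
    )
```

Under this assumed rule, the default total of 47 does not survive the loss of 25, while the total of 71 survives both the loss of 25 and the loss of all 35 of cache server weight.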
If you have a larger system with more members, the above approach is not feasible.
Finally, for smaller clusters, it becomes more important to have locators running on different hosts than servers, to avoid losing too much weight in one view.
To provide even more detail here, you can actually lose more than 50% of your weight over time and keep your system running. Using the above example, suppose you lose 2 cache servers with our default configuration. As shown above, the default total weight is (3+3+3+3 + 15+10+10)=(12+35)=47.
Suppose you lose 2 CS's, but not simultaneously (not in 1 new view).
Lose one CS. Whether it is the lead or a non-lead does not change the outcome, so take the heavier case:
Lose CS(lead)=15 at time T0.
New weight = (3+3+3+3 + 10+10) = 32. Lost weight = 15 (less than half of 47).
So we stay up and running. The new view will promote one of the remaining CS's to lead, so the final weight after the failure will be (3+3+3+3 + 15+10)=37.
Suppose then, even seconds later at time T1, your system experiences another failure. We use the current total weight of 37 to determine whether you have lost more than half of your total weight. You would need to lose at least 19 more in weight for our Network Partition Detection to shut down the entire cluster. So, if you lose another CS, whether the new lead (15) or a non-lead CS (10), it weighs less than 19, and thus your system does not lose quorum.
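The difference between losing members simultaneously and losing them one view at a time can be sketched as follows. This is an illustrative Python model only: it uses the default weights from this article, assumes that quorum is lost when more than half of the previous view's weight disappears, and assumes the lead weight moves to a surviving CS before the next failure. The function names are hypothetical.

```python
LOCATOR = 3
SERVER = 10
LEAD_BONUS = 5


def view_weight(n_locators, n_servers):
    """Total weight of a view with default weights; one CS is lead if any exist."""
    lead = LEAD_BONUS if n_servers > 0 else 0
    return n_locators * LOCATOR + n_servers * SERVER + lead


def quorum_lost(previous_total, lost):
    """Assumed rule: more than half of the previous view's weight was lost."""
    return 2 * lost > previous_total


def lose_lead_one_by_one(n_locators, n_servers, failures):
    """Crash the current lead CS `failures` times, one view per crash.

    Returns a list of (total_before, lost, quorum_lost) tuples.
    """
    history = []
    for _ in range(failures):
        total = view_weight(n_locators, n_servers)
        lost = SERVER + LEAD_BONUS
        failed = quorum_lost(total, lost)
        history.append((total, lost, failed))
        if failed:
            break
        n_servers -= 1  # a surviving CS becomes lead in the next view
    return history


sequential = lose_lead_one_by_one(n_locators=4, n_servers=3, failures=2)
simultaneous = quorum_lost(view_weight(4, 3), (SERVER + LEAD_BONUS) + SERVER)

print(sequential)
print(simultaneous)
```

In this model the two sequential crashes are evaluated against totals of 47 and then 37, and neither loses quorum, whereas the same two servers crashing in a single view (a loss of 25 out of 47) does.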
In summary, it is very important to understand whether you are truly gaining value by using weights that differ from our default GemFire configuration. If you believe you have such a case, please open a new Support ticket to make sure you are truly gaining stability.