Description:
Messages of the type HA check failed: There are not enough available resources to restart components running on 2 servers [srvX,srvY, srvZ…] are sometimes observed in the controller logs. They indicate that the resources free on the rest of the grid are not enough to restart all the components running on any one of the listed servers if that server goes down for any reason.
Solution:
One of the main features of AppLogic is Application High Availability. This implies that if one of the grid servers goes down, the remaining servers must be able to take over its role and restart the components that were running on that server.
Grid resources are usually quoted as a total amount of CPU, memory and bandwidth, but each node contributes only a specific amount to that global figure, and each component has its own requirements. The ability to restart components when a server goes down is therefore constrained by the free resources on each individual node and by the requirements of each component, not only by the grid-wide totals.
As a result, HA may not be possible even when enough resources are free globally, and even when individual nodes still have free resources.
Consider an example. Imagine a grid with the following distribution of resources (each pair of figures is used/free):
server srv1 : role primary, state up(enabled), 4.25/3.65 cpu, 12797/10142 MB mem, 801/1199 Mbps bw
server srv2 : role secondary, state up(enabled), 7.00/1.00 cpu, 21468/2239 MB mem, 1800/200 Mbps bw
server srv3 : role secondary, state up(enabled), 6.00/2.00 cpu, 12270/11437 MB mem, 911/1089 Mbps bw
server srv4 : role none, state up(enabled), 7.95/0.05 cpu, 22164/1543 MB mem, 1651/349 Mbps bw
server srv5 : role none, state up(enabled), 8.00/0.00 cpu, 16384/7323 MB mem, 1411/589 Mbps bw
server srv6 : role none, state up(enabled), 6.75/1.25 cpu, 17372/6335 MB mem, 1350/650 Mbps bw
server srv7 : role none, state up(enabled), 6.00/2.00 cpu, 20592/3115 MB mem, 1780/220 Mbps bw
The applications running on srv2 are the following:
1 AP1: running, 0.50 cpu, 1536 MB, 500 Mbps
2 AP2: running, 1.00 cpu, 6144 MB, 500 Mbps
3 AP3: running, 0.25 cpu, 750 MB, 100 Mbps
4 AP4: running, 3.00 cpu, 6144 MB, 300 Mbps
5 AP5: running, 2.00 cpu, 6144 MB, 300 Mbps
6 AP6: running, 0.25 cpu, 750 MB, 100 Mbps
In this particular case the components on srv2 require 7.00 CPU, 21468 MB and 1800 Mbps in total, so in theory the grid as a whole has enough free resources to accommodate them. However, the message
HA check failed: There are not enough available resources to restart components running on 1 servers [srv2]
will be logged on the controller (other servers may have the same problem; srv2 is used here only as an example).
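The grid-wide claim can be checked with a few lines of arithmetic. The sketch below is only an illustration: it sums the free figures of every server except srv2, copied from the listing above, and compares them with what srv2 currently uses. The variable names are made up for this example and are not part of AppLogic.

```python
# Free (cpu, memory MB, bandwidth Mbps) on every server except srv2,
# taken from the second figure of each pair in the server listing.
free = {
    "srv1": (3.65, 10142, 1199),
    "srv3": (2.00, 11437, 1089),
    "srv4": (0.05, 1543, 349),
    "srv5": (0.00, 7323, 589),
    "srv6": (1.25, 6335, 650),
    "srv7": (2.00, 3115, 220),
}

# Resources in use on srv2 (the sum of AP1..AP6).
needed = (7.00, 21468, 1800)

# Grid-wide free totals per resource; rounded to hide float noise.
totals = tuple(round(sum(f[i] for f in free.values()), 2) for i in range(3))
globally_enough = all(t >= n for t, n in zip(totals, needed))

print("free totals:", totals)
print("enough in aggregate:", globally_enough)
```

The totals exceed what srv2 uses for every resource, which is why the aggregate view alone suggests that the restart should be possible.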
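In this case, if srv2 fails, AppLogic will try to allocate its components starting with the server with the least resources available, srv6, then srv7, srv3 and finally the controller, srv1 (srv4 and srv5 have practically no free CPU, 0.05 and 0.00, so they cannot host even the smallest component). Each component has to fit on a single server, so the larger ones have very few candidates: only srv1 has the 3.00 CPU that AP4 needs, and of the servers with 2.00 CPU free for AP5, srv7 has only 3115 MB of memory against the 6144 MB required. With this order, the capacity of the few servers able to host the larger components can be used up by smaller ones before the larger ones are placed.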
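The placement can be simulated to see where it breaks. The following is a minimal sketch under loud assumptions: it is not AppLogic's actual scheduler. It assumes a simple first-fit pass that visits servers in ascending order of free CPU and memory (which reproduces the srv6, srv7, srv3, srv1 order given above) and places components in the order they are listed. The function name first_fit is hypothetical.

```python
# (name, free cpu, free memory MB, free bandwidth Mbps); srv2 is the failed server.
servers = [
    ("srv1", 3.65, 10142, 1199),
    ("srv3", 2.00, 11437, 1089),
    ("srv4", 0.05, 1543, 349),
    ("srv5", 0.00, 7323, 589),
    ("srv6", 1.25, 6335, 650),
    ("srv7", 2.00, 3115, 220),
]

# (name, cpu, memory MB, bandwidth Mbps) of the components running on srv2.
components = [
    ("AP1", 0.50, 1536, 500),
    ("AP2", 1.00, 6144, 500),
    ("AP3", 0.25, 750, 100),
    ("AP4", 3.00, 6144, 300),
    ("AP5", 2.00, 6144, 300),
    ("AP6", 0.25, 750, 100),
]


def first_fit(servers, components):
    """Assumed policy: least-free server first, components in listed order."""
    order = sorted(servers, key=lambda s: (s[1], s[2]))
    free = {name: [cpu, mem, bw] for name, cpu, mem, bw in order}
    placed, unplaced = {}, []
    for comp, *need in components:
        for name, avail in free.items():
            # A component must fit entirely on one server.
            if all(a >= n for a, n in zip(avail, need)):
                free[name] = [a - n for a, n in zip(avail, need)]
                placed[comp] = name
                break
        else:
            unplaced.append(comp)
    return placed, unplaced


placed, unplaced = first_fit(servers, components)
print("placed:", placed)
print("unplaced:", unplaced)
```

Under these assumptions one component is left without a server even though the grid-wide totals are sufficient. The exact outcome depends on the assumed ordering, so the sketch only illustrates the kind of failure the HA check reports.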
Hence, in this example, HA cannot be ensured if srv2 goes down. In general, it is recommended to keep at least one node in the grid with almost no applications running, so that it can accommodate components from one or several nodes that fail or restart. Grids should be provisioned with enough resources that they do not run at the limit of their capacity.