Overview
VMware Fault Tolerance provides continuous availability for virtual machines by keeping a secondary virtual machine running and in sync with the protected primary virtual machine, so that the secondary can take over if a complete ESX host failure occurs.
However, some ESX host component failures may not cause complete server failure. In these cases, Fault Tolerance may appear to behave inconsistently.
Note: VMware recommends that you dedicate a NIC of 1 Gbit/s or faster to Fault Tolerance logging.
Fault Tolerance failure scenarios
Currently, Fault Tolerance failovers are triggered only when there is no communication between the primary and secondary virtual machines.
These three scenarios may occur:
- A deterministic scenario, where you can predict how a failover will occur
These events are deterministic:
- A complete ESX host failure
- The primary virtual machine process fails (or is non-responsive) on the ESX host
- A Fault Tolerance test is initiated from vCenter Server
- A reactionary scenario, where a failover may occur but you do not know the expected outcome ahead of time
These events are reactionary:
- Fault Tolerance logging NIC communication is interrupted or fails
- Fault Tolerance logging NIC communication is very slow
Reactionary events are not predictable because the primary and secondary virtual machines race to see which will go live. The virtual machine that wins the race stays alive and the other is terminated. The race prevents a split-brain scenario, which can cause data corruption. In these cases you may see inconsistent results, depending on which host wins ownership of the virtual machine.
- A no action taken scenario, where no failover occurs because Fault Tolerance does not monitor for this type of event
Fault Tolerance does not currently detect or respond to events which are not directly involved with its operation. No action is taken for these events:
- Management network interruption or failure
- Virtual machine network interruption or failure
- HBA failures that do not affect the entire host
- Any combination of the above
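The three scenario types, and the race that decides a reactionary event, can be sketched as follows. This is a minimal illustration, not VMware's implementation. The event labels, `Scenario`, `no_action_expected`, `OwnershipArbiter`, `try_go_live`, and `run_race` are all hypothetical names. The lock-guarded claim is an assumption that stands in for whatever mechanism actually arbitrates the race, which this article does not describe.

```python
import threading
from enum import Enum


class Scenario(Enum):
    DETERMINISTIC = "deterministic"
    REACTIONARY = "reactionary"
    NO_ACTION = "no action taken"


# Hypothetical labels for the events listed in this section.
EVENT_SCENARIOS = {
    "host_complete_failure": Scenario.DETERMINISTIC,
    "primary_vm_process_failure": Scenario.DETERMINISTIC,
    "test_failover_from_vcenter": Scenario.DETERMINISTIC,
    "ft_logging_nic_interrupted": Scenario.REACTIONARY,
    "ft_logging_nic_very_slow": Scenario.REACTIONARY,
    "management_network_failure": Scenario.NO_ACTION,
    "vm_network_failure": Scenario.NO_ACTION,
    "partial_hba_failure": Scenario.NO_ACTION,
}


def no_action_expected(events):
    """Return True when every event belongs to the no-action group."""
    events = list(events)
    return bool(events) and all(
        EVENT_SCENARIOS[event] is Scenario.NO_ACTION for event in events
    )


class OwnershipArbiter:
    """Hypothetical arbiter: the first claimant goes live, the other does not."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def try_go_live(self, vm_name):
        # The check and the assignment happen under one lock, so two
        # contenders can never both observe an unowned virtual machine.
        with self._lock:
            if self.owner is None:
                self.owner = vm_name
                return True
            return False


def run_race(arbiter, contenders=("primary", "secondary")):
    """Let each contender try to go live; return a {name: survived} mapping."""
    results = {}

    def contend(name):
        results[name] = arbiter.try_go_live(name)

    threads = [threading.Thread(target=contend, args=(name,)) for name in contenders]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Exactly one contender survives each run, but which one depends on thread scheduling. That is the same reason the outcome of a reactionary event cannot be predicted in advance.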
Testing Fault Tolerance
To test VMware Fault Tolerance properly, communication between the primary and secondary virtual machines must fail. VMware provides a Test Failover function from the virtual machine, which is the best option for testing VMware Fault Tolerance failover. If you want to perform manual failover tests, only deterministic events produce reliable results. Reactionary or no action taken scenarios can produce unexpected results.
These are proper testing scenarios with their expected outcomes:
Note: These tests assume two hosts, Host A and Host B, with the primary fault tolerant virtual machine running on Host A, and the secondary virtual machine running on Host B.
- Select the Test Failover function from the Fault Tolerance menu on the virtual machine.
This tests the Fault Tolerance functionality in a fully supported and non-invasive way. In this scenario, the virtual machine fails over from Host A to Host B, and a secondary virtual machine is started back up again. VMware HA failover does not occur in this case.
- Host A complete failure
This scenario can be accomplished by pulling the host power cable, rebooting the host, or powering off the host from a remote KVM (such as iLO, DRAC, or RSA). The secondary virtual machine on Host B takes over immediately and continues to process information for the virtual machine. VMware HA failover occurs.
- Virtual machine process on Host A fails
This scenario can be accomplished by logging in to Host A and terminating the active process for the virtual machine. The secondary virtual machine takes over and no VMware HA failover occurs. VMware does not recommend testing in this way. For more information on terminating a virtual machine, see Powering off an unresponsive virtual machine on an ESX host (1004340).
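When recording test results, it can help to write the expected outcomes down as data and compare each observation against them. The sketch below is only a bookkeeping aid built from the three scenarios described here. It does not call any VMware API, and the names `ExpectedOutcome`, `EXPECTED`, and `check_result` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedOutcome:
    live_host: str        # host running the virtual machine after failover
    ha_failover: bool     # whether a VMware HA failover is expected
    discouraged: bool     # True where VMware does not recommend the test


# Assumes the primary starts on Host A and the secondary on Host B.
EXPECTED = {
    "test_failover_function": ExpectedOutcome("Host B", False, False),
    "host_a_complete_failure": ExpectedOutcome("Host B", True, False),
    "vm_process_failure_on_host_a": ExpectedOutcome("Host B", False, True),
}


def check_result(scenario, observed_live_host, observed_ha_failover):
    """Return a list of mismatches between observed and expected outcomes."""
    expected = EXPECTED[scenario]
    mismatches = []
    if observed_live_host != expected.live_host:
        mismatches.append(
            f"live host: expected {expected.live_host}, got {observed_live_host}"
        )
    if observed_ha_failover != expected.ha_failover:
        mismatches.append(
            f"HA failover: expected {expected.ha_failover}, got {observed_ha_failover}"
        )
    return mismatches
```

An empty list from `check_result` means the observation matches the expected outcome for that scenario. Reactionary and no action taken events are deliberately left out of `EXPECTED`, because they have no reliable expected outcome.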