vSphere HA Fails to Restart Some VMs During Host Outages

Products

VMware vSphere ESXi

Issue/Introduction

In certain vSphere environments, High Availability (HA) may not restart all virtual machines (VMs) when an ESXi host experiences an outage. This can result in some VMs remaining powered off or in an inconsistent state across multiple hosts, potentially impacting business continuity.

Environment

- VMware vSphere 6.x and later
- Environments with multiple hosts and VMs
- Applicable to various storage configurations (local, shared, SAN, NAS)

Cause

The primary issue stems from an unexpected interaction between vSphere High Availability (HA) and Distributed Resource Scheduler (DRS) during host failover events:

1. vSphere HA initiates VM failover: When an ESXi host fails, HA begins the process of restarting affected VMs on other hosts in the cluster.

2. DRS interference: When DRS runs in fully automated mode, it may attempt to migrate newly restarted VMs to different ESXi hosts with vMotion almost immediately, based on its load-balancing algorithms.

3. Timing conflict: If the target host for DRS vMotion doesn't respond quickly enough, vSphere HA may interpret this as a failed failover attempt.

4. Failed restart declaration: Due to this misinterpretation, HA may prematurely declare the VM restart as failed, leaving the VM in a powered-off state.
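The four-step sequence above can be checked against a timeline of cluster events. The following is a minimal sketch, assuming the events have already been extracted into simple records; the `Event` class, the three event labels, and the 60-second window are hypothetical and are not vSphere identifiers or defaults.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    time: datetime
    vm: str
    kind: str  # hypothetical labels: "ha_restart", "drs_vmotion", "ha_restart_failed"


def find_conflicts(events, window=timedelta(seconds=60)):
    """Return VMs whose events follow restart -> vMotion -> failure within `window`."""
    by_vm = {}
    for ev in sorted(events, key=lambda e: e.time):
        by_vm.setdefault(ev.vm, []).append(ev)

    flagged = []
    for vm, evs in by_vm.items():
        restart, moved = None, False
        for ev in evs:
            if ev.kind == "ha_restart":
                # A new restart attempt resets the pattern for this VM.
                restart, moved = ev.time, False
            elif restart is not None and ev.time - restart <= window:
                if ev.kind == "drs_vmotion":
                    moved = True
                elif ev.kind == "ha_restart_failed" and moved:
                    flagged.append(vm)
                    break
    return flagged


t0 = datetime(2024, 1, 1, 10, 0, 0)
sample = [
    Event(t0, "vm-a", "ha_restart"),
    Event(t0 + timedelta(seconds=5), "vm-a", "drs_vmotion"),
    Event(t0 + timedelta(seconds=20), "vm-a", "ha_restart_failed"),
    Event(t0, "vm-b", "ha_restart"),
    Event(t0 + timedelta(seconds=10), "vm-b", "ha_restart_failed"),
]
print(find_conflicts(sample))  # ['vm-a']
```

Only `vm-a` is flagged because its failure follows a migration inside the window; `vm-b` failed without any migration, which points to a different cause.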

Additional contributing factors may include:

- Network congestion during the failover process, slowing down communications between hosts
- Resource constraints on remaining hosts, potentially delaying VM restart operations
- Storage connectivity issues, which can impact the speed of VM restarts and vMotion operations
- Misconfiguration of HA settings or host isolation response, affecting failover behavior

Because of this interaction between HA and DRS, some VMs can remain powered off or in an inconsistent state even when the cluster has sufficient resources for a successful failover.
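Before attributing a failed restart to the HA and DRS interaction, it helps to confirm that the cluster really does have enough headroom. The sketch below is a rough aggregate check, assuming memory figures in GB that you supply yourself; the function and parameter names are hypothetical, and the logic is far simpler than vSphere HA admission control (it ignores CPU, reservations, and fragmentation across hosts).

```python
def survives_host_loss(host_capacity_gb, vm_demand_gb, failures=1):
    """Return True if total VM memory demand fits after losing the largest hosts.

    host_capacity_gb: mapping of host name -> usable memory (GB)
    vm_demand_gb:     mapping of VM name -> memory demand (GB)
    failures:         number of host failures to tolerate (worst case: largest hosts)
    """
    if failures >= len(host_capacity_gb):
        return False  # no hosts would remain
    # Sort ascending and drop the `failures` largest hosts.
    remaining = sorted(host_capacity_gb.values())[: len(host_capacity_gb) - failures]
    return sum(vm_demand_gb.values()) <= sum(remaining)


hosts = {"esx01": 256, "esx02": 256, "esx03": 512}
vms = {"db01": 200, "app01": 150, "web01": 100}
print(survives_host_loss(hosts, vms))  # True: 450 GB demand vs 512 GB remaining
```

If this check fails, resource constraints are the more likely explanation and should be addressed first.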

Resolution

To improve HA reliability and reduce the likelihood of failed VM restarts:

1. Optimize network configuration:
a. Create separate VMkernel ports for management, vMotion, and storage traffic
b. Configure appropriate VLANs for traffic separation
c. Enable jumbo frames (MTU 9000) on vMotion and storage networks only where every device in the path supports them
d. Use multiple NICs for redundancy and load balancing
e. Implement Network I/O Control (NIOC)

2. Review and adjust HA settings:
a. Log in to the vSphere Client
b. Select the cluster in question
c. Go to Configure > Services > vSphere Availability (labeled vSphere HA in some client versions)
d. Review and adjust Admission Control, Datastore Heartbeating, and Host Isolation Response settings

3. Ensure proper storage connectivity:
a. Check all storage paths and ensure redundancy
b. Verify that all hosts have access to shared datastores
c. Review and update storage drivers and firmware if necessary

4. Monitor and address resource contention:
a. Use vSphere performance charts to identify resource bottlenecks
b. Adjust VM resource allocation as needed
c. Consider adding resources to hosts if consistently under pressure

5. Implement a regular health check routine:
a. Monitor host and VM health status regularly
b. Check for any recurring host outages and address root causes
c. Perform periodic failover tests in a controlled environment

6. Optimize DRS settings:
a. Review DRS automation level and rules
b. Ensure VM-Host affinity rules don't conflict with HA failover capabilities

7. Verify VM configuration:
a. Ensure VMware Tools is installed and up-to-date on all VMs
b. Review VM restart priority and any restart delay settings in the HA configuration
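For the jumbo frame step in the network configuration, a common way to confirm that MTU 9000 works end to end is a ping with fragmentation disabled. The largest ICMP payload that fits is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. The helper below is a small sketch that only builds the `vmkping` command string to run on an ESXi host; the interface name and IP address in the example are placeholders.

```python
IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(vmk, target_ip, mtu=9000):
    """Build a vmkping command: -I selects the VMkernel interface,
    -d sets the don't-fragment bit, -s sets the payload size."""
    return f"vmkping -I {vmk} -d -s {max_ping_payload(mtu)} {target_ip}"


# Placeholder interface and address; substitute your own vMotion VMkernel port and peer.
print(vmkping_command("vmk1", "192.0.2.10"))
# vmkping -I vmk1 -d -s 8972 192.0.2.10
```

If the ping fails at 8972 bytes but succeeds at 1472 (the equivalent for MTU 1500), some device in the path is not passing jumbo frames.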

Additional Information

- Best Practices for VMware vSphere® High Availability Clusters

When investigating HA failover issues, analyze the vSphere logs thoroughly, such as fdm.log on the ESXi hosts and vpxd.log on vCenter Server. Pay special attention to the timing of events, any error messages, and the interaction between HA and DRS. In some cases, third-party backup or monitoring solutions may interfere with HA operations, so consider their impact when troubleshooting.
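Because the timing of events matters, it can help to interleave lines from several logs into one chronological view. The sketch below assumes each line starts with an ISO 8601 UTC timestamp followed by a space, and that each log is already in time order; the source labels and messages in the example are made-up placeholders, not real log output.

```python
import heapq
from datetime import datetime


def parse_line(source, line):
    """Split 'TIMESTAMP message' into (datetime, source, message)."""
    stamp, _, message = line.partition(" ")
    # Replace a trailing 'Z' so older Python versions can parse the offset.
    when = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    return when, source, message


def merged_timeline(logs):
    """Merge several time-ordered logs into one chronological list.

    logs: mapping of source label -> list of log lines
    """
    streams = [
        [parse_line(source, line) for line in lines if line.strip()]
        for source, lines in logs.items()
    ]
    # heapq.merge relies on each stream already being sorted by time.
    return list(heapq.merge(*streams))


example = {
    "host-log": [
        "2024-01-01T10:00:00.000Z placeholder: restart attempt for vm-a",
        "2024-01-01T10:00:20.000Z placeholder: restart marked failed for vm-a",
    ],
    "vcenter-log": [
        "2024-01-01T10:00:05.000Z placeholder: migration started for vm-a",
    ],
}
for when, source, message in merged_timeline(example):
    print(when.isoformat(), source, message)
```

Reading the merged view makes it easier to see whether a migration began between a restart attempt and its failure.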

For environments with persistent issues, consider engaging VMware Global Support Services (GSS) for a health check and optimization recommendations tailored to the specific infrastructure.