Assessing commonalities of an outage affecting multiple virtual machines

Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

An outage affecting multiple virtual machines may have a broader scope than is at first apparent, due to a root cause in some aspect of their common infrastructure. Identifying a pattern of affected components is helpful when attempting to narrow down potential causes.

This article provides guidance to determine what infrastructure multiple virtual machines have in common.

Note: A correlation of multiple issues does not always imply causation of one by another, but may instead suggest a common cause. It is also possible that two issues are unrelated and have no common cause. These results are merely a guide, rather than a certain indication.

Symptoms:

Services running in one or more virtual machines are no longer accessible to connected clients.
Applications dependent on services running in one or more virtual machines are reporting errors.
One or more virtual machines are no longer responding to network connections.
One or more virtual machines are no longer responding to user interaction at the console.

Environment

VMware ESXi 3.5.x Embedded
VMware ESX 4.0.x
VMware vCenter Server 5.1.x
VMware ESXi 3.5.x Installable
VMware VirtualCenter 2.5.x
VMware vCenter Server 4.1.x
VMware vCenter Server 5.5.x
VMware vSphere ESXi 5.5
VMware ESX 4.1.x
VMware ESX Server 3.5.x
VMware vSphere ESXi 5.1
VMware ESXi 4.1.x Installable
VMware vCenter Server 4.0.x
VMware ESXi 4.0.x Installable
VMware vCenter Server 5.0.x
VMware ESXi 4.1.x Embedded
VMware ESXi 4.0.x Embedded
VMware vSphere ESXi 5.0

Resolution

Process Overview

Identifying patterns across multiple outages (or other issues affecting multiple virtual machines), and then narrowing down their potential causes, involves several steps. First, identify which virtual machines are affected and which are not. Next, assess which components of your infrastructure (host machines, datastores and storage infrastructure, network infrastructure) are involved and which are not. Start this assessment from the host machines and work outwards, examining each involved component. At each step, you can begin to draw conclusions about potential causes of the outages or issues.

This is a conceptual overview of the process steps, for each component of your infrastructure:

Compile two lists:
- A list of all virtual machines that are experiencing the outages
- A list of virtual machines that are not experiencing the outages
Note: The results of this step are independent of the infrastructure component, so you only need to follow this step one time.
Use the vCenter maps to review the relationships between virtual machines on the two lists and their backing infrastructure.

A vCenter map is a visual representation of the vCenter Server topology. Maps show the relationships between virtual and physical resources available to vCenter Server. This can be used to relate objects to each other. For more information, see the map documentation for your version of vCenter:
- vCenter 5.0: The Using vCenter Maps section of the vCenter Server and Host Management Guide
- vCenter 4.1: The Using vCenter Maps section of the vSphere Datacenter Administration Guide
- vCenter 4.0: The Using vCenter Maps section of the vSphere Basic Administration Guide
- VirtualCenter 2.x: The Resource Maps section of the Basic Administration Guide
To navigate to the Maps tab in vCenter Server:
1. Open the vSphere Client and connect to the vCenter Server.
2. Provide administrator credentials when prompted.
3. Ensure that you are in the Hosts & Clusters view.
4. Select the root of the tree on the left pane (the hostname or domain name of the vCenter Server).
5. Click the Maps tab.
This displays a map, showing the relationships between elements of the virtual infrastructure.
Use the intersection of the two lists from step 1, and the maps from step 2, to identify which components of your infrastructure are fully functional and which must be investigated. This chart shows the idea:

Any components that are used by both virtual machines experiencing the outages and virtual machines not experiencing the outages can be deduced to be functional. Any components which are used only by virtual machines experiencing the outages cannot be determined to be functional, and must be investigated further.
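The comparison described above can be sketched as simple set arithmetic. This is a minimal illustration, not a VMware tool: the function name, the inventory layout, and the VM and component names are all hypothetical, and the inventory is assumed to have been transcribed by hand from the maps.

```python
def classify_components(usage, affected):
    """Split components into deduced-functional and needs-investigation.

    usage: dict mapping VM name -> set of component names it relies on.
    affected: set of VM names experiencing the outage.
    """
    used_by_affected = set()
    used_by_unaffected = set()
    for vm, components in usage.items():
        if vm in affected:
            used_by_affected |= set(components)
        else:
            used_by_unaffected |= set(components)
    # Used by both affected and unaffected VMs: deduced to be functional.
    functional = used_by_affected & used_by_unaffected
    # Used only by affected VMs: cannot be determined to be functional.
    investigate = used_by_affected - used_by_unaffected
    return functional, investigate


# Hypothetical inventory, for illustration only.
usage = {
    "vm-a": {"host1", "datastore1", "portgroup1"},
    "vm-b": {"host1", "datastore2", "portgroup1"},
    "vm-c": {"host2", "datastore2", "portgroup1"},
}
affected = {"vm-a", "vm-b"}

functional, investigate = classify_components(usage, affected)
print("Deduced functional:", sorted(functional))
print("Investigate further:", sorted(investigate))
```

In this example, datastore2 and portgroup1 are also used by the unaffected vm-c, so only host1 and datastore1 remain to be investigated. As with the manual method, the result is a guide rather than proof of a cause.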

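Examine each of the components of unknown functionality in turn. This article includes these sections:
- Common Element: Host Compute Infrastructure
- Common Element: Storage Infrastructure
- Common Element: Network Infrastructure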

Common Element: Host Compute Infrastructure

Note: Host Compute refers to the common elements of one physical host server, where the CPU and RAM are considered together.

Create a vCenter map showing the host tier:
1. Open the Maps tab.
2. Ensure that only the Host to VM option is selected.
3. Click Apply Relationships.
This displays a map showing the relationships between virtual machines and hosts.
Identify the affected and unaffected virtual machines in the map.
Determine whether the affected virtual machines rely on the same hosts.
If multiple virtual machines experiencing the outages are all on the same host, investigate further:

Note: If there are virtual machines on the same host that are not experiencing the outage, then the host is not at fault.
1. If the host with the affected virtual machines is itself unresponsive, the scope is larger than initially assumed. Troubleshoot the unresponsive host instead. For more information, see Determining why a host is labeled as Not Responding and multiple virtual machines are labeled as Disconnected (1019082).
2. Validate whether the problem is specific to a host. Try to migrate the virtual machine to another host that is known to be functional, using vMotion, and observe whether the problem follows the virtual machine. For more information, see Migrating Virtual Machines in the Basic System Administration Guide for your version of ESX/ESXi.
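3. If troubleshooting multiple virtual machine failures, determine whether they happened on the same physical CPU or CPU package. For more information, see Virtual machine and ESX/ESXi host outage pattern analysis across physical CPUs and Determining if virtual machine and ESX host unresponsiveness is caused by hardware issues.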

Multiple virtual machines on the same host may all experience similar symptoms if there is an upstream network or storage issue that only affects the one host, such as a network or SCSI interface connectivity issue. Continue with the Storage and Network Infrastructure sections of this article.
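As a rough aid for the host tier, the Host to VM relationships can be grouped per host and each host labelled by how many of its virtual machines are affected. The sketch below assumes the VM-to-host mapping has been copied from the map by hand; the function name, labels, and sample names are hypothetical.

```python
from collections import defaultdict


def summarize_hosts(vm_to_host, affected):
    """Return {host: "all affected" | "mixed" | "none affected"}."""
    vms_by_host = defaultdict(set)
    for vm, host in vm_to_host.items():
        vms_by_host[host].add(vm)

    summary = {}
    for host, vms in vms_by_host.items():
        hit = vms & affected
        if not hit:
            summary[host] = "none affected"
        elif hit == vms:
            # Every VM on this host is affected: the host, or something
            # upstream that only this host uses, needs investigation.
            summary[host] = "all affected"
        else:
            # Some VMs on this host are healthy, so the host as a whole
            # is unlikely to be the common cause.
            summary[host] = "mixed"
    return summary


# Hypothetical placement, for illustration only.
vm_to_host = {
    "vm-a": "host1",
    "vm-b": "host1",
    "vm-c": "host2",
    "vm-d": "host2",
    "vm-e": "host3",
}
affected = {"vm-a", "vm-b", "vm-c"}

summary = summarize_hosts(vm_to_host, affected)
for host in sorted(summary):
    print(host, "->", summary[host])
```

A host labelled "all affected" is a candidate for the investigation steps in this section. A "mixed" host may still warrant the physical CPU check if the affected virtual machines were scheduled on the same CPU package.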

Common Element: Storage Infrastructure

Create a vCenter map showing the storage tier:
1. Open the Maps tab.
2. Ensure that only the VM to Datastore option is checked.
3. Click Apply Relationships.
This displays a map showing the relationships between the virtual machines and the datastores.
Identify the affected and unaffected virtual machines in the map.
Determine whether the affected virtual machines rely on the same datastores. If the affected virtual machines rely on multiple datastores, determine whether those datastores use a common storage fabric, array, or spindles.
If multiple virtual machines experiencing similar symptoms are all on the same datastore(s), investigate further:

Note: If there are virtual machines on the same datastore(s) that are not experiencing the outage, then the common datastore is not at fault.
1. If the virtual machines are all on the same datastore and host, and that datastore is shared among multiple hosts, examine the connectivity from that host to the shared storage infrastructure first. The connectivity itself (for example, Fibre Channel HBAs, iSCSI initiators, or NFS network interfaces) could be at fault.
2. Determine whether the affected virtual machines are all on the same storage array, storage group/pool, or spindles.
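3. Investigate and troubleshoot storage infrastructure issues. For more information, see Identifying Fibre Channel, iSCSI, and NFS storage issues on ESX/ESXi hosts, Verifying that ESX/ESXi virtual machine storage is accessible, and Troubleshooting VMFS-3 datastore issues.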
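When the affected virtual machines are spread over several datastores, it can help to roll the datastores up to the array behind them and look for overlap at that level. The following is a minimal sketch under stated assumptions: the datastore-to-array mapping is not shown in the vCenter map and is assumed to come from your storage documentation, and all names are hypothetical.

```python
def common_backing(vm_to_datastores, datastore_to_array, affected):
    """Return (datastores, arrays) that every affected VM relies on."""
    ds_sets = [set(vm_to_datastores[vm]) for vm in affected]
    # Translate each VM's datastores into the arrays that back them.
    array_sets = [{datastore_to_array[ds] for ds in ds_set} for ds_set in ds_sets]
    common_ds = set.intersection(*ds_sets) if ds_sets else set()
    common_arrays = set.intersection(*array_sets) if array_sets else set()
    return common_ds, common_arrays


# Hypothetical layout, for illustration only.
vm_to_datastores = {
    "vm-a": {"datastore1"},
    "vm-b": {"datastore2", "datastore3"},
    "vm-c": {"datastore3"},
}
datastore_to_array = {
    "datastore1": "array-A",
    "datastore2": "array-A",
    "datastore3": "array-B",
}
affected = {"vm-a", "vm-b"}

common_ds, common_arrays = common_backing(
    vm_to_datastores, datastore_to_array, affected
)
print("Common datastores:", sorted(common_ds))
print("Common arrays:", sorted(common_arrays))
```

Here the two affected virtual machines share no datastore, but both depend on array-A, which makes that array the next thing to examine. The same roll-up could be repeated for storage groups, pools, or spindles if that mapping is known.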

Common Element: Network Infrastructure

Consider whether the affected virtual machines utilize a common network infrastructure, and whether there are any unaffected virtual machines using the same network infrastructure:

Create a vCenter map showing the network tier:
1. Open the Maps tab.
2. Ensure that only the VM to Network option is checked.
3. Click Apply Relationships.
This displays a map showing the relationships between the virtual machines and the network port groups.
Identify the affected and unaffected virtual machines in the map.
Determine whether the affected virtual machines rely on the same network port groups.
If multiple virtual machines experiencing similar symptoms are utilizing the same port groups, investigate further:

Note: If there are virtual machines on the same network port groups that are not experiencing the outage, then those port groups are not at fault.
1. Consider whether the affected virtual machines all use the same upstream physical network connection, and whether there are any unaffected virtual machines, hosts or other physical servers which use the same network link.
2. Investigate and troubleshoot network infrastructure issues. For more information, see Troubleshooting virtual machine network connection issues (1003893).
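Working outwards from the port group to the physical network can be sketched as a layer-by-layer check: at each layer, ask whether all affected virtual machines converge on a single element that no unaffected virtual machine uses. The layer names, function name, and sample paths below are hypothetical, and the uplink and physical switch information is assumed to be gathered separately, because the VM to Network map shows only port groups.

```python
# Hypothetical layer names, ordered from the VM outwards.
LAYERS = ("port_group", "vswitch", "uplink", "physical_switch")


def first_common_layer(paths, affected):
    """Find the innermost layer where all affected VMs share one element
    that no unaffected VM uses. Returns (layer, element) or None.

    paths: dict mapping VM name -> dict mapping layer -> element name.
    """
    for layer in LAYERS:
        used_by_affected = {paths[vm][layer] for vm in affected}
        used_by_healthy = {
            path[layer] for vm, path in paths.items() if vm not in affected
        }
        if len(used_by_affected) == 1 and not (used_by_affected & used_by_healthy):
            return layer, next(iter(used_by_affected))
    return None


# Hypothetical network paths, for illustration only.
paths = {
    "vm-a": {"port_group": "PG-Web", "vswitch": "host1/vSwitch0",
             "uplink": "host1/vmnic0", "physical_switch": "sw1"},
    "vm-b": {"port_group": "PG-App", "vswitch": "host2/vSwitch0",
             "uplink": "host2/vmnic1", "physical_switch": "sw1"},
    "vm-c": {"port_group": "PG-Web", "vswitch": "host3/vSwitch0",
             "uplink": "host3/vmnic0", "physical_switch": "sw2"},
}
affected = {"vm-a", "vm-b"}

print(first_common_layer(paths, affected))
```

In this example the affected virtual machines use different port groups, virtual switches, and uplinks, and first converge on the physical switch sw1, which the unaffected vm-c does not use. A result of None means no single element explains the pattern at any layer.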

Additional Information

If you have gone through all of the steps and cannot identify a common shared resource between all of the virtual machines, troubleshoot each virtual machine independently. For more information, see Troubleshooting a virtual machine that has stopped responding (1007819).

Determining if virtual machine and ESX host unresponsiveness is caused by hardware issues
Identifying Fibre Channel, iSCSI, and NFS storage issues on ESX/ESXi hosts
Verifying that ESX/ESXi virtual machine storage is accessible
Troubleshooting virtual machine network connection issues
Troubleshooting VMFS-3 datastore issues
Troubleshooting a virtual machine that has stopped responding
Determining why a single virtual machine is inaccessible on an ESX/ESXi host or vCenter Server system
ESX/ESXi hosts do not respond and is grayed out
Virtual machine and ESX/ESXi host outage pattern analysis across physical CPUs