Virtual machine and ESX/ESXi host outage pattern analysis across physical CPUs

Products

VMware vSphere ESXi

Issue/Introduction

When troubleshooting outages affecting multiple virtual machines or ESX/ESXi hosts, sometimes no explicit cause of the failure can be identified. In these cases, the pattern of outages may suggest a hardware fault as the root cause:

An ESX/ESXi VMkernel failure manifests as a purple diagnostic screen. For more information, see Interpreting an ESX/ESXi host purple diagnostic screen (1004250).
A virtual machine failure manifests as a World Panic or VMM Fault. For more information, see Interpreting virtual machine monitor and executable failures (1019471).

This article provides guidance for reviewing a series of ESX/ESXi host VMkernel and virtual machine failures, and the physical CPUs they are associated with. A physical CPU is only one component that may be in common; for others see Correlation during an outage affecting multiple virtual machines (1019000).

Resolution

If data points are available for multiple ESX/ESXi host VMkernel or virtual machine failures, compare the outages to determine whether a physical CPU is common to them. More data points reduce the chance that coincidence skews the analysis. It is difficult to draw a conclusion or make a recommendation from only one to three data points; the pattern becomes more reliable as data accumulates.

Example: A host has failed with a purple diagnostic screen twice, and three virtual machines have failed with internal monitor errors on the same host in the past. None of the failures resembles a known issue or any of the other failures.

It is important to determine whether there is any relationship between these five outages.
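To illustrate why more data points make the pattern more reliable, the following sketch estimates how likely it is that a number of failures all land on the same physical CPU by coincidence. It assumes, purely for illustration, that each failure is equally likely to be reported on any CPU; real workloads are not scheduled uniformly, so treat the numbers as a rough guide only. The function name and the 16-CPU host are hypothetical.

```python
from fractions import Fraction


def chance_all_same_cpu(num_cpus: int, num_failures: int) -> Fraction:
    """Probability that every failure is reported on one CPU purely by chance.

    Assumption (not from the article): each failure is independently and
    equally likely to be reported on any of the host's physical CPUs.
    """
    if num_cpus < 1 or num_failures < 1:
        raise ValueError("num_cpus and num_failures must both be at least 1")
    # The first failure may land on any CPU; each later one must match it.
    return Fraction(1, num_cpus) ** (num_failures - 1)


# Hypothetical 16-CPU host with 2, 3, and 5 recorded failures.
for failures in (2, 3, 5):
    p = chance_all_same_cpu(16, failures)
    print(f"{failures} failures on the same CPU by chance: {float(p):.6f}")
```

Under this assumption, two matching failures on a 16-CPU host happen by chance about 6% of the time, whereas five matching failures are very unlikely to be a coincidence.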

This article includes these sections:

VMkernel Failures: Purple Diagnostic Screen
Virtual Machine Monitor Failures
Userworld Process Failures

VMkernel Failures: Purple Diagnostic Screen

When the ESX/ESXi VMkernel fails with a purple diagnostic screen, it is usually apparent which CPU initiated or caused the failure from either the purple diagnostic screen itself or the log extracted from the zdump afterward:

In ESX/ESXi 4.0 and higher, the physical CPU number which reported the failure is prefixed by an asterisk. For example, *2.
In ESX/ESXi 3.x, the physical CPU number which reported the failure has its name capitalized. For example, CPU2.
In serial line logs or logs extracted from the zdump, the physical CPU number prefixes each log line. For example, cpu2: <world message>.

Create a list of all VMkernel failures experienced, both on this host and on other hosts in the cluster. Consider these points:

Does the stack trace of each failure appear similar? If so, this suggests that the failure is related to what the VMkernel or other component was doing at the time, rather than to the physical hardware. This is especially true if similar failures are seen on multiple hosts. In this case, troubleshoot the failures independently. For more information, see Interpreting an ESX/ESXi host purple diagnostic screen (1004250).
Does the physical CPU which encountered the fault match between failures? Does the physical CPU which encountered the fault reside within the same CPU package for each failure? If either of these is true, it suggests that the failure is related to the physical CPU or motherboard socket.
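The comparison of physical CPUs and CPU packages across failures can be sketched as a simple tally. The failure records, host names, and the mapping from CPU number to package below are all hypothetical; in particular, the sketch assumes four contiguously numbered CPUs per package, which must be confirmed against the actual CPU topology of each host.

```python
from collections import Counter

# Hypothetical failure records: (host name, physical CPU number reported).
failures = [
    ("esx01", 2),
    ("esx01", 3),
    ("esx01", 2),
    ("esx02", 9),
    ("esx01", 1),
]

# Assumption: 4 CPUs per package, numbered contiguously (0-3, 4-7, ...).
# Verify this against the real topology of the host before relying on it.
CPUS_PER_PACKAGE = 4


def package_of(cpu: int, cpus_per_package: int = CPUS_PER_PACKAGE) -> int:
    """Map a physical CPU number to its package under the assumption above."""
    return cpu // cpus_per_package


def tally(records):
    """Count failures per (host, CPU) and per (host, package)."""
    by_cpu = Counter((host, cpu) for host, cpu in records)
    by_package = Counter((host, package_of(cpu)) for host, cpu in records)
    return by_cpu, by_package


by_cpu, by_package = tally(failures)
for (host, package), count in by_package.most_common():
    print(f"{host} package {package}: {count} failure(s)")
```

A host and package that accumulate most of the failures, as esx01 package 0 does in this made-up data, is a candidate for hardware investigation.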

Virtual Machine Monitor Failures

When a virtual machine fails on a CPU exception, the VMkernel logs the fault type and the associated physical CPU. The log entry looks like this:

cpu2: <world>)WARNING: World: vm <nnnn>: <mmmm>: vmm0:<VirtualMachineName>:vcpu-<n>:VMM64 fault 6

Review the VMkernel logs. Create a list of all virtual machine monitor failures, on all hosts in the cluster. Consider these points:

Was the same virtual machine involved in multiple failures? Do the events leading up to the failure match? Does the virtual machine workload match between failures? Does the virtual machine configuration or original template match between failures? If any of these is true, it suggests that the failure is related to the virtual machine configuration or workload. For more information, see Interpreting virtual machine monitor and executable failures (1019471).
Does the physical CPU which encountered the fault match between failures? Does the physical CPU which encountered the fault reside within the same CPU package for each failure? If either of these is true, it suggests that the failure is related to the physical CPU or motherboard socket.
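Building the list of virtual machine monitor failures can be partly automated. The following sketch extracts the physical CPU, virtual machine name, and fault number from log lines that follow the entry format shown above. The regular expression is derived only from that template and may need adjusting for real logs; the sample lines, world IDs, and virtual machine names are invented for illustration.

```python
import re
from collections import defaultdict

# Pattern derived from the log entry template in this article (an assumption;
# real VMkernel logs may carry extra prefixes such as timestamps).
VMM_FAULT = re.compile(
    r"cpu(?P<cpu>\d+):\s*(?P<world>\S+?)\)WARNING: World: vm \S+: \S+: "
    r"vmm\d+:(?P<vm>.+?):vcpu-(?P<vcpu>\d+):VMM\d+ fault (?P<fault>\d+)"
)

# Hypothetical sample lines built by filling in the template placeholders.
sample_log = """\
cpu2: 4120)WARNING: World: vm 4120: 1234: vmm0:web01:vcpu-0:VMM64 fault 6
cpu5: 4300)Some unrelated VMkernel message
cpu2: 5011)WARNING: World: vm 5011: 1234: vmm0:db02:vcpu-1:VMM64 fault 14
"""


def vmm_faults(text):
    """Yield one record per VMM fault line found in the log text."""
    for line in text.splitlines():
        match = VMM_FAULT.search(line)
        if match:
            yield {
                "cpu": int(match["cpu"]),
                "vm": match["vm"],
                "fault": int(match["fault"]),
            }


# Group the affected virtual machines by the physical CPU that reported them.
vms_by_cpu = defaultdict(set)
for record in vmm_faults(sample_log):
    vms_by_cpu[record["cpu"]].add(record["vm"])

for cpu, vms in sorted(vms_by_cpu.items()):
    print(f"cpu{cpu}: faults in {sorted(vms)}")
```

Different virtual machines faulting on the same physical CPU points away from the virtual machine configuration and toward the hardware, whereas one virtual machine faulting across many CPUs points the other way.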

Userworld Process Failures

When a userworld process, such as a management service, fails on an ESX/ESXi host, the VMkernel logs the associated physical CPU. The log entry looks like this:

cpu2: <world>)UserDump: <nnnn>: Dumping cartel <nnnnnn> (from world <world>) to file <filepath>/zdump

Review the VMkernel logs. Create a list of all userworld process failures, on all hosts in the cluster. Consider these points:

Was the same userworld process involved in multiple failures? Did the userworld log similar events leading up to each failure? If either of these is true, it suggests that the failure is related to the userworld or host configuration, or other environmental condition. Search the knowledge base for specific symptoms or contact VMware Support.
Does the physical CPU which encountered the fault match between failures? Does the physical CPU which encountered the fault reside within the same CPU package for each failure? If either of these is true, it suggests that the failure is related to the physical CPU or motherboard socket.
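The same approach applies to userworld failures collected from several hosts in the cluster. The following sketch counts UserDump entries per host and physical CPU. The regular expression follows the log entry template shown above and is an assumption about real log contents; the host names, IDs, and dump file paths are placeholders invented for illustration.

```python
import re
from collections import Counter

# Pattern derived from the UserDump template in this article (an assumption).
USERDUMP = re.compile(
    r"cpu(?P<cpu>\d+):\s*\S+?\)UserDump: \S+: Dumping cartel (?P<cartel>\S+) "
    r"\(from world (?P<world>\S+)\) to file (?P<path>\S+)"
)

# Hypothetical VMkernel log excerpts keyed by host name.
logs_by_host = {
    "esx01": (
        "cpu2: 9001)UserDump: 1587: Dumping cartel 9000 (from world 9001) "
        "to file /example/core/zdump\n"
        "cpu2: 9420)UserDump: 1587: Dumping cartel 9419 (from world 9420) "
        "to file /example/core/zdump\n"
    ),
    "esx02": (
        "cpu7: 8100)UserDump: 1587: Dumping cartel 8099 (from world 8100) "
        "to file /example/core/zdump\n"
    ),
}


def userworld_dumps(logs):
    """Yield (host, cpu, cartel, path) for every UserDump line in the logs."""
    for host, text in logs.items():
        for line in text.splitlines():
            match = USERDUMP.search(line)
            if match:
                yield host, int(match["cpu"]), match["cartel"], match["path"]


# Count userworld dumps per (host, physical CPU), most frequent first.
dumps_per_cpu = Counter(
    (host, cpu) for host, cpu, _cartel, _path in userworld_dumps(logs_by_host)
)
for (host, cpu), count in dumps_per_cpu.most_common():
    print(f"{host} cpu{cpu}: {count} userworld dump(s)")
```

The dump file path captured for each entry can then be used to locate the corresponding core file when reviewing whether the same userworld process was involved.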

Additional Information

Interpreting an ESX/ESXi host purple diagnostic screen
Assessing commonalities of an outage affecting multiple virtual machines
Interpreting virtual machine monitor and executable failures