When troubleshooting outages affecting multiple virtual machines or ESX/ESXi hosts, sometimes no explicit cause of the failure can be identified. In these cases, the pattern of outages may suggest a hardware fault as the root cause:
This article provides guidance for reviewing a series of ESX/ESXi host VMkernel and virtual machine failures, and the physical CPUs they are associated with. A physical CPU is only one component that may be in common; for others see Correlation during an outage affecting multiple virtual machines (1019000).
If there are data points covering multiple ESX/ESXi host VMkernel or virtual machine failures, compare and contrast the outages to determine whether the physical CPUs might be a common thread between outages. More data points will reduce the chance of randomness affecting analysis. It is difficult to make any conclusion or recommendation with only 1 - 3 data points. The pattern becomes stronger and more reliable with more data.
Example: A host has failed with a purple diagnostic screen twice, and three virtual machines failed with internal monitor errors on the same host in the past. None of the failures resembles a known issue, or each other.
It is important to determine whether there is any relationship between these five outages.
This article includes these sections:
When the ESX/ESXi VMkernel fails with a purple diagnostic screen, it is usually apparent which CPU initiated or caused the failure from either the purple diagnostic screen itself or the log extracted from the zdump afterward:
*2
. CPU2
. cpu2: <world message>
.Create a list of all VMkernel failures experienced, both on this host and on other hosts in the cluster. Consider these points:
When a virtual machine fails on a CPU exception, the VMkernel logs the fault type and physical CPU associated. The log entry looks like this:
cpu2: <world>)WARNING: World: vm <nnnn>: <mmmm>: vmm0:<VirtualMachineName>:vcpu-<n>:VMM64 fault 6
Review the VMkernel logs. Create a list of all virtual machine monitor failures, on all hosts in the cluster. Consider these points:
When a userworld process, such as a management service, fails on an ESX/ESXi host, the VMkernel logs the physical CPU associated. The log entry looks like this:
cpu2: <world>)UserDump: <nnnn>: Dumping cartel <nnnnnn> (from world <world>) to file <filepath>>/zdump
Review the VMkernel logs. Create a list of all userworld process failures, on all hosts in the cluster. Consider these points: