Identifying critical Guest OS failures within virtual machines

Article ID: 315245

Products

VMware vSphere ESXi

Issue/Introduction

This article provides guidance for identifying critical kernel-level faults within common virtual machine guest operating systems on VMware ESX/ESXi. This article does not attempt to determine the root cause of the guest operating system issue. For assistance with detailed analysis of guest operating system issues, engage the guest operating system vendor.

Note: There is a difference between a failure in the guest operating system and the virtual machine. In the case of a guest operating system failure, the virtual machine's virtual hardware continues running, virtualizing the failed guest operating system. If the virtual machine itself has failed, see Interpreting virtual machine monitor and executable failures (1019471) or Determining why a virtual machine was powered off or restarted (1019064).

When both virtual machine failures and guest operating system failures are present in the same environment, it may be an indication of an underlying hardware issue. Consider both when analyzing a pattern of outages.


Symptoms:
  • A virtual machine's guest operating system hosted on VMware ESX/ESXi halts with a critical error reported on the console.
  • You may see errors similar to:
    • BAD_POOL_HEADER
    • KMODE_EXCEPTION_NOT_HANDLED
    • PAGE_FAULT_IN_NONPAGED_AREA
    • STOP: 0x00000050 (0xFFFFFFF8,0x00000000,0xF9CF5C88,0x00000000)
    • STOP: 0x00000019 (0x00000000,0xC00E0FF0,0xFFFFEFD4,0xC0000000)
    • Hardware malfunction Call your hardware vendor for support The system has halted.


Environment

VMware vSphere ESXi 6.x
VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Resolution

Different guest operating systems represent critical kernel-level faults using different terminology and error messages. The operating system will usually halt all remote network connectivity and report a message full-screen on the console. An error message or coredump may also be recorded to disk. However, the guest operating system may also automatically restart the virtual machine, instead of or in addition to recording any diagnostic information.

Make note of the date and time when the issue occurred, both from the perspective of the ESX/ESXi host and from the guest operating system. This may be key in correlating log messages between the two systems. Account for any time offset between the guest OS clock and the host clock, including time zone differences.
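
The correlation can be sketched in Python. This minimal example assumes the host logs are recorded in UTC, that the guest reports local time with a known UTC offset, and that any measured clock drift is supplied by the caller. The function name and the sample values are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def guest_time_to_host_utc(guest_local, guest_utc_offset_hours,
                           guest_clock_drift_seconds=0.0):
    """Convert a guest-reported local timestamp to host-comparable UTC.

    guest_clock_drift_seconds is how far the guest clock runs ahead of the
    host clock (negative if it runs behind); it is subtracted out.
    """
    guest_tz = timezone(timedelta(hours=guest_utc_offset_hours))
    aware = guest_local.replace(tzinfo=guest_tz)
    corrected = aware - timedelta(seconds=guest_clock_drift_seconds)
    return corrected.astimezone(timezone.utc)

# Hypothetical example: guest in UTC-5 whose clock runs 90 seconds fast.
crash_seen_in_guest = datetime(2024, 3, 1, 9, 31, 30)
host_utc = guest_time_to_host_utc(crash_seen_in_guest, -5, 90)
print(host_utc.isoformat())
```

The resulting UTC timestamp is the point to look for in the host-side logs.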

Capture of data from the console

Take a screenshot of the virtual machine console to preserve all error messages and values reported. Make note of any STOP or Error codes, error messages, driver names, or other pertinent information displayed on the screen.
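
When the console text has been transcribed from a screenshot, a small script can help record the STOP code and its parameters consistently. The sketch below is illustrative only: the function name is hypothetical, the pattern assumes the "STOP: code (parameters)" layout shown in the Symptoms list, and the name table covers only the three bug checks named in this article (with 0x1E taken from Microsoft's bug check reference).

```python
import re

# Bug check names for the errors listed under Symptoms in this article.
KNOWN_BUG_CHECKS = {
    0x19: "BAD_POOL_HEADER",
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
}

STOP_PATTERN = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")

def parse_stop_line(text):
    """Return (code, name, parameters) for a STOP line, or None if absent."""
    match = STOP_PATTERN.search(text)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()]
    # Codes outside the small table above are reported as UNKNOWN.
    return code, KNOWN_BUG_CHECKS.get(code, "UNKNOWN"), params

result = parse_stop_line(
    "STOP: 0x00000050 (0xFFFFFFF8,0x00000000,0xF9CF5C88,0x00000000)"
)
print(result)
```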

Search the guest operating system vendor's documentation and support information for these error codes, messages, and driver names.

  • Windows reports faults using a Blue Screen. For more information, see the Microsoft TechNet article Demystifying the 'Blue Screen of Death'.
  • Linux reports faults using a Kernel Panic or Oops.

Capture of data by the guest operating system

Various operating systems may automatically capture information about the failure and save it for further analysis. This behavior may not be enabled by default, or may have specific requirements. See the operating system documentation for further information.

Native Windows Dump Saving Method:

  • Save a minidump or complete memory dump to a paging file on disk. During the next startup, the contents are copied to a file in the %SystemRoot% directory. For more information, see the Microsoft Knowledge Base article 254649.
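
On Windows, the dump and restart behavior is controlled by values under the registry key HKLM\SYSTEM\CurrentControlSet\Control\CrashControl. The following is a minimal sketch that interprets values already read from that key (for example, copied from a registry export). The function name and the dictionary input are hypothetical, and the value meanings should be confirmed against Microsoft's documentation for the Windows version in use.

```python
# Meanings of the CrashDumpEnabled registry value, per Microsoft documentation.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
    7: "automatic memory dump",
}

def describe_crash_control(values):
    """Summarize CrashControl values supplied as a name -> integer dict."""
    dump_type = DUMP_TYPES.get(values.get("CrashDumpEnabled"), "unknown")
    # AutoReboot = 1 means Windows restarts automatically after the fault.
    auto_reboot = values.get("AutoReboot", 0) == 1
    return {"dump_type": dump_type, "auto_reboot": auto_reboot}

summary = describe_crash_control({"CrashDumpEnabled": 3, "AutoReboot": 1})
print(summary)
```

If auto_reboot is reported as true, the console message may disappear before a screenshot can be taken, so the on-disk dump becomes the main source of information.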

Native Linux Dump Saving Methods:

  • Network-based dump collection using the Netdump protocol.
  • Disk-based dump collection using a pre-configured dump partition. For more information, see the LWN article on Diskdump.

Note: If the guest operating system is not or cannot be configured to save information about the failure automatically, use the virtual machine suspend states to capture the information.

Capture of data using virtual machine suspend states

When an operating system is running as a guest within a VMware virtual machine, additional options for collecting detailed information become available. The virtual machine can be suspended using the client or command line utilities, which writes (checkpoints) all memory used by the virtual machine's guest operating system to a file on disk. This suspended memory image can be converted to a coredump file using the vmss2core Tool. For more information, see the vmss2core tool.

Compared to other debugging methods, a major advantage of the vmss2core Tool is that it requires no modification to the virtual machine's guest operating system: no additional software needs to be installed, and no configuration needs to be changed from the defaults.

  1. Suspend the virtual machine, then copy the snapshot (.vmsn), suspend (.vmss), or non-monolithic memory (.vmem) files to a Workstation 7.1 or Fusion 3.1 host for conversion and debugging. For more information, see Suspending a virtual machine on ESX/ESXi to collect diagnostic information (2005831).
  2. Resume the virtual machine from the suspended state, and then restart it.
  3. Use the vmss2core Tool to convert the memory checkpoint to a coredump file. Typically, the tool will be used with OS-specific options.
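
The file-gathering and conversion steps above can be sketched as follows. This is an illustrative helper, not part of the vmss2core Tool: the function names are hypothetical, and the OS-specific option is passed in by the caller rather than hard-coded, because the available options depend on the vmss2core build and should be checked against its usage text.

```python
from pathlib import Path

# File extensions named in step 1: suspend, snapshot, and memory files.
CHECKPOINT_SUFFIXES = {".vmss", ".vmsn", ".vmem"}

def find_checkpoint_files(vm_dir):
    """Return checkpoint-related files in a copied VM directory."""
    return sorted(
        p for p in Path(vm_dir).iterdir()
        if p.is_file() and p.suffix.lower() in CHECKPOINT_SUFFIXES
    )

def build_vmss2core_command(tool_path, os_option, state_file, memory_file=None):
    """Build an argument list for vmss2core without running it.

    os_option is the OS-specific option (assumed here to be a single flag
    such as "-W" for a Windows guest); verify it against the tool's help.
    """
    command = [str(tool_path), os_option, str(state_file)]
    if memory_file is not None:
        command.append(str(memory_file))
    return command
```

The returned list can be passed to subprocess.run(command, check=True) on the machine where the tool is installed.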

Note: The guest operating system can be configured to automatically restart after a kernel-level fault, and the VMware HA Virtual Machine Monitoring feature can automatically restart a virtual machine that has become unresponsive. In both cases, the virtual machine cannot be suspended at the time of the failure. Either use the guest operating system's native support for information capture or disable the automatic restart. For more information, see Determining if a High Availability Virtual Machine Monitoring event caused a virtual machine to reboot.

Logs preceding the outage from within the guest operating system

Collect all logging emitted by the guest operating system kernel and drivers leading up to the outage. Collect and review this information after restarting the virtual machine. Make note of any STOP or Error codes, error messages, drivers or other pertinent information displayed in the logs.

Search the guest operating system vendor's documentation and support information for these error codes, messages, and driver names.

If the failure can be reproduced, it may be helpful to configure the guest operating system to log additional information via a serial communications port. VMware virtual machine serial ports can be configured to append all outbound data to an on-disk log file, preserving it in the case of a guest operating system failure. For more information, see the Parallel and Serial Port Configuration section of the Virtual Machine Administration Guide for your version of vSphere.
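
Once a serial log file has been captured from a Linux guest, it can be scanned for the kernel's fatal-error markers. The sketch below is a minimal example: the function name and the sample lines are hypothetical, and the marker list is a small assumed set that may need extending for a given kernel.

```python
import re

# Marker strings the Linux kernel commonly prints on a fatal error.
PANIC_MARKERS = re.compile(r"Kernel panic - not syncing|Oops:|BUG:")

def find_panic_lines(log_lines, context=2):
    """Return (line_number, text) for each marker line plus preceding context."""
    lines = list(log_lines)
    wanted = set()
    for index, line in enumerate(lines):
        if PANIC_MARKERS.search(line):
            # Keep the marker line and up to `context` lines before it.
            wanted.update(range(max(0, index - context), index + 1))
    return [(i + 1, lines[i].rstrip("\n")) for i in sorted(wanted)]

# Hypothetical serial log content, for illustration only.
sample = [
    "eth0: link up\n",
    "EXT4-fs error on device sda1\n",
    "sda: I/O error\n",
    "Kernel panic - not syncing: Fatal exception\n",
]
for number, text in find_panic_lines(sample):
    print(number, text)
```

With a real capture, pass an open file object instead of the sample list.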

Logs preceding the outage from the virtual machine and host

Collect all logging emitted by the virtual machine and ESX/ESXi host leading up to the outage. Make note of any error codes, messages or other pertinent information that occurred at the same time as the outage, and did not occur during other times.
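
To narrow the virtual machine log to the period around the outage, a simple time-window filter can be used. The sketch below assumes each log line begins with a UTC timestamp in the form 2024-03-01T14:29:58.123Z followed by a "|" separator. This format is an assumption and may differ between ESX/ESXi versions. The function name and the sample lines are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def lines_near_outage(log_lines, outage_utc, window_minutes=5):
    """Yield log lines whose leading timestamp is within the window of the outage."""
    window = timedelta(minutes=window_minutes)
    for line in log_lines:
        stamp = line.split("|", 1)[0].strip()
        try:
            when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
        except ValueError:
            continue  # lines without a leading timestamp are skipped
        when = when.replace(tzinfo=timezone.utc)
        if abs(when - outage_utc) <= window:
            yield line.rstrip("\n")

# Hypothetical log lines, for illustration only.
sample_log = [
    "2024-03-01T13:00:00.000Z| vmx| routine message\n",
    "2024-03-01T14:29:58.123Z| vcpu-0| message close to the outage\n",
    "continuation line without a timestamp\n",
]
outage = datetime(2024, 3, 1, 14, 30, tzinfo=timezone.utc)
nearby = list(lines_near_outage(sample_log, outage))
print(nearby)
```

Comparing the filtered lines against a window taken from a period without outages helps identify messages that appear only at the time of the failure.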

For more information, see Collecting diagnostic information for VMware products (1008524) and Location of log files for VMware products (1021806).

New Software Considerations

Consider any recent changes to the guest operating system, such as software or driver installation, that may be related to the outages. If the operating system is stable once such software is removed, engage the vendor of the new or changed software.

Note: For virtual machines that have been converted from physical using P2V or Converter, remove any legacy hardware drivers or monitoring software. These drivers are not required and may have adverse effects.

Guest operating system kernel coredump analysis

Analysis of the coredump produced natively by the guest operating system, or converted using vmss2core, is best performed by the guest operating system or driver vendor. For information on performing your own analysis, consult the guest operating system vendor's documentation.

If further assistance is required, file a Support Request with VMware Support and the operating system vendor. Include the information listed in the Additional Information section. For more information on filing a Support Request, see Filing Support Requests in Customer Connect and via Cloud Services Portal.

 

Additional Information

Note: If you require the assistance of a VMware Technical support engineer:

  1. Gather the VMware Support Script Data. For more information, see Collecting diagnostic information for VMware products (1008524).
  2. Gather kernel and system logs from within the guest operating system leading up to the outage.
  3. Gather the coredump from within the guest operating system, from the output of vmss2core, or gather the checkpoint memory state files. 
  4. File a support request with Broadcom Support.

Determining why a virtual machine was powered off or restarted
Interpreting virtual machine monitor and executable failures
Determining if a High Availability Virtual Machine Monitoring event caused a virtual machine to reboot
Suspending a virtual machine on ESX/ESXi to collect diagnostic information