Verifying your hardware is functioning correctly

Article ID: 306441

Products

VMware vSphere ESXi

Issue/Introduction

Faulty hardware can cause ESXi hosts to fail.

You may experience one or more of the following symptoms:

  • Unable to install on certified hardware
  • Unable to upgrade on certified hardware
  • Receiving MCE errors in your log files
  • Failed to load vmkernel: 0xbad0013
  • ESXi host failed but is now back up
  • ESXi host fails repeatedly
  • ESXi host stops responding
  • vSAN unable to write to cache
  • vSAN experiencing disk latency
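
For the MCE symptom above, a minimal sketch of how a log file copied from the host might be scanned for machine check messages. The search pattern is an assumption: the exact wording of MCE messages varies by ESXi version and hardware, and `find_mce_lines` and `scan_log` are hypothetical helper names, not part of any VMware tool.

```python
import re
from typing import Iterable, List

# Assumed pattern: "MCE" as a whole word, or the phrase "machine check".
# Adjust it to match the messages your ESXi version actually logs.
MCE_PATTERN = re.compile(r"\bMCE\b|machine[ -]check", re.IGNORECASE)


def find_mce_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that appear to mention a machine check exception."""
    return [line.rstrip("\n") for line in lines if MCE_PATTERN.search(line)]


def scan_log(path: str) -> List[str]:
    """Scan one log file, replacing any bytes that cannot be decoded."""
    with open(path, "r", errors="replace") as handle:
        return find_mce_lines(handle)
```

Calling `scan_log` on a local copy of a host log returns the matching lines, which can then be passed to your hardware vendor or decoded as described under Additional Information.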



Environment

VMware vSphere ESXi

Resolution

Run Hardware Diagnostic tests

Most servers ship with a hardware diagnostics CD. Some vendors instead install the diagnostics on a hidden utility partition on the server's hard drive.
 
Note: If you are not experienced with computers or have any concerns, please contact your hardware vendor.
 
You can diagnose hardware-related problems on your server by booting from the diagnostic CD or by choosing Diagnostics from the boot device list.
 
These diagnostic tools allow you to:
  • Check the hardware configuration and verify that it is functioning correctly.
  • Test individual hardware components.
  • Diagnose hardware-related problems.
  • Obtain a complete hardware configuration.

If a test detects a component failure, note any error codes and contact the hardware vendor.

Note: The diagnostics can detect a hardware fault only if it occurs during the test. Intermittent faults may therefore require running the tests for an extended period.
 

Check your memory

Note: This process requires downtime on your ESX/ESXi host for up to 48 hours. In most cases, the hardware vendor's diagnostic utility described above is sufficient for testing your hardware. Broadcom does not endorse or recommend any particular third-party utility, but third-party options for testing memory are available.

To test your memory:

  1. Download memtest86+ from http://www.memtest.org/ .
  2. Extract the ISO image from the .gz or .zip archive.
  3. Burn the image to CD.
  4. Boot your ESX/ESXi host from the CD.
  5. Let memtest86+ run. It goes through each memory bank and checks for errors. Run the tool for several hours, at least until it starts pass 2, to ensure that the full suite of tests has been executed.

    Note: If memtest86+ does not run on your hardware, contact your vendor for their memory test utility.
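
Step 2 above can be sketched in Python as follows. This is a minimal illustration that assumes the downloaded archive is either a single gzip-compressed ISO or a zip file containing at least one `.iso` entry. The function name `extract_iso` and the file names in the usage note are hypothetical.

```python
import gzip
import shutil
import zipfile
from pathlib import Path


def extract_iso(archive: Path, dest_dir: Path) -> Path:
    """Extract an ISO image from a .gz or .zip archive and return its path."""
    dest_dir.mkdir(parents=True, exist_ok=True)

    if archive.suffix == ".gz":
        # "image.iso.gz" becomes "image.iso" in the destination directory.
        target = dest_dir / archive.stem
        with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return target

    if archive.suffix == ".zip":
        with zipfile.ZipFile(archive) as bundle:
            names = [n for n in bundle.namelist() if n.lower().endswith(".iso")]
            if not names:
                raise ValueError(f"no .iso file found in {archive}")
            # Only the first ISO entry is extracted.
            return Path(bundle.extract(names[0], path=dest_dir))

    raise ValueError(f"unsupported archive type: {archive.suffix}")
```

For example, `extract_iso(Path("download.iso.gz"), Path("out"))` would write `out/download.iso`, which can then be burned to CD as in step 3.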
 
Ensure your server configuration conforms to Non-Uniform Memory Access (NUMA) specifications
 
Notes:
  • If you are not experienced with computers or have any concerns, please contact your hardware vendor.
  • Problems related to NUMA usually occur following a RAM upgrade or after an ESX/ESXi Server host installation.

You might see the following error:

The BIOS reports that NUMA node 1 has no memory. This problem is either caused by a bad BIOS or a very unbalanced distribution of memory modules.

NUMA is an architecture in which each processor has its own local memory. Local memory helps avoid the performance penalty that occurs when several processors attempt to address the same memory.
 
The main requirement is that a similar amount of memory is installed beside each processor. If the amounts differ significantly, the configuration is unbalanced and you might experience performance problems.
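
The balance requirement can be illustrated with a short sketch. It assumes you already know how much memory is installed on each NUMA node, for example from the BIOS or vendor tools. The 10% tolerance is an illustrative threshold, not a value documented by VMware, and `numa_imbalance` and `is_balanced` are hypothetical helper names.

```python
from typing import Dict


def numa_imbalance(node_memory_mb: Dict[int, int]) -> float:
    """Return the relative spread, (max - min) / max, across NUMA nodes."""
    if not node_memory_mb:
        raise ValueError("no NUMA nodes supplied")
    largest = max(node_memory_mb.values())
    smallest = min(node_memory_mb.values())
    if largest == 0:
        raise ValueError("no memory reported on any node")
    return (largest - smallest) / largest


def is_balanced(node_memory_mb: Dict[int, int], tolerance: float = 0.1) -> bool:
    """Treat the layout as balanced if the spread is within the tolerance.

    The default tolerance is an assumption chosen for illustration only.
    """
    return numa_imbalance(node_memory_mb) <= tolerance
```

A host with 64 GB on node 0 and no memory on node 1, the situation described by the BIOS error above, has a spread of 1.0 and is reported as unbalanced.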

For more information, see Using NUMA Systems with ESXi.

Additional information on NUMA is also available in the Resource Management Guide.

Additional Information

For more information about decoding machine check exceptions, see Decoding Machine Check Exception (MCE) output after a purple screen error.