ESXi S.M.A.R.T. health monitoring for hard drives



Article ID: 313033

Products

VMware vSphere ESXi

Issue/Introduction

This article provides steps to:
  • Help diagnose a local hard drive fault
  • Read the S.M.A.R.T. status of a hard drive (Self-Monitoring, Analysis, and Reporting Technology)


Symptoms:
  • The server reports a hard drive warning in POST (Power On Self Test)
  • Virtual machines cannot power on due to VMFS corruption on local hard drives
  • Very poor performance on local hard drives


Environment

VMware vSphere ESXi 

Resolution

Starting with ESXi 5.1, VMware added S.M.A.R.T. functionality to monitor hard drive health. The S.M.A.R.T. feature records various operation parameters from physical hard drives attached to a local controller. The feature is part of the firmware on the circuit board of a physical hard disk (HDD and SSD).


To read the current data from a disk:
 

  1. Open a console or SSH session to the ESXi host. For more information, see Using ESXi Shell in ESXi 5.x (2004746).
  2. Determine the device parameter to use by running the command:

    # esxcli storage core device list
     
  3. The expected output is a list with all SCSI devices seen by the ESXi host. For example:

    t10.ATA_____XXXXX________________________XXXXX
     
  4. Read the data from the device where device is a value found in step 3:

    # esxcli storage core device smart get -d device

    Note: External FC/iSCSI LUNs or virtual disks from a RAID controller might not report a S.M.A.R.T. status.
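Steps 2 through 4 can be combined into a loop over every device the host sees. The following is a minimal sketch, not a VMware-provided script. The function names device_ids and smart_all are hypothetical, and the sketch assumes that device identifiers start in the first column of the list output while attribute lines are indented.

```shell
# Print device identifiers from `esxcli storage core device list` output
# read on stdin. Assumption: identifiers start in column 1 and the
# attribute lines that follow each identifier are indented.
device_ids() {
    grep '^[^[:space:]]'
}

# Run the S.M.A.R.T. query from step 4 against every listed device.
# Devices without S.M.A.R.T. support (see the note above) may return
# an error or N/A values; the loop simply moves on to the next device.
smart_all() {
    esxcli storage core device list | device_ids | while read -r dev; do
        echo "== $dev"
        esxcli storage core device smart get -d "$dev"
    done
}

# Usage on an ESXi host:
#   smart_all
```

Defining the logic as functions keeps the parsing step separate from the esxcli calls, so the parsing can be checked against saved output before running it on a host.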


This table breaks down some example output:
 

Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       0      0          0
Write Error Count             N/A    N/A        N/A
Read Error Count              118    50         118
Power-on Hours                0      0          0
Power Cycle Count             100    0          100
Reallocated Sector Count      100    3          100
Raw Read Error Rate           118    50         118
Drive Temperature             27     0          34
Driver Rated Max Temperature  N/A    N/A        N/A
Write Sectors TOT Count       N/A    N/A        N/A
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A


Note: A physical hard drive can have up to 30 different attributes (the example above shows only 13). For more information, see How does S.M.A.R.T. function of hard disks Work?

Note: The preceding link was correct as of September 2, 2014. If you find the link is broken, provide feedback and a VMware employee will update the link.


A raw value is one of the following:

  • A number from 0 to 253
  • A word (for example, N/A or OK)
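Because a value can be either a number or a word, a script that reads the output needs to handle both cases. The sketch below is illustrative only. The function names smart_field and is_number are hypothetical, and the parsing assumes that the columns printed by esxcli storage core device smart get are separated by two or more spaces.

```shell
# Print one column for one parameter from `smart get` output read on stdin.
#   $1 = parameter name exactly as printed (for example "Drive Temperature")
#   $2 = column number: 2 = Value, 3 = Threshold, 4 = Worst (default: 2)
# Assumption: columns are separated by two or more spaces, so parameter
# names containing single spaces stay in one field.
smart_field() {
    awk -F'  +' -v name="$1" -v col="${2:-2}" '$1 == name { print $col }'
}

# Succeed if the argument consists only of digits; fail for words such
# as N/A or OK and for an empty string.
is_number() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Usage on an ESXi host (device is a value found in step 3):
#   temp=$(esxcli storage core device smart get -d device | smart_field "Drive Temperature")
#   if is_number "$temp"; then echo "Temperature value: $temp"; fi
```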

 

Column descriptions

Note: The values returned and their meaning for each of these columns can vary by manufacturer. For more information, please consult your hardware supplier.

  • Parameter

    This is a translation from the attribute ID to human-readable text. For example:

    hex 0xE7 = decimal 231 = "Drive Temperature"

    For more information, see the Known ATA S.M.A.R.T. attributes section of the S.M.A.R.T. Wikipedia article.

    Note: The preceding link was correct as of September 2, 2014. If you find the link is broken, provide feedback and a VMware employee will update the link.
     
  • Value

    This is the raw value reported by the disk. To illustrate a simple Value using the example above, the Drive Temperature is reported as 27, which means 27 degrees Celsius.

    A Value can be either a number (0-253) or a word (for example, N/A or OK).
     
  • Threshold

    The (failure) limit for the attribute.
     
  • Worst

    The highest Value ever recorded for the parameter.
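Several rows in the example carry a Threshold of 0 or N/A, which gives nothing to compare against. As a rough sketch, the output can be reduced to the rows that have a numeric, non-zero Threshold. The function name smart_thresholds is hypothetical, and the parsing assumes columns separated by two or more spaces. The sketch does not decide whether a row indicates a fault, because the direction of the comparison can vary by manufacturer.

```shell
# From `smart get` output on stdin, print only the rows whose Threshold
# column is a number greater than 0, in the form
#   Parameter: value=V threshold=T worst=W
# Header and separator rows are skipped because their third field is
# not a number. Interpretation of each row is left to the reader.
smart_thresholds() {
    awk -F'  +' '$3 ~ /^[0-9]+$/ && $3 > 0 {
        printf "%s: value=%s threshold=%s worst=%s\n", $1, $2, $3, $4
    }'
}

# Usage on an ESXi host (device is a value found in step 3):
#   esxcli storage core device smart get -d device | smart_thresholds
```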

 

smartd daemon

ESXi 5.1 also has the /sbin/smartd daemon installed in the DCUI. The tool has no command-line switches and does not interact with the console. When you run it from the shell, it reports the S.M.A.R.T. status in the /var/log/syslog.log file.

For example:

XXXX-XX-28T10:15:12Z smartd: [warn] t10.ATA_____XXXXX___________________XXXXX________: below MEDIA WEAROUT threshold (0)
XXXX-XX-28T10:15:12Z smartd: [warn] t10.ATA_____XXXXX___________________XXXXX________: above TEMPERATURE threshold (27 > 0)
XXXX-XX-28T10:15:12Z smartd: [warn] t10.ATA_____YYYYY________________________YYYYY: above TEMPERATURE threshold (113 > 0)


Notes:

  • You can stop the daemon by pressing Ctrl+C.
  • Treat logged events with caution. In the example above, all three warnings are irrelevant because each one is evaluated against a threshold of 0. The output can vary greatly between manufacturers and disk models.
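To separate potentially meaningful smartd warnings from the kind shown in the example, the log can be filtered. This is a minimal sketch that assumes the message layout seen in the example lines. The function name smartd_warnings is hypothetical, and hiding warnings with a threshold of 0 is a choice made for this illustration, not documented smartd behavior.

```shell
# Print smartd warnings from a syslog file, skipping those that were
# evaluated against a threshold of 0.
#   $1 = log file (default: /var/log/syslog.log)
# Assumption: warnings end in "(0)" or "(N > 0)" when the threshold
# is 0, as in the example lines above.
smartd_warnings() {
    grep 'smartd: \[warn\]' "${1:-/var/log/syslog.log}" |
        grep -v -e 'threshold (0)' -e '> 0)'
}

# Usage on an ESXi host:
#   smartd_warnings
#   smartd_warnings /path/to/saved/syslog.log
```

An empty result means that every logged warning was measured against a threshold of 0. It does not by itself confirm that the disk is healthy.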



Additional Information

The vm-support bundle also captures S.M.A.R.T. details in the smartinfo.sh.txt file. The file can be found in the commands/ directory.
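The captured file can be read without unpacking the whole bundle. The following sketch assumes that the bundle is a gzip-compressed tar archive and that the file sits under a commands/ directory somewhere inside it. The function names find_smartinfo and show_smartinfo are hypothetical.

```shell
# Print the path of smartinfo.sh.txt inside a vm-support bundle.
#   $1 = bundle file (assumed to be a gzip-compressed tar archive)
find_smartinfo() {
    tar -tzf "$1" | grep 'commands/smartinfo\.sh\.txt$'
}

# Write the captured S.M.A.R.T. details to stdout without extracting
# the rest of the bundle. Uses the first matching path if several exist.
show_smartinfo() {
    tar -xzOf "$1" "$(find_smartinfo "$1" | head -n 1)"
}

# Usage, where bundle.tgz stands for the vm-support bundle file:
#   show_smartinfo bundle.tgz
```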