vSAN Health Service - Physical Disk Health - Vendor Reported Drive Health

Article ID: 367770

Products

VMware vSAN

Issue/Introduction

This article provides detailed information about the vSAN Health Service - Physical Disk Health - Vendor Reported Drive Health check, introduced in vSAN 8.0U3. The check uses predictive failure analysis APIs provided by hardware vendors to identify disks that are likely to fail, before they fail or cause errors. It is integrated with the vSAN Health Service so that predictive failure warnings appear directly in the vSAN Skyline Health UI.

 

Environment

vSAN 8.0U3

Resolution

Q: What does the vSAN Vendor Reported Drive Health Check do?

The health check uses APIs provided by vendors to identify disks that are at risk of failing in the near future. It integrates with the Proactive Hardware Management Service to analyze and report on the predictive failure data, so administrators can act before an actual disk failure occurs, which can prevent data loss and downtime.

Q: What does it mean when it is in an error state?

When the health check is in an error state, it indicates that one or more disks within the vSAN cluster have been identified as likely to fail soon, based on the predictive failure analysis provided by the OEM vendor. This state serves as an early warning to administrators, suggesting that immediate attention is required to assess and address the potential disk failure.

Q: How does one troubleshoot and fix the error state?

To troubleshoot and fix an error state reported by the health check, vSAN always recommends removing or replacing the problematic disks to protect your data:

  1. Follow the KB "Requirements when replacing disks in a vSAN cluster" to safely and properly remove the disk with the predicted failure from the environment.
  2. If the disk(s) were removed properly, a resync starts.
  3. Report the disk issue to the vendor, including the Vendor health update ID, Vendor health update info ID, and Vendor message ID.

If immediate replacement is not feasible, use row-based silence actions to temporarily mute the warning for the specific disk, allowing the user to address the issue without constant alerts. However, this should be a temporary measure until the disk can be safely replaced.

Note: We highly recommend not silencing alerts. A silenced alert is easy to forget, and the underlying issue may then go unaddressed until it becomes a more serious problem.

Important Note: The "Occured Time" column indicates when the host event was triggered, not the exact time the event occurred on the vendor disk (there is typically a 10-minute delay). To view the actual timestamp from the vendor disk:

  1. SSH to the vCenter Server (VC) command line and navigate to /var/log/vmware/vsan-health/
  2. In vmware-vsan-health-service.log, search for log entries containing "Receive event" and locate the relevant event.
  3. In that event's log entry, search for the keyword VendorTimestamp.
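
The log search in the steps above can be sketched as a small Python script. This is a minimal illustration, not an official tool: the function name find_vendor_events is hypothetical, and the sketch assumes that "Receive event" and VendorTimestamp appear on the same log line, which this article does not confirm. If the event spans several lines, search for each string separately instead.

```python
from pathlib import Path
from typing import Iterable, Iterator

# Directory and file name are taken from the steps above.
LOG_PATH = Path("/var/log/vmware/vsan-health/vmware-vsan-health-service.log")


def find_vendor_events(lines: Iterable[str],
                       event_marker: str = "Receive event",
                       keyword: str = "VendorTimestamp") -> Iterator[str]:
    """Yield log lines that contain both the event marker and the keyword.

    Assumption: both strings occur on one line of the log. The function
    only filters text; it does not parse the log format.
    """
    for line in lines:
        if event_marker in line and keyword in line:
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Only read the log if it exists on this machine.
    if LOG_PATH.is_file():
        with LOG_PATH.open(encoding="utf-8", errors="replace") as log:
            for match in find_vendor_events(log):
                print(match)
```

Each printed line is a candidate event. Pick the one that matches the disk in question and read the value that follows VendorTimestamp.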