vSAN Health Service - Physical Disk Health

Products

VMware vSAN

Issue/Introduction

This article explains the Physical Disk Health - Overall Disks Health check in the vSAN Health Service and provides details on why it might report an error.

Environment

VMware vSAN (All Versions)

Resolution

Q: What does the Physical Disk Health - Overall Disks Health check do?

This check verifies the operational status of the physical disks on all hosts in the vSAN cluster.

Q: What does it mean when it is in an error state?

If this check fails, vSAN or vSAN Direct can no longer use the disk. Possible reasons include physical disk damage, a failure to read the disk metadata, a vSAN software issue that prevents vSAN from using the disk, or the disk being in read-only mode.

Q: What does it mean when the operational state is Impending permanent disk failure?

Dying Disk Handling (DDH) in vSAN continuously monitors the health of disks and disk groups in order to detect an impending disk failure or a poorly performing disk group. When such conditions are detected, vSAN marks the disk or disk group as unhealthy and might trigger data evacuation from the affected disk or disk group. Such disks and disk groups display an operational state of Impending permanent disk failure. For more information, see Dying Disk Handling (DDH) in vSAN.

Note: This state is not available for vSAN Direct.

Q: What does it mean when the operational state is "Stuck I/O is detected"?

If I/O is stuck or lost on the storage controller or the storage disk, the ESXi storage stack tries to abort it using a task management request. If such lost I/O is found on a host, vSAN takes the disk offline to ensure that it doesn't affect other hosts in the cluster. If the cache device in a non-dedup disk group encounters stuck I/O, or if any disk in a dedup disk group encounters stuck I/O, the entire disk group is set to the offline state. To resolve the issue, migrate the workload and power cycle the host. After the power cycle, collect the vm-support bundle along with the driver/firmware logs. These issues result from faulty hardware or firmware bugs, so the customer needs to collect the hardware logs (storcli and/or sascli logs) and open a case with the hardware vendor. For more information, see How to handle lost or stuck I/O on a host in vSAN cluster.
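The offline rule described above can be sketched as a small decision function. This is only an illustration of the rule as stated in this article, not vSAN code; the function name and parameters are hypothetical.

```python
def offline_scope(dedup_enabled: bool, is_cache_device: bool) -> str:
    """Return what goes offline when a disk reports stuck I/O.

    Hypothetical helper that mirrors the rule in this article:
    - any disk in a dedup disk group -> the whole disk group
    - the cache device in a non-dedup disk group -> the whole disk group
    - otherwise -> only the affected disk
    """
    if dedup_enabled or is_cache_device:
        return "disk group"
    return "disk"
```

For example, a capacity disk in a non-dedup disk group maps to "disk", while the same disk in a dedup disk group maps to "disk group".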

Q: What does it mean when the operational state is "Device is experiencing I/O timeout"?

vSAN reports this issue when it detects a potential stuck I/O (i.e., the I/O exceeds a timeout period), which might lead to a stuck I/O scenario. No immediate action is required for this disk if the issue appears only once and is resolved. If it leads to a stuck I/O, see How to handle lost or stuck I/O on a host in vSAN cluster for more information.

As of vSAN version 8.0 U3, the following additional checks are available:

Q: What does it mean when the operational state is "The disk encountered SMART disk failure."?

vSAN reports this issue when the disk reports SMART impending failures. The disk or disk group is evacuated and permanently unmounted, and the customer needs to replace the problematic disk.

Q: What does it mean when the operational state is "The disk encountered a log congestion error"?

vSAN reports this issue when it detects excessively high log congestion on this disk group. vSAN evacuates the data on the disk group and remounts it. If the same issue occurs again within a week, vSAN evacuates and rebuilds the disk group. No action is needed from the user.
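The two-step remediation described above can be illustrated with a short sketch. It is not vSAN's implementation: the function name is hypothetical, and treating "within a week" as an inclusive seven-day window is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: "within a week" is modeled as an inclusive 7-day window.
RECURRENCE_WINDOW = timedelta(days=7)


def log_congestion_action(now: datetime, previous: Optional[datetime]) -> str:
    """Return the remediation this article describes for a log congestion error.

    `previous` is the time of the last log congestion error on the same
    disk group, or None if there was none.
    """
    if previous is not None and now - previous <= RECURRENCE_WINDOW:
        # Same issue seen again within the window: evacuate and rebuild.
        return "evacuate and rebuild"
    # First occurrence, or the last one is older than the window.
    return "evacuate and remount"
```

In both cases the article states that no action is needed from the user; the sketch only shows which automatic remediation applies.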

Q: What does it mean when the operational state is "Impending permanent disk failure"?

Impending permanent disk failure is handled the same way as SMART disk failure.
If vSAN detects high latency from the disk, it evacuates the data from the disk or disk group and permanently unmounts it, and the customer needs to replace the problematic disk.

Q: What does it mean when the operational state is "The disk encountered an unrecoverable read error."?

vSAN reports this issue when a vSAN metadata read encounters an unrecoverable read error from the disk. vSAN evacuates the data from the disk or disk group and rebuilds it.

Q: What does it mean when the operational state is "Internal software(i.e. LSOM meta flusher in disk) is stuck"?

vSAN reports this status when internal software (i.e. the LSOM meta flusher in the disk) is stuck. We recommend that you migrate the workload and power cycle the host.

Q: What does it mean when the operational state is "Internal software(i.e. PLOG elevator in disk) is stuck"?

vSAN reports this status when internal software (i.e. the PLOG elevator in the disk) is stuck. We recommend that you migrate the workload and power cycle the host.

Q: What does it mean when the operational state is "Fail to rebuild disk during disk remediation"?

vSAN reports this issue when the disk fails to rebuild during disk remediation. To diagnose and remediate the issue, check vmkernel.log in /var/run/log/ and search for the disk name. If the logs are unclear, collect support bundles and contact VMware Support promptly.

Q: What does it mean when the operational state is "Fail to unmount disk during disk remediation"?

vSAN reports this issue when the disk fails to unmount during disk remediation. To diagnose and remediate the issue, check vmkernel.log in /var/run/log/ and search for the disk name. If the logs are unclear, collect support bundles and contact VMware Support promptly.
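For both remediation failures above (rebuild and unmount), the log search can be sketched as a plain substring match. This is a minimal sketch, not a VMware tool: the function name is hypothetical, and it assumes the disk name appears verbatim in the log lines.

```python
from typing import List


def find_disk_lines(log_path: str, disk_name: str) -> List[str]:
    """Return every line of the log file that mentions the given disk name."""
    matches = []
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(log_path, "r", errors="replace") as log:
        for line in log:
            if disk_name in line:
                matches.append(line.rstrip("\n"))
    return matches
```

To use it on a host, call find_disk_lines with the path to vmkernel.log under /var/run/log/ and the name of the affected disk as shown in the health check.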

Q: How does one troubleshoot and fix the error state?

You need to examine the information displayed as part of the health check.

For example:

Is the disk offline or in a permanent failure state, indicating physical disk damage?
Is it an issue when trying to read the metadata of the drive? This implies that the drive is offline and unavailable for use.
Is it the vSAN software state that is the root cause, which in all likelihood will impact all of the disks on this host?
Is the disk in read-only mode while Out-of-Band management (e.g., iDRAC, iLO, iBMC) simultaneously alerts about the remaining life of an SSD (Media Wearout Indicator low) for the specific disk? In this case, seek the vendor's help and consider replacing the device.

Each of these individual checks must be considered to determine the corrective course of action. Some of the checks imply that the drive is offline, others imply that the drive is still online, but some corrective action might be needed.
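As a quick reference, the operational states covered in this article and the follow-up each one calls for can be collected in a lookup table. The sketch below only summarizes this article; the names STATE_ACTIONS and recommended_action are hypothetical, and the state strings are matched after removing quotes, a trailing period, and case differences.

```python
def _normalize(state: str) -> str:
    # Drop surrounding whitespace, straight or curly quotes, and a trailing period.
    return state.strip().strip('"\u201c\u201d').rstrip(".").lower()


POWER_CYCLE = "Migrate the workload and power cycle the host."
REPLACE_DISK = "Replace the disk after vSAN evacuates and unmounts it."
CHECK_LOGS = ("Check vmkernel.log in /var/run/log/ for the disk name; "
              "contact VMware Support if the logs are unclear.")
DEFAULT_ACTION = "Examine the details shown in the health check."

# Summary of the follow-up described in this article for each state.
STATE_ACTIONS = {
    _normalize(state): action
    for state, action in {
        "Stuck I/O is detected": POWER_CYCLE + " Then collect logs and open a hardware vendor case.",
        "Device is experiencing I/O timeout": "No immediate action if it appears once and is resolved.",
        "The disk encountered SMART disk failure.": REPLACE_DISK,
        "Impending permanent disk failure": REPLACE_DISK,
        "The disk encountered a log congestion error": "No action needed; vSAN remediates automatically.",
        "The disk encountered an unrecoverable read error.": "vSAN evacuates and rebuilds the disk or disk group.",
        "Internal software(i.e. LSOM meta flusher in disk) is stuck": POWER_CYCLE,
        "Internal software(i.e. PLOG elevator in disk) is stuck": POWER_CYCLE,
        "Fail to rebuild disk during disk remediation": CHECK_LOGS,
        "Fail to unmount disk during disk remediation": CHECK_LOGS,
    }.items()
}


def recommended_action(state: str) -> str:
    """Look up the follow-up for an operational state, with a generic fallback."""
    return STATE_ACTIONS.get(_normalize(state), DEFAULT_ACTION)
```

A state that is not listed falls back to the general guidance above: examine the information displayed as part of the health check.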

Additional Information

For more information on collecting VMware vSAN logs, see Collecting vSAN support logs and uploading to VMware by Broadcom.

For more information about how vSAN handles dying disks, see Dying Disk Handling (DDH) in vSAN.

For more information about disk failures in a vSAN cluster with deduplication, see Identifying specific disk failure in a vSAN deduplication cluster.

Also, see:

vSAN Monitoring and Troubleshooting
VMware Ruby vSphere Console Command Reference for vSAN (Attached to this KB)
VMware vSAN Design and Sizing Guide

For a complete list of vSAN Skyline Health checks, see the KB article vSAN Skyline Health Check Information.

Attachments

VMware-Ruby-vSphere-Console-Command-Reference-For-Virtual-SAN.pdf