Dying Disk Handling (DDH) in vSAN

Products

VMware vSAN

Issue/Introduction

This article provides information about vSAN Dying Disk Handling (DDH).

Environment

VMware vSAN 7.x

VMware vSAN 8.x

Resolution

*** Please note that DDH does not currently monitor or act on vSAN disk group cache disks. This will be added in an upcoming version ***

The Dying Disk Handling (DDH) feature in vSAN continuously monitors the health of disks and disk groups in order to detect an impending disk failure or a poorly performing disk group. DDH diagnoses disk and disk group health by detecting either excessive IO latency for a vSAN disk or maximum log congestion that vSAN determines to be due to log leak issues in a vSAN disk group over an extended period.

Unhealthy disks and disk groups are marked as such, and from that point on they are no longer used for new data placement. If an unhealthy disk belongs to a disk group with deduplication and compression enabled, the whole disk group is marked as unhealthy. What vSAN does with the data on these disks or disk groups depends on the configured policy and the compliance state of the objects that have components on them.

If a component on the unhealthy disk or disk group belongs to an object that can tolerate the failure of this disk or disk group, then vSAN will immediately mark that component as “absent” to avoid any impact on the performance of writes to that object. This means that the object is in a failure condition and will not be able to tolerate any additional failures if it is configured with the default policy of failuresToTolerate = 1.

Such components are repaired lazily by vSAN (after a 60-minute timeout). If, on the other hand, a component on the dying disk or disk group is required to maintain availability or quorum of a vSAN object, evacuation is triggered immediately. vSAN applies a best-effort procedure to evacuate all the “active” components from a “dying” disk, but this process may fail if there are not enough resources in the cluster or if the components belong to inaccessible objects.
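The per-component decision described above can be summarized in a short Python sketch. This is only an illustration of the rule as stated in this article, not vSAN's internal implementation; the names Component, Action, and ddh_action are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

# Timeout taken from the article: absent components are repaired lazily.
REPAIR_DELAY_MINUTES = 60


class Action(Enum):
    MARK_ABSENT = "mark absent, repair after timeout"
    EVACUATE_NOW = "evacuate immediately (best effort)"


@dataclass
class Component:
    # Hypothetical flag: True when the owning object keeps availability
    # and quorum without this component.
    object_tolerates_loss: bool


def ddh_action(component: Component) -> Action:
    """Return the handling the article describes for a component on a dying disk."""
    if component.object_tolerates_loss:
        # The object can absorb the failure, so the component is marked
        # absent right away to protect write performance.
        return Action.MARK_ABSENT
    # The component is needed for availability or quorum.
    return Action.EVACUATE_NOW


if __name__ == "__main__":
    print(ddh_action(Component(object_tolerates_loss=True)).value)
    print(ddh_action(Component(object_tolerates_loss=False)).value)
```

Note that immediate evacuation is best effort only: it can still fail for lack of cluster resources or because the component belongs to an inaccessible object.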

When DDH detects that a disk has exceeded the IO latency threshold during the monitoring interval, vSAN generates a VOB and logs a message to the vsandevicemonitord.log file in the /var/run/log directory. The log entry below is an example for a disk that needs to be replaced once the required data evacuation is complete and the disk is in an "evacuated" state:

WARNING - WRITE Average Latency on VSAN device <NAA disk name> has exceeded threshold value <IO latency threshold for disk> us <# of intervals with excessive IO latency> times.

When DDH detects that a caching tier has excessive log congestion during the monitoring interval, vSAN generates a VOB and logs a message to the vsandevicemonitord.log file. Excessive log congestion messages are in this format:

WARNING - Maximum log congestion on VSAN device <NAA disk name> <current intervals with excessive log congestion>/<intervals required to be unhealthy>
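To tell the two causes apart when reviewing a copy of vsandevicemonitord.log, the two message formats above can be matched with regular expressions. The following Python sketch assumes that the placeholders in the formats are filled with a whitespace-free device name and plain integers, and that a line may carry a prefix such as a timestamp; the function name classify_ddh_line and the sample values are hypothetical.

```python
import re
from typing import Optional

# Patterns derived from the two message formats shown in this article.
# Assumption: device names contain no whitespace and counters are integers.
LATENCY_RE = re.compile(
    r"WARNING - (?P<op>\w+) Average Latency on VSAN device (?P<device>\S+) "
    r"has exceeded threshold value (?P<threshold_us>\d+) us (?P<count>\d+) times"
)
CONGESTION_RE = re.compile(
    r"WARNING - Maximum log congestion on VSAN device (?P<device>\S+) "
    r"(?P<current>\d+)/(?P<required>\d+)"
)


def classify_ddh_line(line: str) -> Optional[dict]:
    """Return the cause and parsed fields for a DDH warning, or None."""
    m = LATENCY_RE.search(line)
    if m:
        return {
            "cause": "io_latency",
            "device": m["device"],
            "op": m["op"],
            "threshold_us": int(m["threshold_us"]),
            "intervals": int(m["count"]),
        }
    m = CONGESTION_RE.search(line)
    if m:
        return {
            "cause": "log_congestion",
            "device": m["device"],
            "current": int(m["current"]),
            "required": int(m["required"]),
        }
    return None


if __name__ == "__main__":
    # Hypothetical sample line; the device name and numbers are made up.
    sample = ("WARNING - Maximum log congestion on VSAN device "
              "naa.0000000000000001 2/5")
    print(classify_ddh_line(sample))
```

search() is used instead of match() so that any prefix the log writer adds before "WARNING" does not prevent a match.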

In both of these situations, vSAN triggers the evacuation of some or all data from the affected disks or disk groups. The “overall disks health” section in the vSAN health monitoring UI then reports one or more of the following operational states for the affected disks or disk groups, along with recommendations for the user. The recommendations after the evacuation is complete differ depending on whether vSAN detected excessive IO latencies or excessive log congestion.

The vSAN health monitoring UI reports the following operational states:

Impending permanent disk failure, data is being evacuated (Health state - Yellow)

vSAN is evacuating the required data from this disk due to an impending permanent disk failure state. As long as sufficient resources are available in the rest of the vSAN cluster, vSAN will successfully evacuate all the “healthy” components from the “dying” disk, but this may increase the overall datastore usage. If the cause is excessive IO latencies, plan for disk or disk group replacement. If the cause is high log congestion, prepare for a temporary increase in cluster usage as a result of the evacuation, assuming there is enough space in the cluster to host the evacuated data; in this case the disk group should be removed from vSAN and then added back. In either case, wait for the evacuation to complete before removing the “dying” disk from the vSAN cluster.

Impending permanent disk failure, data evacuation failed due to insufficient resources (Health state - Red)

Evacuation failed due to insufficient resources in the cluster. Add the required capacity to the fault domain that contains the dying disk. Evacuation will proceed automatically once the additional resources are added. The active data on the affected disks will stay usable.

Impending permanent disk failure, data evacuation failed due to inaccessible objects (Health state - Red)

vSAN has evacuated everything that could be evacuated. All the remaining components on this disk belong to objects that were inaccessible for reasons other than this DDH workflow. Users should examine the remaining data on the disk to decide whether it is useful and needs to be recovered with help from VMware, or whether it can be purged. Many inaccessible object issues are caused by inaccessible swap objects.

Once all the inaccessible objects have been purged or recovered, DDH evacuation will proceed automatically and the disk will transition either into the “data evacuation completed” state or, if sufficient resources are not available, into the “data evacuation failed” state. If resolving this situation is delayed, the only risk is losing useful or important data that still resides on the disk, especially since the disk could fail permanently. The presence of this disk or disk group in the cluster should have no impact on the performance of the rest of the cluster, since none of the accessible VMs will have any data left on this unhealthy disk.

Impending permanent disk failure, data evacuation completed (Health state - Yellow)

This is the disk state when all components required to maintain object accessibility have been evacuated from the disk and the remaining components have been marked "absent" by vSAN. The entries in the vsandevicemonitord.log file should help you determine whether the disk was marked unhealthy due to excessive log congestion or excessive IO latencies. If the disk group was marked unhealthy due to excessive log congestion, remove it from the vSAN cluster and add it back; it should be in a usable state for vSAN after it is added back. If, on the other hand, the disk was marked unhealthy due to excessive IO latencies, the disk should no longer be used for vSAN.

If a "dying" disk belongs to a deduplication enabled disk group, the whole disk group is marked unhealthy. After the required data evacuation, the vsandevicemonitord.log file will help you determine which disks in the disk group were observed to have excessive IO latencies; only those disks need to be replaced. Collect the output from the vsandevicemonitord.log file, which contains the SMART logging information as well as the high latencies observed by vSAN, and send this information to the disk vendor along with the disk.
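As a quick reference, the four operational states and their recommendations can be condensed into a lookup table. The Python sketch below only restates the guidance from this article; the dictionary layout, the shortened recommendation texts, the function name recommend, and the cause labels "io_latency" and "log_congestion" are hypothetical and are not identifiers used by vSAN.

```python
from typing import Optional, Tuple

PREFIX = "Impending permanent disk failure, "

# State suffix -> (health color, condensed recommendation from this article).
DDH_STATES = {
    "data is being evacuated":
        ("Yellow", "Wait for the evacuation to complete."),
    "data evacuation failed due to insufficient resources":
        ("Red", "Add capacity to the fault domain that contains the dying disk."),
    "data evacuation failed due to inaccessible objects":
        ("Red", "Purge or recover the remaining inaccessible objects."),
    "data evacuation completed":
        ("Yellow", "Check vsandevicemonitord.log for the cause."),
}

# Once evacuation has completed, the follow-up depends on the cause.
POST_EVACUATION = {
    "io_latency": "Replace the disk; do not reuse it for vSAN.",
    "log_congestion": "Remove the disk group from vSAN and add it back.",
}


def recommend(state: str, cause: Optional[str] = None) -> Tuple[str, str]:
    """Map a reported state (and optional cause) to color and recommendation."""
    key = state[len(PREFIX):] if state.startswith(PREFIX) else state
    color, advice = DDH_STATES[key]
    if key == "data evacuation completed" and cause in POST_EVACUATION:
        advice = POST_EVACUATION[cause]
    return color, advice


if __name__ == "__main__":
    print(recommend(PREFIX + "data evacuation completed", "log_congestion"))
```

The cause is consulted only in the "data evacuation completed" state, because that is the point at which the recommendations for excessive IO latencies and excessive log congestion diverge.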

Additional Information

vSAN Health Service - Data Health – vSAN Object Health
How to handle lost or stuck I/O on a host in vSAN cluster