*** Please note that DDH does not currently monitor or act on vSAN disk group cache disks. This will be added in an upcoming version ***
The Dying Disk Handling (DDH) feature in vSAN continuously monitors the health of disks and disk groups to detect an impending disk failure or a poorly performing disk group. DDH diagnoses disk and disk group health by detecting, over an extended period, either excessive IO latency on a vSAN disk or maximum log congestion in a vSAN disk group that vSAN attributes to log leak issues.
Unhealthy disks and disk groups are marked as such and are no longer used for new data placement. If an unhealthy disk belongs to a disk group with deduplication and compression enabled, the whole disk group is marked as unhealthy. The action vSAN takes for the data on these disks or disk groups depends on the configured policy and the compliance state of the objects that have components on them.
If a component on the unhealthy disk or disk group belongs to an object that can tolerate the failure of this disk or disk group, then vSAN will immediately mark that component as “absent” to avoid any impact on the performance of writes to that object. This means that the object is in a failure condition and will not be able to tolerate any additional failures if it is configured with the default policy of failuresToTolerate = 1.
vSAN repairs such components lazily, after a 60-minute timeout. However, if a component on the dying disk or disk group is required to maintain availability or quorum of a vSAN object, evacuation is triggered immediately. vSAN applies a best-effort procedure to evacuate all the “active” components from a “dying” disk, but this process may fail if there are not enough resources in the cluster or if the components belong to inaccessible objects.
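The handling rules above can be summarized as a small decision function. This is a minimal sketch for illustration only, not vSAN's implementation: the `Component` type, its `object_tolerates_loss` flag, and the `choose_action` function are hypothetical names, and the only value taken from the text is the 60-minute repair timeout.

```python
from dataclasses import dataclass
from enum import Enum

# Repair delay for "absent" components, as stated in the text above.
REPAIR_DELAY_MINUTES = 60


class Action(Enum):
    MARK_ABSENT = "mark absent, repair after the timeout"
    EVACUATE_NOW = "evacuate immediately (best effort)"


@dataclass
class Component:
    name: str
    # Hypothetical flag: True when the owning object keeps availability
    # and quorum without this component.
    object_tolerates_loss: bool


def choose_action(component: Component) -> Action:
    """Pick the handling path for one component on an unhealthy disk."""
    if component.object_tolerates_loss:
        # The object survives this failure, so the component is marked
        # absent and repaired lazily after REPAIR_DELAY_MINUTES.
        return Action.MARK_ABSENT
    # The component is needed for availability or quorum.
    return Action.EVACUATE_NOW
```

For example, `choose_action(Component("c1", object_tolerates_loss=False))` returns `Action.EVACUATE_NOW`. The sketch does not model the cases where evacuation fails, such as insufficient cluster resources or inaccessible objects.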
When DDH detects that a disk has exceeded the IO latency threshold during the monitoring interval, vSAN generates a VOB and logs a message to the vsandevicemonitord.log file in the /var/run/log directory. The log entry below is an example for a disk that needs to be replaced once the required data evacuation is complete and the disk is in an “evacuated” state:
WARNING - WRITE Average Latency on VSAN device <NAA disk name> has exceeded threshold value <IO latency threshold for disk> us <# of intervals with excessive IO latency> times.
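When scanning vsandevicemonitord.log for these entries, a regular expression built from the message format above can pull out the device name, threshold, and interval count. The following is a minimal sketch under stated assumptions: the function name `parse_latency_warning` is hypothetical, the pattern assumes the placeholders are filled with a whitespace-free device name and plain integers, and the sample line uses made-up values rather than real log output.

```python
import re
from typing import Optional

# Pattern derived from the message template; the operation word (WRITE in
# the template) is captured generically as an assumption.
LATENCY_RE = re.compile(
    r"WARNING - (?P<op>\w+) Average Latency on VSAN device (?P<device>\S+) "
    r"has exceeded threshold value (?P<threshold_us>\d+) us "
    r"(?P<intervals>\d+) times\."
)


def parse_latency_warning(line: str) -> Optional[dict]:
    """Return the fields of a latency warning, or None if the line differs."""
    match = LATENCY_RE.search(line)
    if match is None:
        return None
    return {
        "op": match.group("op"),
        "device": match.group("device"),
        "threshold_us": int(match.group("threshold_us")),
        "intervals": int(match.group("intervals")),
    }


# Illustrative line only: the device name and numbers are invented.
sample = (
    "WARNING - WRITE Average Latency on VSAN device naa.0000000000000001 "
    "has exceeded threshold value 50000 us 2 times."
)
print(parse_latency_warning(sample))
```

Lines that do not match the template return `None`, so the function can be applied to every line of the log file without pre-filtering.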
When DDH detects that a caching tier has excessive log congestion during the monitoring interval, vSAN generates a VOB and logs a message to the vsandevicemonitord.log file. Excessive log congestion messages have this format:
WARNING - Maximum log congestion on VSAN device <NAA disk name> <current intervals with excessive log congestion>/<intervals required to be unhealthy>
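The trailing pair of numbers in this message shows how close the device is to being marked unhealthy. A minimal sketch of extracting it follows. The names `CongestionWarning` and `parse_congestion_warning` are hypothetical, the pattern assumes both counters are plain integers, and the sample line uses invented values.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the message template above.
CONGESTION_RE = re.compile(
    r"WARNING - Maximum log congestion on VSAN device (?P<device>\S+) "
    r"(?P<current>\d+)/(?P<required>\d+)"
)


class CongestionWarning(NamedTuple):
    device: str
    current: int   # intervals with excessive log congestion so far
    required: int  # intervals required to be considered unhealthy

    @property
    def reached_limit(self) -> bool:
        """True once the current count meets the required count."""
        return self.current >= self.required


def parse_congestion_warning(line: str) -> Optional[CongestionWarning]:
    """Return the parsed warning, or None if the line does not match."""
    match = CONGESTION_RE.search(line)
    if match is None:
        return None
    return CongestionWarning(
        device=match.group("device"),
        current=int(match.group("current")),
        required=int(match.group("required")),
    )


# Illustrative line only: the device name and counters are invented.
sample = "WARNING - Maximum log congestion on VSAN device naa.0000000000000001 3/5"
warning = parse_congestion_warning(sample)
```

Here `warning.reached_limit` is `False` because 3 of the 5 required intervals have been observed. Treating `current >= required` as the unhealthy condition is an assumption drawn from the placeholder names in the template.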
In both of these situations, vSAN triggers the evacuation of some or all data from the affected disks or disk groups. The “overall disks health” section in the vSAN health monitoring UI then reports one or more of the following operational states for the affected disks or disk groups, along with recommendations for the user. The recommendations after the evacuation is complete differ depending on whether vSAN detected excessive IO latencies or excessive log congestion.