vSAN Health Service - Capacity utilization – What if the most consumed host fails

Products

VMware vSAN

Issue/Introduction

This article explains the "Capacity utilization – What if the most consumed host fails" check in the vSAN Health Service and describes why it might report a warning or an error.

Note: In vCenter Server versions earlier than 7.0 U2, this health check was named "After one additional host failure".

Symptoms:

vSphere reports the following warning or error for the vSAN cluster: "What if the most consumed host fails".
vSAN Skyline Health shows the following warning or error: "What if the most consumed host fails".

Environment

VMware vSAN (All Versions)

Cause

Component utilization or disk space utilization would exceed the threshold if a single host failed.

Threshold

Component utilization

Red: Component utilization after a single host failure > 90 %
Yellow: Component utilization after a single host failure > 80 %

Disk space utilization (Capacity utilization)

Red: Disk space utilization after a single host failure > 90 %
Yellow: Disk space utilization after a single host failure > 70 %

In versions of vCenter Server 7.0 U2 and later, the disk space usage threshold can be customized under “Reservations and Alerts”.
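The threshold rules above can be sketched as a small classifier. This is an illustrative sketch only, not the vSAN Health Service implementation: the function name `classify` and the constant names are hypothetical, and the default values are simply the thresholds listed in this article. The yellow threshold is a parameter because it differs between the two checks, and the disk space threshold can be customized in vCenter Server 7.0 U2 and later.

```python
# Hypothetical helper: maps a post-failure utilization percentage to a
# health status using the thresholds listed in this article.

COMPONENT_YELLOW_PCT = 80.0   # component utilization, yellow threshold
DISK_SPACE_YELLOW_PCT = 70.0  # disk space utilization, yellow threshold
RED_PCT = 90.0                # red threshold for both checks


def classify(utilization_pct, yellow_pct, red_pct=RED_PCT):
    """Return 'red', 'yellow', or 'green' for a utilization percentage."""
    if utilization_pct > red_pct:
        return "red"
    if utilization_pct > yellow_pct:
        return "yellow"
    return "green"


# The same 75% figure is yellow for disk space but green for components.
print(classify(75.0, DISK_SPACE_YELLOW_PCT))  # yellow
print(classify(75.0, COMPONENT_YELLOW_PCT))   # green
```

Both comparisons are strict, matching the ">" in the threshold list, so a value of exactly 90 % is still reported as yellow.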

Resolution

Q: What does the "Capacity utilization Health – What if the most consumed host fails" (Formerly "Limits Health – After one additional host failure") check do?

In addition to the basic limits health check, there is also a simulation of what resources would look like after an ESXi host failure. If a single ESXi host fails, two things happen. First, the resources on that ESXi host (such as cache and capacity) are no longer available. Second, vSAN attempts to re-protect (rebuild) all components belonging to objects that are now running with reduced redundancy because of the failure.

This health check simulates both of these effects. Assuming that the ESXi host with the most resources consumed fails, the check calculates how many resources would be used on the remaining hosts in the cluster and how many would still be available.

Note: If there is already a failure in the cluster, this test will report on one additional failure. Therefore, this test reports on the results of the current failure and the additional failure that it introduces.
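The simulation described above can be approximated with simple aggregate arithmetic. The sketch below is an assumption about the shape of the calculation, not the actual vSAN algorithm: the `Host` class, its fields, and the function name are hypothetical, "most consumed" is taken to mean the host with the largest used capacity, and all data on the failed host is assumed to be rebuilt on the remaining hosts.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str           # hypothetical host label
    capacity_gb: float  # raw capacity contributed by the host
    used_gb: float      # capacity currently consumed on the host


def simulate_most_consumed_host_failure(hosts):
    """Return (failed host name, post-failure utilization %, free GB).

    Assumes the host with the most used capacity fails, its capacity
    disappears, and all of its data is rebuilt on the remaining hosts.
    """
    if len(hosts) < 2:
        raise ValueError("need at least two hosts to simulate a failure")

    failed = max(hosts, key=lambda h: h.used_gb)
    # Capacity on the failed host is no longer available.
    remaining_capacity = sum(h.capacity_gb for h in hosts if h is not failed)
    # After re-protection, the same amount of data lives on fewer hosts.
    total_used = sum(h.used_gb for h in hosts)

    utilization_pct = 100.0 * total_used / remaining_capacity
    free_gb = remaining_capacity - total_used
    return failed.name, utilization_pct, free_gb


cluster = [
    Host("esx01", 1000.0, 600.0),
    Host("esx02", 1000.0, 500.0),
    Host("esx03", 1000.0, 400.0),
    Host("esx04", 1000.0, 300.0),
]
name, pct, free_gb = simulate_most_consumed_host_failure(cluster)
print(name, pct, free_gb)  # esx01 60.0 1200.0
```

In this example the cluster is 45 % full before the failure (1800 GB of 4000 GB) and 60 % full afterwards (1800 GB of 3000 GB). A result above 100 % would mean that the remaining hosts cannot hold all of the rebuilt data.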

In vSphere 6.7 Update 3 and later releases, the health check name is updated to "Capacity Utilization".

Q: What does it mean when it is in an error state?

If this check reports that more than 100% of resources would be used after a host failure, re-protection would fail for some objects because not enough resources are available.

Note: This health check simulation is very simple. It only looks at cluster aggregate resources, so just like the basic limits check, it does not consider the distribution and placement rules.

However, this simple simulation does verify that a vSAN cluster has been configured with enough resources to operate safely after a failure and the subsequent re-protection. The test does not check for balance or fault domains, so these need to be considered independently of this test.

For example, a user may enforce an operational business policy to have no less than 25% free disk space under normal conditions and no less than 15% free disk space after one failure. This check can be used to implement such a policy and to verify that this is indeed the case.
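A policy like the one in this example could be checked as follows. This is a minimal sketch with hypothetical names; the two utilization percentages are assumed to be read from the health check (current usage and simulated post-failure usage), and the default limits are the 25 % and 15 % figures from the example.

```python
def meets_free_space_policy(
    current_used_pct,
    post_failure_used_pct,
    min_free_normal_pct=25.0,
    min_free_after_failure_pct=15.0,
):
    """Return True if both free-space limits of the example policy hold."""
    free_normal_pct = 100.0 - current_used_pct
    free_after_failure_pct = 100.0 - post_failure_used_pct
    return (
        free_normal_pct >= min_free_normal_pct
        and free_after_failure_pct >= min_free_after_failure_pct
    )


# 45% used now, 60% used after a failure: 55% and 40% free, policy holds.
print(meets_free_space_policy(45.0, 60.0))  # True
# 74% used now, 88% used after a failure: only 12% free after the failure.
print(meets_free_space_policy(74.0, 88.0))  # False
```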

Q: How does one troubleshoot and fix the error state?

There is no troubleshooting involved in this health check; it is primarily informational. If the check fails, consider adding resources to the cluster so that a rebuild can succeed after a failure. If you believe the cluster should already have enough capacity to rebuild after a failure, check whether any components, such as disk drives, are in a failed state.

Monitor vSAN Capacity