vSAN Health Service - Capacity utilization

Products

VMware vSAN

Issue/Introduction

This article explains the Capacity Utilization Health – Storage space check in the vSAN Health Service and provides details on why it might report an error.

Note: In vCenter Server versions earlier than 7.0 U2, this health check was named "Capacity Utilization Health – Disk space".

Environment

VMware vSAN (All Versions)

Resolution

Q: What does the "Capacity utilization health – Storage space" (formerly "Disk space") check do?

This health check monitors cluster-level disk usage. It reports a warning or an error when raw capacity usage exceeds the corresponding threshold (see "Thresholds definition" below).

Q: What does it mean when it is in an error state?

For vSAN, if this check reports a warning, the cluster may not have enough capacity to repair after a failure. If it reports an error, the cluster may not be able to perform internal operations such as rebalancing, or user operations such as policy and cluster configuration changes. If utilization is extremely high, I/O operations from the workloads may fail.

For vSAN Direct, if this check reports a warning or error, there may not be enough free storage capacity, and the creation of CNS volumes may fail.

Reserved Capacity (8.x and newer)

In vSAN, Reserved Capacity refers to the storage space set aside for transient activities like rebalancing, maintenance tasks, and handling host failures to ensure system stability and performance. It consists of two parameters:

Operations Reserve (OR)
Host Rebuild Reserve (HRR)

Reserved Capacity can be enabled via toggles in the “Reservations and Alerts” section of the vSAN Capacity Overview.

Operations Reserve (OR)

The Operations Reserve is set aside for transient storage activities (e.g., policy changes, rebalancing). The percentage is determined by general assumptions about object size and by factors such as the raw size of capacity devices, the number of devices per host, the number of disk groups per host, and the use of cluster-based Deduplication & Compression (DD&C). As a result, the OR varies by hardware configuration and software services.

For example:

A cluster host using 4TB capacity devices with 8 capacity devices and 2 disk groups per host, and DD&C enabled may have an OR equal to 6% of total capacity.
A cluster host using 4TB capacity devices with 2 capacity devices and 1 disk group per host, and DD&C disabled may have an OR equal to 17% of total capacity.

To calculate Operations Reserve:

For a vSAN ESA cluster:
Operations Reserve = <number of hosts> * 765 GB + <number of capacity drives> * (MIN(<capacity drive size in GB> * 0.05, 100) + MIN(<capacity drive size in GB> * 0.0025, 100))
For a vSAN OSA cluster:
If the storage efficiency preference is "Compression Only" or "None", Operations Reserve = <number of hosts> * 765 GB + <number of capacity drives> * MIN(<capacity drive size in GB> * 0.05, 100)

If the storage efficiency preference is "Dedup + compression", Operations Reserve = <number of hosts> * 765 GB + <number of disk groups> * MIN(<disk group size in GB> * 0.05, 100)
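The three formulas above differ only in the per-unit term, so they can be sketched as one small function. This is a minimal illustration, not an official sizing tool: the function name, the `mode` values ("esa", "osa", "osa_dedup"), and the constant names are hypothetical, and all sizes are assumed to be in GB as in the formulas. For real sizing, the sizer tools mentioned in this article remain the reference.

```python
# Hypothetical helper that mirrors the Operations Reserve formulas in this article.
HOST_BASE_GB = 765      # fixed per-host term from the formulas
PER_UNIT_CAP_GB = 100   # upper bound used inside each MIN(...)


def operations_reserve_gb(num_hosts, num_units, unit_size_gb, mode):
    """Return the Operations Reserve in GB.

    mode "esa":       num_units = capacity drives, unit_size_gb = drive size
    mode "osa":       num_units = capacity drives, unit_size_gb = drive size
                      ("Compression Only" or "None")
    mode "osa_dedup": num_units = disk groups, unit_size_gb = disk group size
                      ("Dedup + compression")
    """
    if mode not in ("esa", "osa", "osa_dedup"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Term shared by all three formulas: MIN(size * 0.05, 100)
    per_unit = min(unit_size_gb * 0.05, PER_UNIT_CAP_GB)

    # ESA adds a second term: MIN(size * 0.0025, 100)
    if mode == "esa":
        per_unit += min(unit_size_gb * 0.0025, PER_UNIT_CAP_GB)

    return num_hosts * HOST_BASE_GB + num_units * per_unit


if __name__ == "__main__":
    # Illustrative inputs only: 4 hosts, 24 ESA capacity drives of 3200 GB each.
    print(operations_reserve_gb(4, 24, 3200, "esa"))
```

Note that in the "Dedup + compression" case the function expects the disk group count and disk group size, not per-drive values, because that formula is expressed per disk group.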

Host Rebuild Reserve (HRR)

The Host Rebuild Reserve (HRR) is reserved for recovery from a single host failure. Its calculation is fully independent of the Operations Reserve. The percentage reflects the capacity of one host relative to the cluster as a whole, based on the total host count, supporting an N+1 design strategy. As a result, HRR decreases as the number of hosts increases.

For example:

A 4-node cluster may have an HRR equal to 25% of total capacity.
A 12-node cluster may have an HRR equal to 8% of total capacity.
A 32-node cluster may have an HRR equal to 3% of total capacity.
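The example percentages above are consistent with reserving roughly one host's share of the cluster, that is 1/N for N hosts. The sketch below illustrates that approximation. It assumes all hosts contribute equal capacity, and the function name is hypothetical; the value vSAN actually reserves may differ, for example because of version-specific optimizations.

```python
def host_rebuild_reserve_fraction(num_hosts):
    """Approximate HRR as one host's share of an N-host cluster (1/N).

    Assumption: every host contributes the same capacity.
    """
    if num_hosts < 1:
        raise ValueError("num_hosts must be at least 1")
    return 1.0 / num_hosts


if __name__ == "__main__":
    # Cluster sizes taken from the examples in this article.
    for hosts in (4, 12, 32):
        percent = host_rebuild_reserve_fraction(hosts) * 100
        print(f"{hosts}-node cluster: about {percent:.0f}% of total capacity")
```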

To determine Reserved Capacity (Operations Reserve or Operations Reserve + Host Rebuild Reserve) for new clusters, use the vSAN ReadyNode Sizer or VxRail Sizer, as these tools account for vSAN version-specific optimizations and corner cases (e.g., capping reserved capacity at <30% for small clusters). Please refer to Understanding Reserved Capacity Concepts in vSAN for more information when calculating the Reserved Capacity.

Thresholds definition

On this page, the vSAN ops threshold and the host rebuild threshold are defined as follows:

vSAN ops threshold = Total space - Operations reserve
Host rebuild threshold = Total space - Operations reserve - Host rebuild reserve

Note: In some cases the Host rebuild threshold can be equal to the vSAN ops threshold, such as in VMC deployments, because the feature of host rebuild reserve is not needed for VMC deployments.

The health check thresholds are defined as follows:

When host rebuild reserve is disabled:

Red: MIN(90% of total capacity, vSAN ops threshold)
Yellow: MIN(70% of total capacity, Host rebuild threshold)

When host rebuild reserve is enabled:

Red: MIN(90% of total capacity, Host rebuild threshold)
Yellow: MIN(70% of total capacity, 80% of Host rebuild threshold)
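The threshold rules can be sketched in a few lines to show how the two cases differ. This is an illustrative sketch under stated assumptions, not the health service's implementation: the function names are hypothetical, all values are assumed to share one unit (for example GB), the host rebuild reserve is passed in exactly as defined on this page, and the strict "greater than" comparison used to classify usage is an assumption. Custom thresholds are not modeled.

```python
def capacity_thresholds(total, operations_reserve, host_rebuild_reserve, hrr_enabled):
    """Return (yellow, red) thresholds following the rules in this article."""
    ops_threshold = total - operations_reserve
    rebuild_threshold = total - operations_reserve - host_rebuild_reserve

    if hrr_enabled:
        red = min(0.90 * total, rebuild_threshold)
        yellow = min(0.70 * total, 0.80 * rebuild_threshold)
    else:
        red = min(0.90 * total, ops_threshold)
        yellow = min(0.70 * total, rebuild_threshold)
    return yellow, red


def health_status(used, yellow, red):
    """Classify usage; the strict '>' comparison is an assumption."""
    if used > red:
        return "red"
    if used > yellow:
        return "yellow"
    return "green"


if __name__ == "__main__":
    # Illustrative numbers only: 100,000 GB total, 10,000 GB OR, 25,000 GB HRR.
    yellow, red = capacity_thresholds(100_000, 10_000, 25_000, hrr_enabled=True)
    print(yellow, red, health_status(60_000, yellow, red))
```

With the host rebuild reserve enabled, both thresholds are derived from the host rebuild threshold, so the check turns yellow and red earlier than it does for the same cluster with the reserve disabled.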

Note: When a custom threshold is configured, it takes precedence over the default vSAN capacity health threshold, and the user-defined value is always honored. For more details, refer to "Configure Reserved Capacity for vSAN Cluster".

Q: How does one troubleshoot and fix the error state?

The first step is to ensure that all the storage is valid, that no capacity devices are missing, and that the vSAN datastore capacity is what you expect it to be. There are 3 options for recovering from a full vSAN datastore, as detailed below:

Download VM(s) to local workstation to free up space on vSAN datastore:
1. Identify the full disk/disk group.
2. Identify VM(s) residing on the full disk group.
  1. The VM must be healthy. If it is orphaned or inaccessible, select a different VM to free up space.
  2. VM must have namespace reservation usage of less than 10MB.
3. Power off the VM(s). For VMs with a pending question, answer 'cancel'.
  1. There may be multiple questions - all must be answered.
  2. If hostd is non-responsive/hung, this method may not work.
4. Download the VM to your local workstation.
  1. Verify that the download is complete and contains all VM data.
5. Delete the VM on vSAN.
Power-off VMs to free up resources on vSAN:
1. Gracefully power off VM(s). This will delete the swap files associated with those guests and free up some space.
2. Once enough VMs have been powered off, use Storage vMotion (svMotion) to migrate VMs to alternate storage.
3. If alternate storage is not available, delete unnecessary guests to free up additional space.
Expand vSAN Direct datastore capacity.
If vSAN servers have available slots for additional drives and disks are available, add disks to the vSAN Direct storage to expand capacity. This is typically the easiest option if the appropriate resources are available.

Caveats

vSAN has limited functionality when full.
- When disk groups in the cluster are full, VMs that reside on those disk groups will experience failures for any operations that require space, such as storage vMotion. Other operations may take significantly more time to complete than they would in a healthy cluster.
Recovered space may not be reflected in the UI.
- When using any of the recovery methods above that delete swap files, manually confirm via the CLI that the space has been recovered.