Resolving Storage Capacity Issues in a vSAN Datastore / VMs are not running despite free space on vSAN Datastore


Article ID: 372112


Products

VMware vSAN

Issue/Introduction

It is important to monitor the vSAN cluster to ensure it does not become too full. Health Service alerts are in place to help track capacity utilization. Please see: vSAN Health Service - Capacity utilization - Disk space and vSAN Health Service - Physical Disk Health - Disk Capacity for more information.

If a vSAN datastore becomes too full, resyncs can stall and management tasks can time out or become stuck.

Other symptoms can include:

VMs not running despite free space being reported on the vSAN datastore.

  • The warning 'There is no more space for virtual disk' appears on the VM Summary tab in the vSphere Client
  • Objects fail to create
  • Thin-provisioned VMs cannot extend their VMDKs due to lack of space
  • Hosts fail to enter Maintenance Mode when using either "Ensure Accessibility" or "Full Data Migration"

vSAN needs free space for internal operations, called slack space, and it is recommended not to exceed 80% utilization on a vSAN datastore. If the datastore becomes full, there are a few methods that can be used to clear space so that the necessary management tasks can be performed.

vSAN Free Capacity recommendations: Understanding reserved capacity concepts in vSAN

Environment

VMware vSAN (All Versions)

Cause

vSAN Datastore becoming too full

Resolution

The only recommended solutions are the following:

  • Add capacity to the datastore by adding more capacity disks or by adding additional nodes with disk groups to the cluster
  • Clear space by deleting unused or unnecessary data
  • Power off non-critical virtual machines

If none of these is possible, the following methods can clear enough space for management tasks such as adding more disks.

  1. It is possible that unassociated objects are taking up space and can be removed. Unassociated means the object is not attached to a currently registered VM; it does NOT mean the object is not in use.

    See the following KB for steps on identifying unassociated objects: Procedures for Identifying unassociated objects 

  2. Convert some RAID-1 objects to RAID-0. NOTE: It is not recommended to edit the vSAN Default Storage Policy or to change existing policies. Instead, create new policies and apply them to a few VMs at a time.

    Because vSAN keeps identical mirrors for RAID-1 objects, temporarily changing a large vmdk object to RAID-0 can free a lot of space by removing one of those mirrors. The object has no redundancy while it is RAID-0, so this is only recommended as a temporary measure while the underlying problem is addressed. This method does not work for RAID-5 or RAID-6 objects: converting an erasure-coded object to RAID-0 requires building a completely new object, which would only make the problem worse.

    This can also be used for VMs whose applications provide fault tolerance by nature, such as primary/secondary pairs where one VM takes over if the other fails.

    Please see: How vSAN handles Policy Changes between RAID1 Mirroring and RAID 5/6 Storage Policies. 
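    As a back-of-the-envelope illustration of why step 2 frees space, the arithmetic below compares the raw capacity a vmdk consumes under RAID-1 mirroring versus RAID-0. This is a minimal sketch; the 500 GB size is a hypothetical example, not from the article.

```shell
# Illustrative arithmetic only: raw capacity consumed by a hypothetical
# 500 GB vmdk under RAID-1 (FTT=1 mirroring) versus RAID-0.
vmdk_gb=500
raid1_gb=$((vmdk_gb * 2))   # RAID-1 keeps two full replicas of the object
raid0_gb=$vmdk_gb           # RAID-0 keeps a single copy, no redundancy
freed_gb=$((raid1_gb - raid0_gb))
echo "RAID-1: ${raid1_gb} GB, RAID-0: ${raid0_gb} GB, freed: ${freed_gb} GB"
```

    Removing the mirror reclaims the full size of the vmdk, which is why converting even one large object can unblock stuck management tasks.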
  3. Verify the storage policy in use. In rare cases, objects previously migrated to the vSAN datastore may have the storage rule 'proportionalCapacity = 100' (thick provisioning) incorrectly assigned.

    To identify such objects, run the following commands:

    cmmds-tool find -f python | grep 'proportionalCapacity\\\": 100' -B9 | grep uuid | cut -d "\"" -f4 >> /tmp/uuidlist.txt

    (Note: this command creates a file in /tmp/uuidlist.txt with all the objects with 'proportionalCapacity = 100' rule.)

    for i in $(cat /tmp/uuidlist.txt); do echo "*********************";echo; /usr/lib/vmware/osfs/bin/objtool getAttr -u $i |grep -i -E '^UUID|Object path'; echo; done 

    (Note: this command outputs 'UUID <-> path' pairs based on the previously created /tmp/uuidlist.txt file.)

    Based on the friendly names ('Object path') in the output, determine good candidates (UUIDs) for conversion to thin provisioning. Once identified, (re)apply the storage policy in the UI by assigning a policy with the same characteristics the object already has (Failures to Tolerate, etc.) but with the 'proportionalCapacity' rule set to '0'.
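    The filtering logic of the first command can be sanity-checked off-host against canned lines that mimic the escaped-JSON form of `cmmds-tool find -f python` output. This is a sketch under that assumption; the UUIDs and file layout below are made up for illustration.

```shell
# Canned sample mimicking cmmds-tool find -f python output (UUIDs made up).
cat > /tmp/sample_cmmds.txt <<'EOF'
{
   "uuid": "aaaaaaaa-0000-0000-0000-000000000001",
   "content": "{\"proportionalCapacity\": 100}"
}
{
   "uuid": "bbbbbbbb-0000-0000-0000-000000000002",
   "content": "{\"proportionalCapacity\": 0}"
}
EOF
# Same grep/cut pattern as the production command: the escaped pattern
# matches the literal \"proportionalCapacity\": 100 inside the content
# string, -B9 pulls in the uuid line above it, and cut extracts the UUID.
thick_uuids=$(grep 'proportionalCapacity\\": 100' -B9 /tmp/sample_cmmds.txt | grep uuid | cut -d '"' -f4)
echo "$thick_uuids"
```

    Only the first (thick, proportionalCapacity = 100) object is listed; the thin object is filtered out.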


  4. Convert RAID-1 objects to RAID-5. Create a vSAN storage policy configured with thin provisioning and RAID-5 erasure coding, and apply it to the smaller VMs a few at a time. Once some space is reclaimed, incrementally convert the mid-sized VMs one or two at a time. Once enough space has been reclaimed, the extremely large VMs can be converted to RAID-5 one at a time. RAID-6 (FTT=2) can also be used, but RAID-5 will save more space. While RAID-6 offers a fault tolerance of 2, it incurs a write performance hit from having to write two extra components versus RAID-5. RAID-5 uses 1.33x space, RAID-6 uses 1.5x space, and RAID-1 uses 2.0x space, so RAID-5 allows the largest space savings while still maintaining the performance most VMs require.

    Note: Options 3 and 4 require some free space on the vSAN datastore; they will not work if the datastore is completely full.
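The overhead factors quoted in step 4 can be checked with simple arithmetic. The sketch below computes the raw capacity consumed by a hypothetical 1000 GB of VM data under each policy; the data size is illustrative, not from the article.

```shell
# Illustrative arithmetic: raw capacity consumed by 1000 GB of VM data
# under each policy, using the overhead factors quoted in step 4.
data_gb=1000
raid1_gb=$(awk -v d="$data_gb" 'BEGIN { printf "%.0f", d * 2.0  }')
raid6_gb=$(awk -v d="$data_gb" 'BEGIN { printf "%.0f", d * 1.5  }')
raid5_gb=$(awk -v d="$data_gb" 'BEGIN { printf "%.0f", d * 1.33 }')
echo "RAID-1: ${raid1_gb} GB, RAID-6: ${raid6_gb} GB, RAID-5: ${raid5_gb} GB"
```

Converting the same data from RAID-1 to RAID-5 in this example reclaims 670 GB of raw capacity, versus 500 GB for RAID-6, which is why RAID-5 is the preferred target when space is the constraint.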