How to deal with a full vSAN Datastore

Products

VMware vSAN

Issue/Introduction

It's important to monitor the vSAN cluster to ensure it doesn't become too full. There are Skyline Health alerts in place to help keep track of capacity utilization. Please see: vSAN Health Service - Capacity utilization - Disk space and vSAN Health Service - Physical Disk Health - Disk Capacity for more information.

If a vSAN datastore becomes too full, it can cause issues such as stuck resyncs and management tasks that time out or hang.

Other symptoms can include:
VMs fail to power on or go offline due to lack of space
Objects fail to create
Thin-provisioned VMs cannot extend their VMDKs due to lack of space

vSAN needs free space for its own operations, called slack space, and it is recommended not to exceed 80% utilization on a vSAN datastore. If the datastore becomes full, there are a few methods for freeing enough space to perform the necessary management tasks.

More information on slack space: vSAN Operations: Maintain Slack Space for Storage Policy Changes

vSAN Free Capacity recommendations: Revisiting vSAN’s Free Capacity Recommendations
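The 80% guideline above can be expressed as a simple check. This is a minimal sketch, not a VMware tool: the function names vsan_utilization_pct and check_vsan_capacity are hypothetical, and the used and total capacity values are assumed to be read manually from the vSAN capacity view and passed in the same unit.

```shell
# Hypothetical helper: integer percentage of used capacity.
# Arguments: used and total capacity in the same unit (for example GB).
vsan_utilization_pct() {
    used=$1
    total=$2
    if [ "$total" -le 0 ]; then
        echo "total capacity must be greater than zero" >&2
        return 1
    fi
    echo $(( used * 100 / total ))
}

# Hypothetical helper: warns when utilization is above the threshold.
# The third argument is optional and defaults to the 80% guideline.
check_vsan_capacity() {
    pct=$(vsan_utilization_pct "$1" "$2") || return 1
    threshold=${3:-80}
    if [ "$pct" -gt "$threshold" ]; then
        echo "WARNING: ${pct}% used, above ${threshold}% threshold"
    else
        echo "OK: ${pct}% used"
    fi
}

# Example values only: 8500 GB used out of 10000 GB total.
check_vsan_capacity 8500 10000
```

Integer division rounds down, so a datastore at 80.9% is reported as 80% and does not trigger the warning. Lower the threshold if an earlier warning is preferred.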

Environment

VMware vSAN 7.0.x

VMware vSAN 8.0.x

Cause

The vSAN datastore reaches full capacity utilization.

Resolution

The only recommended solutions are the following:

Add capacity to the datastore by adding more capacity disks, or by adding nodes with disk groups to the cluster.
Free up space by deleting unused or unnecessary data.
Power off non-critical virtual machines.

If none of these is possible, the following methods can be used to free enough space for management tasks such as adding more disks.

There may be unassociated objects that are taking up space and can be removed. Unassociated means the object is not attached to a currently registered VM. It does NOT mean the object is not in use.

See the following KB for steps on identifying unassociated objects: Procedures for Identifying unassociated objects

Convert some RAID 1 objects to RAID 0. NOTE: Editing the vSAN Default Storage Policy or changing existing policies is not recommended. Instead, create new policies and apply them to a few VMs at a time.

Because of how vSAN handles storage policy changes with RAID 1 (identical mirrors), temporarily changing a large VMDK object to RAID 0 can free a lot of space by removing one of those mirrors. The object has no redundancy while it is RAID 0, so this is only recommended as a temporary measure while the underlying capacity problem is addressed. This method does not work with RAID 5 or RAID 6: converting an object from erasure coding to RAID 0 requires building a completely new object, which would only make the problem worse.
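The space gained from such a conversion can be estimated with simple arithmetic. The sketch below assumes a RAID 1 object stores one full copy of its data per mirror (two copies for Failures to Tolerate = 1) and ignores witness components and metadata overhead. The function name raid1_to_raid0_freed is hypothetical.

```shell
# Hypothetical helper: approximate capacity freed by moving one object
# from RAID 1 to RAID 0.
# $1: data size of the object (any unit)
# $2: number of mirror copies under RAID 1 (optional, defaults to 2)
raid1_to_raid0_freed() {
    size=$1
    copies=${2:-2}
    # RAID 1 consumes size * copies; RAID 0 consumes size once.
    echo $(( size * copies - size ))
}

# Example values only: a 500 GB object with two mirrors.
raid1_to_raid0_freed 500
```

With two mirrors the estimate equals the size of the object itself, which is why large VMDKs are the most effective candidates.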

Please see: How vSAN handles Policy Changes between RAID1 Mirroring and RAID 5/6 Storage Policies.

Verify the storage policy in use. In rare cases, some objects previously migrated to the vSAN datastore may have the storage rule 'proportionalCapacity = 100' (thick) incorrectly assigned.

To identify such objects, run the following commands:

cmmds-tool find -f python | grep 'proportionalCapacity\\\": 100' -B9 | grep uuid | cut -d "\"" -f4 >> /tmp/uuidlist.txt

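(Note: this command writes the UUIDs of all objects with the 'proportionalCapacity = 100' rule to /tmp/uuidlist.txt. Because it uses '>>', the output is appended if the file already exists, so remove any old copy of the file before re-running.)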

for i in $(cat /tmp/uuidlist.txt); do echo "*********************";echo; /usr/lib/vmware/osfs/bin/objtool getAttr -u $i |grep -i -E '^UUID|Object path'; echo; done

(Note: this command outputs 'UUID <-> path' pairs for each entry in the previously created /tmp/uuidlist.txt file.)

Use the friendly names ('Object path') in the output to determine which objects (by UUID) are good candidates for conversion to thin. The selected objects can then be converted in one of two ways:

1. (Re)apply a storage policy in the UI: assign a policy with the same characteristics the object(s) already have (Failures to Tolerate, etc.) and with the 'proportionalCapacity' rule set to '0'.

2. Convert objects one by one using the CLI. Example command (can be run on any ESXi host):

/usr/lib/vmware/osfs/bin/objtool setPolicy -u <uuid> -p "((\"proportionalCapacity\" i0))"
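When several objects have been selected, the command above can be repeated for each UUID in a reviewed list. The following is a minimal sketch rather than a supported procedure: the function name convert_to_thin, the DRY_RUN variable, and the file /tmp/thinlist.txt are hypothetical, and the list is assumed to contain one reviewed UUID per line, each ending with a newline. By default the function only prints the commands so they can be checked before anything is changed.

```shell
# Path and policy string are taken from the example command above.
OBJTOOL=/usr/lib/vmware/osfs/bin/objtool
POLICY='(("proportionalCapacity" i0))'

# Hypothetical helper: processes the UUIDs in a file one at a time.
# $1: file with one UUID per line.
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
convert_to_thin() {
    list=$1
    while IFS= read -r uuid; do
        [ -z "$uuid" ] && continue   # skip blank lines
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$OBJTOOL setPolicy -u $uuid -p '$POLICY'"
        else
            "$OBJTOOL" setPolicy -u "$uuid" -p "$POLICY"
        fi
    done < "$list"
}

# Usage (hypothetical file name):
#   convert_to_thin /tmp/thinlist.txt             # print only
#   DRY_RUN=0 convert_to_thin /tmp/thinlist.txt   # apply the policy
```

Keeping the candidate list separate from /tmp/uuidlist.txt means only objects that were reviewed by name are converted, rather than every object that matched the search.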