Managing Stretched Cluster Fault Domains space utilization

Article ID: 387330

Products

VMware vSAN

Issue/Introduction

Monitor the space utilization of each fault domain in a stretched cluster. Without per-site monitoring, one fault domain can run low on space, or run out entirely, while the other fault domain is barely used.

Possible issues:

  • VMs fail to power on with error: There is no more space for virtual disk
  • VM deployment or creation fails, reporting not enough space on the vSAN datastore

These failures occur even though the overall vSAN datastore shows plenty of free space.

Environment

VMware vSAN

Cause

These failures occur when one site is used more heavily than the other, so that site's fault domain fills up while the other site still has free capacity.
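The arithmetic behind this symptom can be sketched as follows. The capacity figures and the pct_used helper are hypothetical and exist only to illustrate how a datastore-wide percentage can hide a nearly full fault domain; substitute the values reported for your own cluster.

```shell
#!/bin/sh
# pct_used USED TOTAL: print the integer percentage of TOTAL that is in use.
pct_used() {
    # Integer arithmetic only, so this runs in a plain POSIX shell.
    echo $(( $1 * 100 / $2 ))
}

# Assumed example values in GB (not taken from any real cluster).
preferred_used=9500
preferred_total=10000
secondary_used=2500
secondary_total=10000

echo "Preferred site:  $(pct_used "$preferred_used" "$preferred_total")% used"
echo "Secondary site:  $(pct_used "$secondary_used" "$secondary_total")% used"
echo "Whole datastore: $(pct_used $((preferred_used + secondary_used)) $((preferred_total + secondary_total)))% used"
```

With these assumed numbers the datastore as a whole reports 60% used, while the preferred site is at 95% and cannot accept new objects whose policy pins them to that site.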

To see all the storage policies in use and the locality set on each policy, run the following command:
esxcli vsan debug object list --all | grep -E 'spbmProfileName|locality' | less

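To see how many objects use each storage policy, run the following command (the ': ' field separator keeps policy names that contain spaces intact):
esxcli vsan debug object list --all | grep spbmProfileName | awk -F': ' '{print $2}' | sort | uniq -c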

To see how many objects are set to each fault domain, run the following command:
esxcli vsan debug object list --all | grep locality | awk '{print $2}' | sort | uniq -c
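When the object list is large, it may be more convenient to save the output once and summarize it offline. The following is a minimal sketch, assuming the saved output consists of indented "key: value" lines as the commands above imply. The count_field function name and the /tmp/objects.txt path are illustrative, not part of any VMware tooling.

```shell
#!/bin/sh
# count_field KEY: read "key: value" lines on stdin and print how many times
# each distinct value of KEY occurs, most frequent first.
count_field() {
    grep -E "^[[:space:]]*$1:" |           # keep only lines for the requested key
        sed 's/^[^:]*:[[:space:]]*//' |    # strip the key, keep the full value (spaces included)
        sort | uniq -c | sort -rn
}

# Example usage on a host (hypothetical path):
#   esxcli vsan debug object list --all > /tmp/objects.txt
#   count_field spbmProfileName < /tmp/objects.txt
#   count_field locality < /tmp/objects.txt
```

Saving the output first avoids running the object listing against the cluster once per question.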

Resolution

If one site is full, follow one of the options below:

  1. If possible, delete anything that is no longer needed from the datastore and resides in the full fault domain.
  2. If the hosts have empty disk bays, add more disks.
  3. If the hosts don't have any empty disk bays, add more hosts to the cluster.
    Note: When adding disks or hosts to a cluster, it is preferable to keep both sites homogeneous, with the same number of hosts and disks at each site.
  4. Storage vMotion VMs from the full site off the vSAN datastore to alternate storage.
  5. Follow the KB article "Procedures for identifying Unassociated vSAN objects" to see whether any space can be freed by deleting unassociated objects that are no longer needed.

If one site is low on space but not completely full, follow one of the options below:

  1. If possible, delete anything that is no longer needed from the datastore and resides in the fault domain that is low on space.
  2. If the hosts have empty disk bays, add more disks.
  3. If the hosts don't have any empty disk bays, add more hosts to the cluster.
    Note: When adding disks or hosts to a cluster, it is preferable to keep both sites homogeneous, with the same number of hosts and disks at each site.
  4. Storage vMotion VMs from the site that is low on space off the vSAN datastore to alternate storage.
  5. Follow the KB article "Procedures for identifying Unassociated vSAN objects" to see whether any space can be freed by deleting unassociated objects that are no longer needed.
  6. If possible, change the policy on objects whose locality is set to the fault domain that is low on space to a policy whose locality is the other fault domain.
    Note: If the VMs stay on hosts in the fault domain that is low on space after the policy has been changed to set the locality to the other fault domain, VM performance may suffer because reads and writes have to traverse the ISC (inter-site connection). Once the policy is changed, it is best to vMotion the VM to the other site so that its compute resources reside with its storage resources.
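To find candidates for the policy change in option 6, the object list can be filtered by locality. The following is a rough sketch under two assumptions: each object in the saved output starts with an unindented "Object UUID: <uuid>" line, and its policy contains a "locality: <value>" line. The objects_with_locality function name, the /tmp/objects.txt path, and the example value "Preferred" are illustrative; use the locality values that your own cluster reports.

```shell
#!/bin/sh
# objects_with_locality VALUE: read saved object-list output on stdin and print
# the UUID of every object whose policy contains "locality: VALUE".
objects_with_locality() {
    awk -v want="$1" '
        /^Object UUID:/ { uuid = $3 }                    # remember the current object
        $1 == "locality:" && $2 == want { print uuid }   # emit it when the locality matches
    '
}

# Example usage on a host (hypothetical path and locality value):
#   esxcli vsan debug object list --all > /tmp/objects.txt
#   objects_with_locality Preferred < /tmp/objects.txt
```

The resulting UUIDs only identify which objects are pinned to the site in question. Deciding which of them can safely move to the other fault domain remains a manual step.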