vSAN Health Service - Cluster health - vSAN optimal datastore default policy configuration

Products

VMware vSAN

Issue/Introduction

This article explains the Cluster health - vSAN optimal datastore default policy configuration check in the vSAN Health Service, why it might report a warning or error, and how to fix that state.

Environment

VMware vSAN 8.0 U1 and higher

Resolution

Q: What does the Cluster health - vSAN optimal datastore default policy configuration check do?

This health test checks whether the cluster's current datastore default policy is optimal. The optimal policy for each cluster type and size is listed in the table below.
Note: EMM = Enter Maintenance Mode, HFTT = Host Failures to Tolerate, SFTT = Site Failures to Tolerate

Type	Number of Nodes	Recommended FTT	Recommended FTT with node reservation	Details	Host EMM and Remove Operation Impact
Standard cluster	3	HFTT=1 failure - RAID-1 (Mirroring); SFTT=None - standard cluster	N/A	Use existing Default vSAN policy	Keep the current behavior.
	4	HFTT=1 failure - RAID-5 (Erasure Coding); SFTT=None - standard cluster	HFTT=1 failure - RAID-1 (Mirroring); SFTT=None - standard cluster	Create new RAID-5 policy	User can put one host in EMM using Ensure accessibility. Cannot remove a node from the cluster with full data evacuation.
	5	HFTT=1 failure - RAID-5 (Erasure Coding); SFTT=None - standard cluster	HFTT=1 failure - RAID-5 (Erasure Coding); SFTT=None - standard cluster	Create new RAID-5 policy	User can put one host in EMM using Ensure accessibility. Cannot remove a node from the cluster with full data evacuation.
	6	HFTT=2 failures - RAID-6 (Erasure Coding); SFTT=None - standard cluster	HFTT=1 failure - RAID-5 (Erasure Coding); SFTT=None - standard cluster	Create new RAID-6 policy	User can put one host in EMM using Ensure accessibility. Cannot remove a node from the cluster with full data evacuation.
	7 and more	HFTT=2 failures - RAID-6 (Erasure Coding); SFTT=None - standard cluster	HFTT=2 failures - RAID-6 (Erasure Coding); SFTT=None - standard cluster	Create new RAID-6 policy	For 7 nodes: User can put two hosts in EMM using Ensure accessibility. Can remove one node from the cluster with full data evacuation.
Stretched cluster	If nodes on each side <= 2	HFTT=No data redundancy; SFTT=Site mirroring - stretched cluster (to tolerate n failures, 2n+1 hosts are needed in each cluster site)	N/A	Create new vSAN ESA stretched cluster policy	Existing behavior.
	If nodes on each side == 3	HFTT=1 failure - RAID-1 (Mirroring); SFTT=Site mirroring - stretched cluster	N/A	Create new vSAN ESA stretched cluster policy	Existing behavior.
	If nodes on each side >= 4 and <= 5	HFTT=1 failure - RAID-5 (Erasure Coding); SFTT=Site mirroring - stretched cluster	N/A	Create new vSAN ESA stretched cluster RAID-5 policy	User can put one host in EMM using Ensure accessibility. Cannot remove a node from the cluster with full data evacuation.
	If nodes on each side >= 6	HFTT=2 failures - RAID-6 (Erasure Coding); SFTT=Site mirroring - stretched cluster	N/A	Create new vSAN ESA stretched cluster RAID-6 policy	For 6 nodes: User can put one host in EMM using Ensure accessibility. Cannot remove a node from the cluster with full data evacuation. For 7 nodes: User can put two hosts in EMM using Ensure accessibility. Can remove one node from the cluster with full data evacuation.
2-node Stretch	2, Fixed configuration	HFTT=No data redundancy; SFTT=Site mirroring - stretched cluster	N/A	Use existing Default vSAN policy	Existing behavior.
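
The table can also be read as a lookup from cluster type and node count to a recommended policy. The following Python sketch only encodes the rows above for illustration. The function and constant names are hypothetical and are not part of any vSAN API or SDK.

```python
from typing import NamedTuple, Optional

# Rule values copied from the table above.
RAID1 = "1 failure - RAID-1 (Mirroring)"
RAID5 = "1 failure - RAID-5 (Erasure Coding)"
RAID6 = "2 failures - RAID-6 (Erasure Coding)"
NO_REDUNDANCY = "No data redundancy"
STANDARD = "None - standard cluster"
STRETCHED = "Site mirroring - stretched cluster"


class Recommendation(NamedTuple):
    hftt: str  # Host Failures to Tolerate rule value
    sftt: str  # Site Failures to Tolerate rule value


def recommend_standard(nodes: int, node_reservation: bool = False) -> Optional[Recommendation]:
    """Recommended policy for a standard cluster (hypothetical helper).

    Returns None where the table lists N/A (3 nodes with node reservation).
    """
    if nodes < 3:
        raise ValueError("the table starts at 3 nodes for standard clusters")
    if node_reservation:
        if nodes == 3:
            return None  # N/A in the table
        hftt = RAID1 if nodes == 4 else RAID5 if nodes <= 6 else RAID6
    else:
        hftt = RAID1 if nodes == 3 else RAID5 if nodes <= 5 else RAID6
    return Recommendation(hftt, STANDARD)


def recommend_stretched(nodes_per_site: int) -> Recommendation:
    """Recommended policy for a stretched cluster (hypothetical helper).

    A 2-node stretched cluster has one data node per site and falls
    into the first branch, matching the last table row.
    """
    if nodes_per_site < 1:
        raise ValueError("each site needs at least one node")
    if nodes_per_site <= 2:
        hftt = NO_REDUNDANCY
    elif nodes_per_site == 3:
        hftt = RAID1
    elif nodes_per_site <= 5:
        hftt = RAID5
    else:
        hftt = RAID6
    return Recommendation(hftt, STRETCHED)


if __name__ == "__main__":
    print(recommend_standard(6))
    print(recommend_standard(6, node_reservation=True))
    print(recommend_stretched(4))
```

In this sketch, a 6-node standard cluster maps to RAID-6 without node reservation and to RAID-5 with it, as in the table.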

Note: When using Host mirroring - 2 node cluster, SFTT = 1 and HFTT = 1, which requires a minimum of 3 disk groups per data host or 3 disks in a storage pool.

Note: vCenter equivalent options for Standard Clusters

HFTT = 0 - FTT = No data redundancy, No data redundancy with host affinity
HFTT = 1 - FTT = 1 failure - RAID-1 (Mirroring), 1 failure - RAID-5 (Erasure Coding)
HFTT = 2 - FTT = 2 failures - RAID-1 (Mirroring), 2 failures - RAID-6 (Erasure Coding)
HFTT = 3 - FTT = 3 failures - RAID-1 (Mirroring)
Site disaster tolerance = None - standard cluster

Note: vCenter equivalent options for Stretched Clusters

SFTT = 1 - Site disaster tolerance = Host mirroring - 2 node cluster, Site mirroring - stretched cluster
HFTT = 0 - FTT = No data redundancy, No data redundancy with host affinity
HFTT = 1 - FTT = 1 failure - RAID-1 (Mirroring), 1 failure - RAID-5 (Erasure Coding)
HFTT = 2 - FTT = 2 failures - RAID-1 (Mirroring), 2 failures - RAID-6 (Erasure Coding)
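
The two lists above can be summarized as a small mapping from the HFTT/SFTT shorthand to the option labels shown in vCenter. This is an illustrative Python sketch only. The dictionary and function names are hypothetical, and the labels are copied from the lists above.

```python
# Option labels per HFTT value, as listed above for standard clusters.
FTT_OPTIONS = {
    0: ["No data redundancy", "No data redundancy with host affinity"],
    1: ["1 failure - RAID-1 (Mirroring)", "1 failure - RAID-5 (Erasure Coding)"],
    2: ["2 failures - RAID-1 (Mirroring)", "2 failures - RAID-6 (Erasure Coding)"],
    3: ["3 failures - RAID-1 (Mirroring)"],
}

# Site disaster tolerance labels per SFTT value.
SITE_OPTIONS = {
    0: ["None - standard cluster"],
    1: ["Host mirroring - 2 node cluster", "Site mirroring - stretched cluster"],
}


def vcenter_options(hftt: int, sftt: int = 0) -> dict:
    """Return the vCenter labels that correspond to an HFTT/SFTT pair."""
    if sftt not in SITE_OPTIONS:
        raise ValueError(f"SFTT = {sftt} is not listed")
    # The stretched-cluster list above only covers HFTT = 0 to 2.
    max_hftt = 2 if sftt == 1 else 3
    if not 0 <= hftt <= max_hftt:
        raise ValueError(f"HFTT = {hftt} is not listed for SFTT = {sftt}")
    return {
        "Failures to tolerate": FTT_OPTIONS[hftt],
        "Site disaster tolerance": SITE_OPTIONS[sftt],
    }


if __name__ == "__main__":
    print(vcenter_options(hftt=2, sftt=1))
```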

Q: What does it mean when it is in a warning state?

A warning state means that the cluster's current datastore default policy is not optimal. The health test table has five columns: policy name | rule name | current value | suggested value | status. It has two rows: the first row is for the "Failure to tolerate" rule and the second row is for the "Site disaster tolerance" rule. A warning status on a row means that the rule's current value does not match its suggested value.
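
The per-row comparison described above can be sketched as follows. This is a minimal Python illustration, not the health service's implementation. The class name, field names, and the "healthy"/"warning" status strings are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RuleRow:
    """One row of the health test table (hypothetical representation)."""
    policy_name: str
    rule_name: str
    current_value: str
    suggested_value: str

    @property
    def status(self) -> str:
        # A row is in a warning state when current and suggested values differ.
        return "healthy" if self.current_value == self.suggested_value else "warning"


def overall_state(rows: Iterable[RuleRow]) -> str:
    """The check warns if any row's current value differs from the suggestion."""
    return "warning" if any(row.status == "warning" for row in rows) else "healthy"


if __name__ == "__main__":
    # Example values are illustrative: a 6-node standard cluster still on RAID-5.
    rows = [
        RuleRow("Example policy", "Failure to tolerate",
                "1 failure - RAID-5 (Erasure Coding)",
                "2 failures - RAID-6 (Erasure Coding)"),
        RuleRow("Example policy", "Site disaster tolerance",
                "None - standard cluster",
                "None - standard cluster"),
    ]
    for row in rows:
        print(row.rule_name, "->", row.status)
    print("overall:", overall_state(rows))
```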

Q: How does one troubleshoot and fix the error state?

Go to "Policies and Profiles", select "VM Storage Policies", and click the policy name shown in the health test table. Then edit the "Failure to tolerate" rule or the "Site disaster tolerance" rule so that it matches the suggested value shown in the health test table.