vSAN throws out-of-space errors when creating or reconfiguring VMs during an upgrade from 6.5.x to 6.7U3

Products

VMware vSphere ESXi VMware vSAN

Issue/Introduction

The issue is caused by nodes running different ESXi versions interpreting the capacity-used statistics differently.
It occurs only during an upgrade from ESXi 6.5.x to 6.7U3, or whenever any host in the cluster is at ESXi 6.7U3.
The log messages show higher disk usage on upgraded hosts and lower disk usage on non-upgraded hosts.
The following log messages are seen in the /var/run/log/clomd.log file:

Example:
- Upgraded node - indicating that the disks are 100% full
  
  YYYY-MM-DDThh:mm:ss(114079209216)(opID:0)CLOMMemoryTest: Cluster usages - usableDataMDs:46 hFull[P]:1.000 meanFull[P]:0.845] lFull[P]:0.177] hFull[S]:0.000] meanFull[S]:0.000] lFull[S]:0.000] capTotal:74569760268288 capUsed:63017823309446 ssdCapTotal:7201504601088 ssdCapUsed:0
- Non-upgraded node - indicating disks 47% full:
  
  YYYY-MM-DDThh:mm:ss(171097454880)(opID:0)CLOMMemoryTest: Cluster usages - usableDataMDs:46 highestFullness:0.472 meanFullness:0.407 lowestFullness:0.176 capTotal:74569760268288 capUsed:30323956320563 ssdCapTotal:14680064 ssdCapUsed:0
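
The two sample lines can be compared programmatically. The following is a minimal sketch, not a VMware tool: it assumes the "Cluster usages" line format shown above, and the helper names parse_usage and used_ratio are hypothetical. It extracts the key:value pairs and computes capUsed / capTotal, which should roughly match the mean fullness each host reports.

```python
import re

# Tails of the two sample clomd.log lines shown above.
UPGRADED = (
    "Cluster usages - usableDataMDs:46 hFull[P]:1.000 meanFull[P]:0.845] "
    "lFull[P]:0.177] hFull[S]:0.000] meanFull[S]:0.000] lFull[S]:0.000] "
    "capTotal:74569760268288 capUsed:63017823309446 "
    "ssdCapTotal:7201504601088 ssdCapUsed:0"
)
NON_UPGRADED = (
    "Cluster usages - usableDataMDs:46 highestFullness:0.472 "
    "meanFullness:0.407 lowestFullness:0.176 "
    "capTotal:74569760268288 capUsed:30323956320563 "
    "ssdCapTotal:14680064 ssdCapUsed:0"
)


def parse_usage(line):
    """Return the key:value pairs after 'Cluster usages - ' as a dict of floats."""
    _, _, tail = line.partition("Cluster usages - ")
    # Keys may contain brackets (e.g. hFull[P]); values are integers or decimals.
    pairs = re.findall(r"([A-Za-z][\w\[\]]*):([0-9.]+)", tail)
    return {key: float(value) for key, value in pairs}


def used_ratio(stats):
    """Fraction of total capacity reported as used (capUsed / capTotal)."""
    return stats["capUsed"] / stats["capTotal"]


for name, line in (("upgraded", UPGRADED), ("non-upgraded", NON_UPGRADED)):
    stats = parse_usage(line)
    print(name, "capUsed/capTotal =", round(used_ratio(stats), 3))
```

Both hosts report the same capTotal, so a large gap between the two ratios points to the hosts disagreeing about capUsed rather than about the size of the cluster.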
This issue can also occur if some objects in the cluster have OSR=100 (object space reservation of 100%).
To identify the issue, run cmmds-tool on both an upgraded host and a non-upgraded host, querying a capacity tier disk that is owned by a non-upgraded host, and compare the results.

cmmds-tool find -f python -t DISK_USAGE -o <host uuid of non-upgraded host>

Example:
- cmmds-tool find -f python -t DISK_USAGE -o 5e28b18c-XXXX-XXXX-XXXX-XXXXXXXXX469
- Entry for the disk as seen from the upgraded host:
  {
     "uuid": "52752e85-XXXX-XXXX-XXXX-XXXXXXXXX124",
     "owner": "5e28b18c-XXXX-XXXX-XXXX-XXXXXXXXX469", <--- host uuid of non-upgraded host
     "health": "Healthy",
     "revision": "626",
     "type": "DISK_USAGE",
     "flag": "2",
     "minHostVersion": "3",
     "md5sum": "f88ec86031000aa1d44bb19ae0683e7d",
     "valueLen": "200",
     "content": "{\"capacityReserved\": 26705133568, \"iopsReserved\": 0, \"throughPutReserved\": 0, \"l2CacheReserved\": 0, \"l1CacheReserved\": 0, \"addressSpaceSize\": 300228280320, \"nsAddressSpaceSize\": 273804165120, \"numTotalComponents\": 5, \"physCapacityUsed\": 713031680, \"logicalCapacityReserved\": 26705133568, \"physDiskCapacityReserved\": 26705133568, \"logicalCapacityRequested\": 0, \"dgLogicalCapacityRequested\": 26705133568}",
     "errorStr": "(null)"
  },
- Entry for the same disk as seen from the non-upgraded host:
  
  {
     "uuid": "52752e85-XXXX-XXXX-XXXX-XXXXXXXXX124",
     "owner": "5e28b18c-XXXX-XXXX-XXXX-XXXXXXXXX469", <--- host uuid of non-upgraded host
     "health": "Healthy",
     "revision": "627",
     "type": "DISK_USAGE",
     "flag": "2",
     "minHostVersion": "3",
     "md5sum": "f88ec86031000aa1d44bb19ae0683e7d",
     "valueLen": "200",
     "content": "{\"capacityReserved\": 26705133568, \"iopsReserved\": 0, \"throughPutReserved\": 0, \"l2CacheReserved\": 0, \"l1CacheReserved\": 0, \"addressSpaceSize\": 300228280320, \"nsAddressSpaceSize\": 273804165120, \"numTotalComponents\": 5, \"physCapacityUsed\": 713031680, \"logicalCapacityReserved\": 26705133568, \"physDiskCapacityReserved\": 26705133568, \"logicalCapacityRequested\": 0, \"physDiskCapacityRequested\": 26705133568}",
     "errorStr": "(null)"
  },
- Comparing the outputs shows that the last attribute on the upgraded host is dgLogicalCapacityRequested, while the last attribute on the non-upgraded host is physDiskCapacityRequested. The values of these two attributes are the same on both hosts.
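
The comparison can also be done with a short script instead of by eye. This is a minimal sketch, assuming the "content" field of each entry has been copied out as a JSON string (the values below are taken from the two example entries above); the helper name diff_content is hypothetical and not part of cmmds-tool.

```python
import json

# Attributes that are identical in both example entries above.
COMMON = (
    '"capacityReserved": 26705133568, "iopsReserved": 0, '
    '"throughPutReserved": 0, "l2CacheReserved": 0, "l1CacheReserved": 0, '
    '"addressSpaceSize": 300228280320, "nsAddressSpaceSize": 273804165120, '
    '"numTotalComponents": 5, "physCapacityUsed": 713031680, '
    '"logicalCapacityReserved": 26705133568, '
    '"physDiskCapacityReserved": 26705133568, '
    '"logicalCapacityRequested": 0, '
)
# Only the last attribute differs between the two hosts.
CONTENT_UPGRADED = "{" + COMMON + '"dgLogicalCapacityRequested": 26705133568}'
CONTENT_NON_UPGRADED = "{" + COMMON + '"physDiskCapacityRequested": 26705133568}'


def diff_content(content_a, content_b):
    """Compare two DISK_USAGE content strings.

    Returns (keys only in a, keys only in b, shared keys whose values differ).
    """
    a = json.loads(content_a)
    b = json.loads(content_b)
    only_a = [key for key in a if key not in b]
    only_b = [key for key in b if key not in a]
    changed = {key: (a[key], b[key]) for key in a if key in b and a[key] != b[key]}
    return only_a, only_b, changed


only_up, only_non, changed = diff_content(CONTENT_UPGRADED, CONTENT_NON_UPGRADED)
print("only on upgraded host:    ", only_up)
print("only on non-upgraded host:", only_non)
print("shared keys with different values:", changed)
```

An attribute name that appears on only one side, while every shared attribute carries the same value, matches the pattern described above.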

Environment

VMware vSphere 6.7.x
VMware vSphere 6.5.x

Cause

The issue is caused by new CMMDS attributes added in the 6.7U3 release. During the upgrade, non-upgraded hosts propagate the CMMDS attributes of the disks they own to the upgraded hosts. Because the upgraded hosts have new attributes inserted between the existing ones, the updates received from non-upgraded hosts land in the wrong positions within the CMMDS entries on the upgraded hosts. As a result, the upgraded hosts report incorrect disk statistics.
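
The effect of inserting attributes between existing ones can be illustrated with a toy example. This is only a sketch of the general failure mode: the field names, their order, and the position of the new attribute are hypothetical and do not reflect the actual CMMDS entry layout, which this article does not describe.

```python
# Hypothetical layouts: the newer one inserts "new_attr" between existing fields.
OLD_LAYOUT = ["reserved", "used", "requested"]
NEW_LAYOUT = ["reserved", "new_attr", "used", "requested"]


def decode(values, layout):
    """Assign values to field names purely by position."""
    return dict(zip(layout, values))


# An update encoded by a host that still uses the old layout.
update_from_old_host = [100, 40, 0]

as_sent = decode(update_from_old_host, OLD_LAYOUT)
as_read = decode(update_from_old_host, NEW_LAYOUT)

print("as sent by old host:", as_sent)
print("as read by new host:", as_read)
```

In this toy case the value sent as "used" is read as "new_attr", and "used" receives the value that was sent as "requested", so the reader ends up with statistics that do not describe the disk.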

Resolution

The issue persists until all nodes in the cluster are upgraded. To make progress, consider one of the following options:

- Power off all VMs, then upgrade all hosts, placing each host in maintenance mode with the No Action option (EMM-NoAction).
- Add extra resources (hosts or disks) to the cluster.

Note: The issue is fixed in vSphere 7.0GA/6.7U3g and newer versions.