These log entries indicate that decommissioning failed shortly after the host attempted to enter maintenance mode.
CLOM_ProcessDecomUpdate: Node 5772cd13-####-####-####-########ab7 state change. Old:DECOM_STATE_STARTED New:DECOM_STATE_FAILED Mode:2 JobUuid:896de263-92f5-a267-f43e-13f1beb10e75
CLOM_ProcessDecomUpdate: Node 5772cd13-####-####-####-########ab7 state change. Old:DECOM_STATE_FAILED New:DECOM_STATE_NONE Mode:0 JobUuid:00000000-0000-0000-0000-000000000000
You see a decommission error on a helper node with the error code Out of resources:
CLOMDecomFailureCb: Saw decommission error on some helper node
CLOMDecomFailDecommissioning: Failing Decommissioning. Error code Out of resources
Note: This clomd log error can also appear in specific situations where the on-disk format version is below 3.
Note: A helper node in this context refers to another hypervisor node in the vSAN cluster.
You also see errors similar to:
CLOM_LogOpReport: OpReport: type:4 status:Not found reason:11
Total nodes:8 Total disks:32 disksUsable:58 phyDisksUsable:20 disksNeeded:74 disks decommisionning:0 disks with Node decommisionning:4 unhealthy disks:0 disks with unhealthy ssds or nodes:0 disks with insufficient space:5 disks with max components:0 disks with storage type mismatch:0 disks with version mismatch:0 disks in witness nodes:0 disk failed affinity/anti-affinity rule:0
Total fds:8 fdsUsable:0 fdsNeeded:0 capacityUsable: 0 capacityNeeded: 0
CLOMDecom_PublishFailure: Failed to decommission 7831ab57-####-####-####-########ac4 with error Underlying device has no free space
CLOMDecomPublishReconfigFailure: Adding FAILED entry for uuid 896de263-####-####-####-########e75
This issue occurs when the cluster does not have enough resources to relocate the data required to enter maintenance mode. clomd should report which object(s) cannot be moved, along with an error message detailing what is needed.
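For offline analysis, the counters in a CLOM_LogOpReport line can be extracted with a short script. This is a minimal sketch: the sample line and field names are copied verbatim from the log excerpt above, and the computed gap is only the raw difference between disksNeeded and disksUsable (the exact number of additional disks required also depends on component placement).

```python
import re

# Sample CLOM_LogOpReport line, copied from the clomd.log excerpt above.
op_report = (
    "CLOM_LogOpReport: OpReport: type:4 status:Not found reason:11 "
    "Total nodes:8 Total disks:32 disksUsable:58 phyDisksUsable:20 "
    "disksNeeded:74 disks decommisionning:0 disks with Node decommisionning:4 "
    "unhealthy disks:0 disks with unhealthy ssds or nodes:0 "
    "disks with insufficient space:5 disks with max components:0 "
    "disks with storage type mismatch:0 disks with version mismatch:0 "
    "disks in witness nodes:0 disk failed affinity/anti-affinity rule:0"
)

# Pull out every "<name>:<number>" pair; key names may contain spaces.
counters = dict(
    (key.strip(), int(value))
    for key, value in re.findall(r"([A-Za-z /-]+?):(\d+)", op_report)
)

# Raw gap between disks needed and disks usable for the rebuild.
shortfall = counters["disksNeeded"] - counters["disksUsable"]
print(f"disksUsable={counters['disksUsable']} "
      f"disksNeeded={counters['disksNeeded']} gap={shortfall}")
```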
Note: VMware recommends 25-30% “slack space” or free space for tasks and scenarios such as virtual machine snapshots, disk utilization rebalancing, hardware failures, and fault tolerance method reconfiguration.
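As a quick sanity check against the slack-space guideline above, free capacity can be compared to the 25% lower bound. A minimal sketch; the capacity figures are illustrative, not taken from any real cluster:

```python
# Lower bound of VMware's 25-30% slack-space recommendation.
SLACK_FRACTION = 0.25

def has_enough_slack(total_capacity_gb: float, used_capacity_gb: float,
                     slack_fraction: float = SLACK_FRACTION) -> bool:
    """Return True if free space meets or exceeds the slack-space target."""
    free_gb = total_capacity_gb - used_capacity_gb
    return free_gb >= total_capacity_gb * slack_fraction

# Example: a 10 TB cluster with 8 TB used has only 20% free.
print(has_enough_slack(10240, 8192))  # False
```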
Resolution
To resolve this issue, use one of the following options:
Add more disks or disk groups to the cluster to provide enough contiguous space (Recommended).
Increase the stripe width of the object to decrease the component sizes at the cost of additional components (Recommended).
Change the FTT configuration of the VMDK object to 0 to immediately free the capacity used for redundancy, at the cost of protection.
Additional Information
Note: Because vSAN has a distributed architecture, check the clomd.log on all hosts to identify which objects are preventing the host from entering maintenance mode.
For example:
Placing host1 into maintenance mode fails with this error:
CLOMDecomFailureCb: Saw decommission error on some helper node
CLOMDecomFailDecommissioning: Failing Decommissioning. Error code Out of resources
After checking all hosts, the root cause is that there is not enough contiguous space to relocate the components from the host entering maintenance mode to the other hosts in the cluster. Specifically, 4 more disks with enough contiguous capacity are required to hold these components.
CLOM_LogOpReport: OpReport: type:4 status:Not found reason:11
Total nodes:8 Total disks:32 disksUsable:58 phyDisksUsable:20 disksNeeded:74 disks decommisionning:0 disks with Node decommisionning:4 unhealthy disks:0 disks with unhealthy ssds or nodes:0 disks with insufficient space:5 disks with max components:0 disks with storage type mismatch:0 disks with version mismatch:0 disks in witness nodes:0 disk failed affinity/anti-affinity rule:0
Total fds:8 fdsUsable:0 fdsNeeded:0 capacityUsable: 0 capacityNeeded: 0
CLOMDecom_PublishFailure: Failed to decommission ########-####-####-####-########3ac4 with error Underlying device has no free space
CLOMDecomPublishReconfigFailure: Adding FAILED entry for uuid 896de263-####-####-####-########e75
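When clomd.log has been collected from every host (for example, from a log bundle), the failure markers shown in this article can be searched in one pass. A minimal sketch; the directory layout logs/&lt;hostname&gt;/clomd.log is a hypothetical convention for offline analysis, and the match patterns are taken from the log lines above:

```python
import re
from pathlib import Path

# Failure markers copied from the clomd.log excerpts in this article.
FAILURE_PATTERNS = [
    re.compile(r"CLOMDecomFailDecommissioning: Failing Decommissioning"),
    re.compile(r"CLOMDecom_PublishFailure: Failed to decommission"),
    re.compile(r"CLOM_LogOpReport: OpReport:"),
]

def scan_clomd_log(path: Path) -> list[str]:
    """Return log lines matching any known decommission-failure pattern."""
    hits = []
    for line in path.read_text(errors="replace").splitlines():
        if any(p.search(line) for p in FAILURE_PATTERNS):
            hits.append(line)
    return hits

# Hypothetical layout: one clomd.log copy per host under ./logs/<hostname>/.
for log in Path("logs").glob("*/clomd.log"):
    for hit in scan_clomd_log(log):
        print(f"{log.parent.name}: {hit}")
```

Running this against all collected logs shows which host(s) reported the OpReport and publish-failure lines, which is where to start when identifying the objects that cannot be relocated.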