vSAN and vSphere maintenance modes may diverge

Products

VMware vSAN

Issue/Introduction

Every vSAN node has a decommission state. Typically, a decommissioned node is also in vSphere maintenance mode, but the two states can diverge.

Symptoms:

VMware vSAN nodes have two overall states: commissioned and decommissioned. A decommissioned node is in a vSAN-specific maintenance mode. In rare scenarios, a node can be decommissioned in vSAN while the host is not in vSphere maintenance mode. This can be identified by checking the node's decommission state in the vSAN cluster directory (CMMDS).

When a node is decommissioned, it remains a participant in the vSAN cluster, but its resources (disks, components, etc.) are unavailable. If the host is not also in maintenance mode, it appears to be in normal vSphere production but may not be contributing resources to the vSAN cluster. This can manifest in multiple ways:

The vSAN datastore capacity is smaller than expected
Resync may not make progress
Objects may be inaccessible

Environment

VMware vSAN (All Versions)

Resolution

If a node is decommissioned but the host is not in maintenance mode, place the host into maintenance mode with No Data Migration, then take it out of maintenance mode. This resets the vSAN decommission status and makes it consistent with the vSphere maintenance mode state.
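The condition this resolution applies to can be sketched as a small shell check. This is a minimal illustration, not a supported tool: check_divergence is a hypothetical helper name, the decomState codes are the ones listed in the tables in this article, and the "Enabled"/"Disabled" strings are assumed to be what esxcli system maintenanceMode get prints on the host.

```shell
#!/bin/sh
# Sketch: classify a host from two values gathered on it.
#   $1 = decomState from the host's NODE_DECOM_STATE entry (0, 1, 3 or 6)
#   $2 = vSphere maintenance mode state, "Enabled" or "Disabled"
#        (assumption: the output of `esxcli system maintenanceMode get`)
# check_divergence is a hypothetical name used only for this example.
check_divergence() {
    case "$1:$2" in
        # Decommissioned in vSAN, host not in maintenance mode:
        # the case this article's resolution addresses.
        6:Disabled)            echo "diverged" ;;
        # Both views agree.
        0:Disabled|6:Enabled)  echo "consistent" ;;
        # Decommissioning started or underway; re-check once it finishes.
        1:*|3:*)               echo "in-progress" ;;
        # Any other combination is not described in this article.
        *)                     echo "unknown" ;;
    esac
}
```

For example, check_divergence 6 Disabled prints "diverged", which is the situation where the enter-and-exit maintenance mode procedure above applies.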

The vSAN decommission status is tracked in the vSAN cluster directory (CMMDS). The decommission state and job type reveal the status of the node from the vSAN standpoint.

For all versions of vSAN, the decommission state is recorded in CMMDS in the "NODE_DECOM_STATE" directory entry and can be queried from there. This directory entry, when formatted as JSON, looks like:

{
"entries":
[
{
"uuid": "57786c67-8501-bb75-7a1e-005056af00a0",
"owner": "57786c67-8501-bb75-7a1e-005056af00a0",
"health": "Healthy",
"revision": "28",
"type": "NODE_DECOM_STATE",
"flag": "2",
"minHostVersion": "0",
"md5sum": "3c2593056659ee3c9e97039a3eefea8e",
"valueLen": "80",
"content": {"decomState": 0, "decomJobType": 0, "decomJobUuid": "00000000-0000-0000-0000-000000000000", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 0},
"errorStr": "(null)"
}
]
}

Note: For commands to examine the decommission state for nodes in a vSAN cluster using the CMMDS tool, see the Additional Information section.

The decomState and decomJobType fields in the "content" value of the above sample reveal the decommission state and the decommission job type.

For more information about the various decommission states and job types, refer to the tables below:

Decommission State	Meaning
0	None - the node is not decommissioned
1	The decommissioning process has been started
3	The decommissioning process is underway
6	The node has been decommissioned

Decommission Job Type*	Meaning
0	The node has been decommissioned in the "No Data Migration" mode.
1	The node has been decommissioned in the "Ensure Accessibility" mode.
2	The node has been decommissioned in the "Full Data Migration" mode.

* The decommission job type value is only meaningful if the decommission state is non-zero!
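
The two tables above can be expressed as a pair of lookup functions, which may be convenient when reading raw values in a script. This is only a sketch: decom_state_meaning and decom_job_meaning are hypothetical names, and the mappings are copied from the tables rather than from any other source.

```shell
#!/bin/sh
# Sketch: translate the numeric codes from the tables into text.
# Both function names are hypothetical, chosen for this example.

# $1 = decomState
decom_state_meaning() {
    case "$1" in
        0) echo "not decommissioned" ;;
        1) echo "decommissioning started" ;;
        3) echo "decommissioning underway" ;;
        6) echo "decommissioned" ;;
        *) echo "unknown state $1" ;;
    esac
}

# $1 = decomState, $2 = decomJobType
# The job type is only meaningful when the state is non-zero.
decom_job_meaning() {
    if [ "$1" = "0" ]; then
        echo "n/a"
        return
    fi
    case "$2" in
        0) echo "No Data Migration" ;;
        1) echo "Ensure Accessibility" ;;
        2) echo "Full Data Migration" ;;
        *) echo "unknown job type $2" ;;
    esac
}
```

With these definitions, decom_state_meaning 6 prints "decommissioned" and decom_job_meaning 6 1 prints "Ensure Accessibility", while decom_job_meaning 0 0 prints "n/a" because the job type carries no meaning for a node that is not decommissioned.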

Additional Information

To examine the decommission state of a node in a vSAN cluster, use the following procedure:

Log in to the ESXi host via SSH or local console (physical or KVM console)
Get the host UUID by running cmmds-tool whoami
Query the node's decommission state by running the command:
# cmmds-tool find -t NODE_DECOM_STATE -f json -u <host_uuid>

Example output:
{
"entries":
[
{
"uuid": "57786c67-8501-bb75-7a1e-005056a######",
"owner": "57786c67-8501-bb75-7a1e-005056a######",
"health": "Healthy",
"revision": "37",
"type": "NODE_DECOM_STATE",
"flag": "2",
"minHostVersion": "0",
"md5sum": "4c62595402a792589d69ef8ba5e952b6",
"valueLen": "80",
"content": {"decomState": 6, "decomJobType": 1, "decomJobUuid": "b6fd5a5b-db8d-46b1-3999-2e0e300######", "progress": 0, "affObjList": [ ], "errorCode": 0, "updateNum": 0, "majorVersion": 2},
"errorStr": "(null)"
}
]
}
In this example, the node has been decommissioned using the Ensure Accessibility mode.
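
To pull only the two relevant fields out of output like the example above, a sed filter along the following lines could be used. It is a sketch that assumes the "content" line has the same field order and spacing as the sample shown in this article; parse_decom_state is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: read cmmds-tool JSON output on stdin and print
# "decomState,decomJobType" for each NODE_DECOM_STATE content line.
# Assumes decomState precedes decomJobType on one line, as in the sample.
# parse_decom_state is a hypothetical name used only for this example.
parse_decom_state() {
    sed -n 's/.*"decomState": *\([0-9][0-9]*\),.*"decomJobType": *\([0-9][0-9]*\),.*/\1,\2/p'
}
```

On a host, the output of the cmmds-tool find command from the procedure above could be piped into parse_decom_state. For the example output shown, it would print 6,1.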

To examine the decommission state of all healthy nodes in the vSAN cluster, use the following procedure:

Log in to the ESXi host via SSH or local console (physical or KVM console)
Query all healthy nodes' decommission state via CMMDS, using the cmmds-tool command:
# echo "hostname,decomState,decomJobType";for host in $(cmmds-tool find -t HOSTNAME -f json |grep -B2 Healthy|grep uuid|awk -F \" '{print $4}');do hostName=$(cmmds-tool find -t HOSTNAME -f json -u $host|grep content|awk -F \" '{print $6}');decomInfo=$(cmmds-tool find -t NODE_DECOM_STATE -f json -u $host |grep content|awk '{print $3 $5}'|sed 's/,$//');echo "$hostName,$decomInfo";done|sort

Example output:
hostname,decomState,decomJobType
vsanhost-1 ,0,0
vsanhost-2 ,0,0
vsanhost-3 ,6,1
vsanhost-4 ,0,0
vsanhost-5 ,0,0
vsanhost-6 ,0,0
In this example, we see that node vsanhost-3 is fully decommissioned, and it was decommissioned in the Ensure Accessibility mode.

Note: The number of nodes reported here should match the number of nodes in the vSAN cluster.
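
When the cluster has many hosts, the CSV output above can be filtered rather than read by eye. The following is a minimal sketch that assumes the exact three-column format shown in the example output (hostname,decomState,decomJobType); list_decommissioned and count_nodes are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: post-process the "hostname,decomState,decomJobType" CSV on stdin.
# Both function names are hypothetical, chosen for this example.

# Print the hostname of every node whose decomState is non-zero.
# The header line is skipped because its second column is not numeric.
list_decommissioned() {
    awk -F, '$2 ~ /^[0-9]+$/ && $2 != 0 {
        gsub(/ +$/, "", $1)   # drop the trailing space after the hostname
        print $1
    }'
}

# Count the data rows, for comparison with the expected cluster size.
count_nodes() {
    awk -F, '$2 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}
```

Fed the example output above, list_decommissioned prints vsanhost-3 and count_nodes prints 6. Any host listed can then be compared against its vSphere maintenance mode state.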