Upgrading vSAN On-Disk format to 3.0 may fail in small vSAN clusters or 2-node stretched clusters

Products

VMware vSAN
VMware vSphere ESXi

Issue/Introduction

The vSAN on-disk upgrade fails in some clusters if the selected storage policies cannot be satisfied during the upgrade process. Such clusters will need special parameters to complete the upgrade.

Symptoms:

Attempting an on-disk upgrade in certain vSAN configurations may result in failure. Configurations that can cause these errors are:

The stretched vSAN Cluster consists of two ESXi Hosts and the Witness Node (2-node configuration)
Each Host in the Stretched Cluster contains a single vSAN Disk Group
A vSAN cluster consists of three normal nodes, with one disk group per node
A vSAN cluster is very full, preventing the "full data migration" disk-group decommission mode

On-disk upgrade failures due to these configurations may result in the following errors:

A general system error occurred
In the CLOMD.log file on one of the cluster nodes, you see entries similar to:

2016-03-28T22:48:24.610Z 33495 CLOMDecomCMMDSResponseCb: CMMDS update response received: Success
2016-03-28T22:48:24.932Z 33495 CLOMDecomFailureCb: Saw decommission error on some helper node
2016-03-28T22:48:24.932Z 33495 CLOMDecomFailDecommissioning: Failing Decommissioning. Error code Out of resources

For more information on gathering log files, see Collecting vSAN support logs and uploading to VMware (2072796).
When you run the vsan.ondisk_upgrade command using the VMware Ruby vSphere Console (RVC), you see output similar to:

RemoveDiskMapping 192.168.0.206: SystemError: A general system error occurred: Failed to evacuate data for disk uuid 522752b1-8ac5-17dd-a380-d461ea53591f with error: Out of Resources to complete the operation
2016-04-01 13:01:31 -0700: Failed to remove this disk group from Virtual SAN
2016-04-01 13:01:31 -0700: A general system error occurred: Failed to evacuate data for disk uuid 523452b1-8ab5-1edd-a340-d461ea53591c with error: Out of resources to complete the operation
2016-04-01 13:01:39 -0700: Failed to remove diskgroup on host 192.168.0.206 from Virtual SAN.
2016-04-01 13:01:39 -0700: Upgrade tool stopped due to error, please address reported issue and re-run the tool again to finish upgrade.

Note: These log excerpts are examples. Dates, times, identifiers and other items may vary depending on your environment.
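To check a collected log bundle for this failure signature, the excerpts above can be matched with a short script. The following is a minimal sketch, not a VMware tool: the function name, pattern list, and sample lines are illustrative assumptions built only from the excerpts shown in this article, and real log lines may differ.

```python
import re

# Signatures derived from the CLOMD and RVC excerpts in this article.
# Matching is case-insensitive because the excerpts show both
# "Out of resources" and "Out of Resources".
FAILURE_PATTERNS = [
    re.compile(r"CLOMDecomFailDecommissioning: .*out of resources", re.IGNORECASE),
    re.compile(
        r"Failed to evacuate data for disk uuid [0-9a-f-]+ .*out of resources",
        re.IGNORECASE,
    ),
]


def find_evacuation_failures(lines):
    """Return the lines that match one of the evacuation-failure signatures."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS)
    ]


# Sample input copied from the excerpts above (hypothetical test data).
SAMPLE_LINES = [
    "2016-03-28T22:48:24.610Z 33495 CLOMDecomCMMDSResponseCb: CMMDS update response received: Success",
    "2016-03-28T22:48:24.932Z 33495 CLOMDecomFailureCb: Saw decommission error on some helper node",
    "2016-03-28T22:48:24.932Z 33495 CLOMDecomFailDecommissioning: Failing Decommissioning. Error code Out of resources",
    "RemoveDiskMapping 192.168.0.206: SystemError: A general system error occurred: "
    "Failed to evacuate data for disk uuid 522752b1-8ac5-17dd-a380-d461ea53591f "
    "with error: Out of Resources to complete the operation",
]

for hit in find_evacuation_failures(SAMPLE_LINES):
    print(hit)
```

Because the function accepts any iterable of strings, an open file handle for a copy of the CLOMD log can be passed in place of SAMPLE_LINES.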

Environment

VMware vSAN 6.2.x
VMware vSAN 6.5.x
VMware vSAN 6.6.x

Cause

The on-disk upgrade process attempts to ensure that data in the vSAN cluster is not exposed to loss during the process. To accomplish this, the on-disk upgrade process evacuates the disk group being upgraded before destroying and re-creating it with the new on-disk format version. In minimally sized configurations, certain ROBO configurations, or clusters with very little free space, this evacuation may not be possible in a manner that complies with the applicable storage policies. If the "Host Failures to Tolerate" policy specification is set to "1" (as it is by default), disk-group evacuation may not be possible in a policy-compliant manner.
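The arithmetic behind this can be sketched as follows. This is a simplified illustration under stated assumptions, not how vSAN itself evaluates placement: it assumes mirrored (RAID-1) objects, which need 2 × FTT + 1 fault domains, it counts the witness of a 2-node cluster as one fault domain, and it ignores free capacity entirely. The function names are hypothetical.

```python
def min_fault_domains(failures_to_tolerate):
    """Fault domains a mirrored object needs: FTT + 1 replicas plus FTT witnesses."""
    return 2 * failures_to_tolerate + 1


def can_evacuate_in_compliance(fault_domains, disk_groups_per_host, failures_to_tolerate=1):
    """Rough check: can one disk group be emptied without violating the policy?

    Assumption: with a single disk group per host, evacuating it removes that
    host from placement, so one fault domain is lost for the duration. With
    more than one disk group per host, data can move to a sibling disk group
    on the same host (capacity permitting, which this sketch does not model).
    """
    if disk_groups_per_host == 1:
        remaining = fault_domains - 1
    else:
        remaining = fault_domains
    return remaining >= min_fault_domains(failures_to_tolerate)


# Three-node cluster, or two data nodes plus a witness, one disk group each.
print(can_evacuate_in_compliance(fault_domains=3, disk_groups_per_host=1))  # False
# A fourth node leaves three fault domains available during evacuation.
print(can_evacuate_in_compliance(fault_domains=4, disk_groups_per_host=1))  # True
```

In the failing configurations listed earlier, the first case applies: only two fault domains remain while a disk group is being rebuilt, which is fewer than the three that a policy of "1" requires.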

Resolution

To allow the upgrade to proceed in these configurations, a compromise to availability must be made. Data accessibility will be maintained, but the redundant copy of the data will be lost and rebuilt during the upgrade process. As a result, the data is exposed to faults during the upgrade: a failure such as the loss of a disk on another node may result in data loss. This exposure to additional failure risk is referred to as "reduced redundancy," and must be manually specified in the Ruby vSphere Console (RVC) to allow the upgrade to proceed. In versions earlier than vSAN 6.6, it is not possible to specify reduced redundancy when using the vSphere Web Client to start the upgrade.
For more information on accessing and using RVC, see the VMware Ruby vSphere Console Command Reference guide.

Caution: During upgrade, a single point of failure is exposed. Follow all VMware best practices, and your business practices, regarding the backup of important data and virtual machines.

To complete the upgrade process with reduced redundancy:

Log in to RVC and change directory to the vSAN cluster in question.
Run the following command, where cluster_location is the path to the cluster, to force the upgrade of the vSAN disk groups (at both sites in a stretched cluster):
vsan.ondisk_upgrade cluster_location --allow-reduced-redundancy
For example, if you are already in the cluster folder:
vsan.ondisk_upgrade . --allow-reduced-redundancy
Complete the vSAN upgrade.

In vSAN 6.6 and later, this operation can be done using the vSphere Web Client:

Navigate to vSAN cluster > Configure > General.
Click Upgrade.
Select the checkbox Allow Reduced Redundancy.
Click Yes to confirm.

Note: During this process, there is no way to ensure perfect data redundancy for the vSAN datastore. This state remains through the entire upgrade process and the resulting resync must take place after each disk group is upgraded. Wait for this process to complete and ensure that you have followed all VMware best practices and your business practices related to the backup of important data and virtual machines.

Additional Information

How to collect vSAN support logs and upload to VMware
Cannot perform a Virtual SAN on disk upgrade due to Auto-Claim being enabled