Maintaining cluster capacity in VCF during upgrades with non-operational ESXi hosts

Article ID: 312062

Updated On:

Products

VMware Cloud Foundation

Issue/Introduction

This article provides details on how to maintain a vSAN cluster's capacity in VCF when an ESXi host becomes non-operational while an upgrade is in progress on that cluster. It also provides best practices for expanding vSAN clusters with mixed software versions during an ongoing upgrade.

Description
ESXi hosts can become non-operational for many reasons, and this can happen during cluster upgrades. When no other cluster operations are in progress in VCF, the ESXi hosts can be replaced with new ones using the day-N add host workflow. When an upgrade is running on the cluster, this is currently not possible because the cluster is locked for the duration of the upgrade.

Apart from the procedure described in the Mitigation section below, there are two important considerations when expanding a partially upgraded vSAN cluster:

You must verify that the ESXi host being added does not have any existing vSAN disk groups. Adding a new ESXi host with previously created vSAN disk groups can lead to unexpected network partitions and unexpected loss of data connectivity.
The new ESXi host being added must be at the target version of the ongoing upgrade. The VCF add host workflow does not support adding a host at the source version when the cluster is partially upgraded.
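As a quick manual check, both conditions can be verified from an SSH session on the candidate host before it is added. This is a minimal sketch; the expected version/build is whatever target version the ongoing upgrade is moving the cluster to.

    # Confirm the host's ESXi version and build match the upgrade target version
    vmware -vl

    # Confirm no disks on the host are already claimed by vSAN
    # (no device should report a State of "In-use for VSAN")
    vdq -q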


Symptoms:
A cluster's capacity has been reduced because an ESXi host became non-operational during an ongoing upgrade.

Environment

VMware Cloud Foundation 4.x
VMware Cloud Foundation 5.0

Cause

An ESXi host becomes non-operational during an ongoing upgrade of the cluster, which reduces the capacity of the cluster.

Resolution

Mitigation
This article describes a mitigation plan that can be used on VCF 4.x releases. The mitigation is to cancel the ongoing upgrade, replace the non-operational ESXi host by removing it and adding a new ESXi host to the cluster, and then resume the upgrade of the cluster.

Note: The steps below are applicable only when upgrades have been initiated on each cluster separately. If an upgrade has been initiated on multiple clusters at once, the cancel operation will stop the ongoing upgrade on all clusters that are part of that upgrade operation.

Steps

1. Get the in-progress upgrades
a. GET v1/upgrades?status=inprogress
b. The response contains the taskId of the current upgrade (the taskId is the same as the upgradeId).
c. To select the upgrade task for the right cluster, verify that resourceType = CLUSTER and compare resourceUpgradeSpecs.resourceId to the output of the /v1/clusters API.
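For example, assuming SDDC Manager is reachable at sddc-manager.example.com and that token-based authentication against the /v1/tokens endpoint is used (the hostname, credentials, and the use of jq below are placeholders/assumptions, not part of this article), the calls can be made with curl:

    # Obtain an API access token from SDDC Manager (placeholder credentials)
    TOKEN=$(curl -sk -X POST https://sddc-manager.example.com/v1/tokens \
      -H "Content-Type: application/json" \
      -d '{"username": "administrator@vsphere.local", "password": "********"}' \
      | jq -r '.accessToken')

    # List the upgrades that are currently in progress
    curl -sk "https://sddc-manager.example.com/v1/upgrades?status=inprogress" \
      -H "Authorization: Bearer $TOKEN" | jq .

    # List clusters to map resourceUpgradeSpecs.resourceId to a cluster name
    curl -sk "https://sddc-manager.example.com/v1/clusters" \
      -H "Authorization: Bearer $TOKEN" | jq .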

Sample output from GET /v1/upgrades
        {
            "id": "604bd056-2ce8-4ff3-b79e-25c55a643749",
            "bundleId": "c6a50311-47be-4b53-891d-9f5ecb75d087",
            "resourceType": "CLUSTER",
            "parallelUpgrade": true,
            "resourceUpgradeSpecs": [
                {
                    "resourceId": "cab93620-afcb-4121-a109-ed1ddd407651",
                    "scheduledTimestamp": "2022-10-04T07:44:06.728Z",
                    "enableQuickboot": false
                }
            ],
            "status": "COMPLETED_WITH_SUCCESS",
            "taskId": "604bd056-2ce8-4ff3-b79e-25c55a643749"
        }

Sample output from GET /v1/clusters
        {
            "id": "cab93620-afcb-4121-a109-ed1ddd407651",
            "name": "cl2",
            "primaryDatastoreName": "vum-wld1-vum-wld1-vcenter-cl2-vsan01",
            "primaryDatastoreType": "VSAN",
            "hosts": [
                {
                    "id": "5c51d13f-7570-40b7-b1e8-315723a55c8d"
                },
                {
                    "id": "c371362d-c2a8-4138-99bc-5ee2ff85e657"
                },
                {
                    "id": "b5293913-5085-4895-a367-98d1c4255173"
                },
                {
                    "id": "5179404a-254b-46f0-8d4b-855d0f4f3f6e"
                }
            ],
            "isStretched": false,
            "isDefault": false
        }

   
2. Cancel the in-progress upgrade for only the cluster that requires host replacement:
a. DELETE v1/tasks/{id}
b. This effectively pauses the ongoing upgrade.
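A minimal curl sketch of the cancel call, reusing the token from step 1 and the taskId shown in the sample output above (substitute your own task ID):

    # Cancel (effectively pause) the in-progress upgrade task for this cluster
    curl -sk -X DELETE \
      "https://sddc-manager.example.com/v1/tasks/604bd056-2ce8-4ff3-b79e-25c55a643749" \
      -H "Authorization: Bearer $TOKEN"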
 
3. Wait for the upgrade to be cancelled or to fail by polling:
a. GET v1/tasks/{id}
b. Proceed to the next step once the upgrade has been cancelled.
c. Note: Step 2 will also pause/cancel the upgrade of any pending clusters that were scheduled as part of the same upgrade request.
d. Note: If each scheduled upgrade contained a single cluster, cancelling one upgrade will not cancel or pause the upgrades of other clusters.
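The task status can be polled with a loop like the sketch below, reusing the token and task ID from the earlier steps. The exact status strings can differ between VCF versions, so confirm them against your environment.

    # Poll the task every 30 seconds until it reports a cancelled or failed state
    TASK_ID="604bd056-2ce8-4ff3-b79e-25c55a643749"
    while true; do
      STATUS=$(curl -sk "https://sddc-manager.example.com/v1/tasks/$TASK_ID" \
        -H "Authorization: Bearer $TOKEN" | jq -r '.status' | tr '[:lower:]' '[:upper:]')
      echo "Task status: $STATUS"
      case "$STATUS" in
        *CANCEL*|*FAIL*) break ;;
      esac
      sleep 30
    done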
 
4. Remove the non-operational ESXi host from the cluster using the API or the UI.

5. (Optional) If there are no spare ESXi Host(s), decommission the ESXi Host removed from the cluster in step 4.

6. (Optional) Fix the issue on the removed ESXi host. This is only needed when there are no spare hosts and the removed host will be re-added to the cluster.

7. (Optional) Commission an ESXi host to be added to the cluster (either a spare ESXi host or the ESXi host that was removed in step 4 and subsequently fixed).

8. Add a new ESXi host to the cluster using the UI or the API.
a. Note: Ensure that the new host is at the target version of the upgrade (and not the source/previous version).
b. Note: Ensure that the new host being added has no pre-existing vSAN disk groups. Expanding a cluster with an ESXi host that has a different vSAN disk format version will lead to partition issues. Any existing vSAN disk groups must be removed manually before expanding the cluster with the new ESXi host. Starting with VCF 4.5, the Add Host workflow in SDDC Manager validates that there are no disk groups on the host being added. For previous VCF versions, the validation must be performed manually by running the command "vdq -q" on the ESXi host. If the output contains any vSAN disks that are in use, they must be removed before starting the cluster expansion.
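A sketch of the manual check and cleanup from an SSH session on the new ESXi host. The device name passed to esxcli is a placeholder; removing a disk group destroys the data on it, so only do this on a host that is not yet part of any cluster.

    # List vSAN disk status; disks already claimed by vSAN typically report
    # a State of "In-use for VSAN"
    vdq -q

    # Show the disks currently claimed by vSAN on this host
    esxcli vsan storage list

    # Remove an existing disk group by its cache-tier (SSD) device name
    # (placeholder device name shown below)
    esxcli vsan storage remove -s naa.xxxxxxxxxxxxxxxx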
 
9. Initiate the cluster upgrade again
a. This will resume the upgrade of the cluster.
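If the API is used instead of the SDDC Manager UI, the upgrade can be re-triggered with a spec that reuses the bundleId and cluster resourceId retrieved in step 1. This is only a sketch based on the sample output above; verify the exact spec fields against the API reference for your VCF version.

    # Re-trigger the cluster upgrade with the same bundle and cluster as before
    curl -sk -X POST "https://sddc-manager.example.com/v1/upgrades" \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
            "bundleId": "c6a50311-47be-4b53-891d-9f5ecb75d087",
            "resourceType": "CLUSTER",
            "resourceUpgradeSpecs": [
              { "resourceId": "cab93620-afcb-4121-a109-ed1ddd407651", "enableQuickboot": false }
            ]
          }'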


Workaround:
Currently there is no workaround.