Mitigation
This article describes a mitigation plan for VCF 4.x releases. The mitigation is to cancel the ongoing upgrade, replace the non-operational ESXi host (remove it from the cluster and add a new ESXi host in its place), and then resume the upgrade of the cluster.
Note: The steps below are applicable only when the upgrades have been initiated on each cluster separately. If an upgrade has been initiated on multiple clusters at once, then the cancel operation will stop the ongoing upgrade on all clusters that are part of the upgrade operation.
Steps
1. Get In-Progress Upgrades
a. GET /v1/upgrades?status=inprogress
b. The response contains the taskId of the current upgrade (the taskId is the same as the upgradeId).
c. To select the upgrade task for the correct cluster, verify that resourceType is CLUSTER and match resourceUpgradeSpecs.resourceId against the cluster IDs returned by the /v1/clusters API, as illustrated in the sketch after the sample outputs below.
Sample output from GET /v1/upgrades
{
  "id": "########-####-####-####-########3749",
  "bundleId": "########-####-####-####-########d087",
  "resourceType": "CLUSTER",
  "parallelUpgrade": true,
  "resourceUpgradeSpecs": [
    {
      "resourceId": "########-####-####-####-########7651",
      "scheduledTimestamp": "2022-10-04T07:44:06.728Z",
      "enableQuickboot": false
    }
  ],
  "status": "COMPLETED_WITH_SUCCESS",
  "taskId": "########-####-####-####-########3749"
}
Sample output from GET /v1/clusters
{
  "id": "########-####-####-####-########7651",
  "name": "cl2",
  "primaryDatastoreName": "vsan01",
  "primaryDatastoreType": "VSAN",
  "hosts": [
    { "id": "########-####-####-####-########5c8d" },
    { "id": "########-####-####-####-########e657" },
    { "id": "########-####-####-####-########5173" },
    { "id": "########-####-####-####-########3f6e" }
  ],
  "isStretched": false,
  "isDefault": false
}
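For illustration, a minimal curl sketch of this lookup. It assumes sddc-manager.example.com is your SDDC Manager FQDN, $TOKEN holds a valid API bearer token, and jq is installed; the exact response wrapper (for example, an "elements" array from /v1/clusters) can vary by VCF version, so adjust the jq filter to the actual payload.

# List in-progress upgrades and note the taskId (same as upgradeId)
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://sddc-manager.example.com/v1/upgrades?status=inprogress" | jq '.'

# List cluster IDs and names to match against resourceUpgradeSpecs[].resourceId
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://sddc-manager.example.com/v1/clusters" | jq '.elements[] | {id, name}'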
2. Cancel the in-progress upgrade for only the cluster that needs the host replacement (see the sketch below):
a. DELETE /v1/tasks/{id}
b. (This effectively pauses the ongoing upgrade.)
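A minimal sketch of the cancel call, under the same FQDN and $TOKEN assumptions as above; replace <taskId> with the value obtained in step 1.

# Cancel (pause) the in-progress upgrade task identified in step 1
curl -sk -X DELETE -H "Authorization: Bearer $TOKEN" \
  "https://sddc-manager.example.com/v1/tasks/<taskId>"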
3. Wait for the upgrade to be cancelled or to fail, using the polling sketch after this list:
a. GET /v1/tasks/{id}
b. Once the upgrade is cancelled, continue with the steps below.
c. Note: Step 2 pauses/cancels the upgrade of any pending clusters that were scheduled as part of the same upgrade request that triggered the upgrade.
d. Note: If each scheduled upgrade contained a single cluster, cancelling one upgrade will not cancel/pause the upgrades of other clusters.
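A minimal polling sketch, with the same assumptions as above. The exact task status strings can differ between VCF versions, so the match below is deliberately loose; review the raw response if in doubt.

# Poll the task until it reports a cancelled or failed state
while true; do
  STATUS=$(curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://sddc-manager.example.com/v1/tasks/<taskId>" | jq -r '.status')
  echo "Task status: $STATUS"
  case "$(echo "$STATUS" | tr '[:lower:]' '[:upper:]')" in
    *CANCEL*|*FAIL*) break ;;
  esac
  sleep 30
done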
4. Remove the non-operational ESXi host from the cluster using the API (see the sketch below) or the UI.
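One way to do this through the API is a cluster update with a compaction spec. This is a sketch only: <clusterId> and <hostId> are placeholders, and the exact ClusterUpdateSpec schema should be confirmed against the VCF API reference for your release.

# Sketch: shrink the cluster by removing the non-operational host
cat > compaction-spec.json <<'EOF'
{
  "clusterCompactionSpec": {
    "hosts": [
      { "id": "<hostId>" }
    ]
  }
}
EOF
curl -sk -X PATCH \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d @compaction-spec.json \
  "https://sddc-manager.example.com/v1/clusters/<clusterId>"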
5. (Optional) If there are no spare ESXi hosts, decommission the ESXi host removed from the cluster in step 4.
6. (Optional) Fix the issue on the removed ESXi host (required only when there are no spare hosts).
7. (Optional) Commission an ESXi host to be added to the cluster (either a spare ESXi host, or the host that was removed in step 4 and subsequently fixed).
8. Add a new ESXi host to the cluster using the UI or the API.
a. Note: Ensure that the new host is already at the target version of the upgrade (not the source/previous version).
b. Note: Ensure that the new host has no pre-existing vSAN disk groups. Expanding a cluster with an ESXi host that has a different vSAN disk format version will lead to partition issues, so any existing vSAN disk groups must be removed manually before expanding the cluster with the new host. Starting with VCF 4.5, the Add Host workflow in SDDC Manager validates that there are no disk groups on the host being added; on earlier VCF versions, perform this validation manually by running "vdq -q" on the ESXi host (see the check after this list). If the output lists any vSAN disks that are in use, remove them before starting the cluster expansion.
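A minimal check, run in an SSH session on the ESXi host being added. The output format of vdq can vary between ESXi builds, so review the full output rather than relying only on the filtered line.

# Show vSAN disk status for all disks on the host
vdq -q
# Surface disks reported as in use by vSAN (these must be removed before expansion)
vdq -q | grep -i "in-use"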
9. Initiate the cluster upgrade again (see the sketch below).
a. This resumes the upgrade of the cluster.
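A minimal sketch for re-initiating the upgrade via POST /v1/upgrades, reusing the field names from the GET /v1/upgrades sample in step 1. <bundleId> and <clusterId> are placeholders, and the full UpgradeSpec schema should be confirmed against the API reference for your VCF version.

# Sketch: re-initiate the cluster upgrade with the same bundle and cluster
cat > upgrade-spec.json <<'EOF'
{
  "bundleId": "<bundleId>",
  "resourceType": "CLUSTER",
  "resourceUpgradeSpecs": [
    {
      "resourceId": "<clusterId>",
      "enableQuickboot": false
    }
  ]
}
EOF
curl -sk -X POST \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d @upgrade-spec.json \
  "https://sddc-manager.example.com/v1/upgrades"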
Workaround
Currently there is no workaround.