"VUM Remediation (installation) of an ESXi host failed", upgrading ESXi hosts with NVMe Controllers shows as failed during VCF 4.4 upgrade

Article ID: 322188


Updated On:

Products

VMware Cloud Foundation

Issue/Introduction

Symptoms:
While upgrading VMware Cloud Foundation (VCF) to 4.4, you might observe ESXi host upgrade failures in a vSAN cluster with the following symptoms:
  • ESXi host upgrade status shows as Failed with the error message "VUM Remediation (installation) of an ESXi host failed" in the SDDC Manager UI Tasks pane
esx_upgrade_failure_nvme.jpg
  • Upgrade status in the SDDC Manager UI shows the failure as in the screenshot below
esx_upgrade_failure_nvme_sddc.jpg
  • This issue is observed only on ESXi hosts in a vSAN cluster with NVMe controllers
  • Verifying the ESXi host upgrade status from vCenter Server shows that the upgrade completed successfully; however, the host does not exit Maintenance Mode automatically
  • Manually exiting the ESXi host from Maintenance Mode works fine
  • The LCM log on SDDC Manager '/var/log/vmware/vcf/lcm/lcm.log' shows errors similar to the snippet below:
2022-02-15T22:11:23.436+0000 ERROR [vcf_lcm,45b1c7f7bedaf3e3,cf7f,upgradeId=4c8d5640-441a-4928-b196-e69c69e4790a,resourceType=ESX_HOST,resourceId=46a48c28-c51f-4d5c-92cd-308c393b8c9d,bundleElementId=3742fe8e-ea75-4ed4-a3f5-6085597be6b5]
[c.v.e.s.l.p.i.e.EsxVumUpdateStageRunnerImpl,Async-10] Failed to install update due to unexpected error: VUM Remediate task failed: (vim.TaskInfo) {
   key = task-149196,
   task = ManagedObjectReference: type = Task, value = task-149196, serverGuid = 77c78dd1-f345-4987-867b-15fd0fea5634,
   descriptionId = com.vmware.vcIntegrity.RemediateTask,
   entity = ManagedObjectReference: type = HostSystem, value = host-16, serverGuid = 77c78dd1-f345-4987-867b-15fd0fea5634,
   entityName = <fqdn>,
   state = error,
   error = (vim.fault.ExtendedFault) {
      faultTypeId = com.vmware.vcIntegrity.HostPatchVsanHealthCheckFailureBeforeExitMM,
      data = (vim.KeyValue) [
         },
         (vim.KeyValue) {
            key = faultMessage,
            value =
         },
         (vim.KeyValue) {
            key = healthCheckStatus,
            value = vSAN cluster is not healthy because vSAN health check(s): com.vmware.vsan.health.test.nvmeonhcl failed
         }
      ]
   },
  • The VMware Update Manager (VUM) log file '/var/log/vmware/vmware-updatemgr/vum-server/vmware-vum-server.log' on vCenter Server shows errors similar to the snippet below (a quick way to search for this signature is sketched after the snippet):
2022-02-15T22:11:21.724Z info vmware-vum-server[09046] [Originator@6876 sub=VciRemediateTask.RemediateTask{281}] [vciTaskBase 1372] SerializeToVimFault fault:
--> (integrity.fault.HostPatchVsanHealthCheckFailureBeforeExitMM) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    healthCheckStatus = "vSAN cluster is not healthy because vSAN health check(s): com.vmware.vsan.health.test.nvmeonhcl failed"
-->    msg = ""
--> }
--> Converted fault:
--> (vim.fault.ExtendedFault) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    faultTypeId = "com.vmware.vcIntegrity.HostPatchVsanHealthCheckFailureBeforeExitMM",
-->    data = (vim.KeyValue) [
-->       (vim.KeyValue) {
-->          key = "faultCause",
-->          value = ""
-->       },
-->       (vim.KeyValue) {
-->          key = "faultMessage",
-->          value = ""
-->       },
-->       (vim.KeyValue) {
-->          key = "healthCheckStatus",
-->          value = "vSAN cluster is not healthy because vSAN health check(s): com.vmware.vsan.health.test.nvmeonhcl failed"
-->       }
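
To confirm quickly whether a given host failed on this specific health check, the fault identifiers shown above can be searched for directly in the logs. A minimal sketch, assuming the default log locations referenced in the symptoms:

# On SDDC Manager: find the vSAN health-check fault raised before exiting Maintenance Mode
grep -i "HostPatchVsanHealthCheckFailureBeforeExitMM" /var/log/vmware/vcf/lcm/lcm.log

# On the vCenter Server Appliance: find the failing NVMe-on-HCL check in the VUM log
grep -i "nvmeonhcl" /var/log/vmware/vmware-updatemgr/vum-server/vmware-vum-server.log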


Environment

VMware Cloud Foundation 4.4

Cause

This issue is caused by an error in handling the NVMe controller status, as described in vSAN Health Service - vSAN HCL Health – NVMe device can be identified.

Resolution

This is a known issue affecting VMware Cloud Foundation 4.4 environments.

Workaround:
To work around the issue, follow any one of the methods below:
  1. Disable the NVMe vSAN health check during the upgrade and re-enable it post upgrade (recommended), OR
  2. Select the appropriate NVMe controller in the vSAN VCG, OR
  3. Manually exit the host from Maintenance Mode

 

Method#1: Disable the vSAN health check for the NVMe controller during the upgrade and re-enable it post upgrade

Follow the steps below (using either the vSphere Client UI or the RVC CLI) to temporarily disable the vSAN health check for NVMe controllers during the upgrade and re-enable it once the upgrade is complete.

IMPORTANT NOTES:
  1. Because the NVMe health check is disabled during the upgrade, manually verify that the NVMe drive is listed in the vSAN HCL.
  2. This health check must be re-enabled without fail after completing the upgrade of all ESXi hosts; otherwise, vSAN Health will not report NVMe issues, which could lead to catastrophic issues in the vSAN environment.
 
Using vSphere Client - UI
  • Log in to vCenter Server using the vSphere Client (https://<vcenter_fqdn>/ui)
  • Select the vSAN cluster and go to Monitor -> vSAN -> Skyline Health
  • Select the health check "NVMe device is VMware Certified" and click "SILENCE ALERT"
NVMe_Alert.png
  • Retry the update from the SDDC Manager UI
  • IMPORTANT STEP - Once all the hosts are upgraded, re-enable the health check by selecting "NVMe device is VMware Certified" and clicking "RESTORE ALERT"
NVMe_Alert_Enable.jpg

Using RVC CLI 
  • Log in to the vCenter Server Appliance using SSH
  • Change the shell to Bash using the command 'shell' if the appliance is configured with the Appliance Shell
bash_shell_7u3c.png
  • Connect to RVC using the command below and enter the SSO administrator credentials
rvc localhost
  • Run the command "vsan.health.silent_health_check_configure -a 'nvmeonhcl' <clusterPath>" to disable the health check for the NVMe controller
<clusterPath> - Replace <clusterPath> with the full path to the cluster. In the example below, the cluster 'VSAN-Cluster' in the datacenter 'VSAN-DC' is used.

root@<hostname> [ ~ ]# rvc localhost
Warning: Permanently added 'localhost' (vim) to the list of known hosts
Using default username "administrator@vsphere.local".
password:
Welcome to RVC. Try the 'help' command.
0 /
1 localhost/
vsan.health.silent_health_check_configure -a 'nvmeonhcl' localhost/VSAN-DC/computers/VSAN-Cluster/
Successfully update silent health check list for VSAN-Cluster
  • Retry the update from the SDDC Manager UI
  • Re-enable the NVMe health check using the command below after completing the upgrade of all ESXi hosts in the VCF environment (a verification sketch follows this section):
vsan.health.silent_health_check_configure -r 'nvmeonhcl' <clusterPath>

<clusterPath> - Replace <clusterPath> with the full path to the vSAN cluster

Example :
vsan.health.silent_health_check_configure -r 'nvmeonhcl' localhost/VSAN-DC/computers/VSAN-Cluster/

For more information about silencing vSAN health checks, refer to the KB article Silencing a vSAN health check.
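
To verify which health checks are currently silenced for the cluster (for example, to confirm the check was re-enabled), RVC also offers a status command. A minimal sketch using the same cluster path as above; confirm the command is available in your RVC version:

vsan.health.silent_health_check_status localhost/VSAN-DC/computers/VSAN-Cluster/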
 

Method#2: Select the appropriate NVMe Controller in vSAN VCG

Select the correct controller with the matching PCI ID by following the steps below (a command-line sketch for finding the controller's PCI IDs follows the note):
Retry_update.png

Note: You might have to perform the above procedure for each host if the alert is raised for a host after the upgrade.
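
To find the PCI identifiers (vendor/device and sub-vendor/sub-device IDs) of the NVMe controller so it can be matched against the correct vSAN VCG entry, the device list can be pulled directly from the host. A minimal sketch run on the ESXi host; exact field names and class strings can vary by release:

# Dump the full PCI device list and review the NVMe controller entry
esxcli hardware pci list
# Optionally narrow the output to the lines around entries mentioning NVMe
esxcli hardware pci list | grep -i -B 20 nvme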

 

Method#3: Manually Exit the host from Maintenance Mode

Manually exit the ESXi host from Maintenance Mode by following the steps below (a command-line alternative is sketched after the list):
  • Log in to vCenter Server using the vSphere Client (https://<vcenter_fqdn>/ui)
  • Select the host for which the upgrade failed, according to the upgrade status in the SDDC Manager UI
  • Verify the build number of the ESXi host from the Summary page -> Hypervisor field; it will show 7.0.3, 19193900 if the host was successfully upgraded to 7.0 U3c
  • Right-click the host and click Maintenance Mode -> Exit Maintenance Mode
  • Retry the update from the SDDC Manager UI as shown in the screenshot below
Retry_update.png
  • Perform the same steps for each host where the upgrade fails with this issue and the host is stuck in Maintenance Mode
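
As an alternative to the UI steps, the build check and the exit from Maintenance Mode can also be performed from an SSH session on the affected host. A minimal sketch using standard ESXi commands; validate in your environment before use:

# Confirm the installed ESXi version and build (7.0 U3c reports build-19193900)
vmware -vl

# Check whether the host is still in maintenance mode
esxcli system maintenanceMode get

# Take the host out of maintenance mode
esxcli system maintenanceMode set --enable false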