"VUM Remediation (installation) of an ESXi host failed", upgrading ESXi hosts with NVMe Controllers shows as failed during VCF 4.4 upgrade

Article ID: 322188


Updated On:

Products

VMware Cloud Foundation

Issue/Introduction

Symptoms:
While upgrading VMware Cloud Foundation (VCF) to 4.4, you might observe ESXi host upgrade failures in a vSAN cluster with the following symptoms:
  • ESXi host upgrade status shows as Failed with the error message "VUM Remediation (installation) of an ESXi host failed" in the SDDC Manager UI Tasks pane
esx_upgrade_failure_nvme.jpg
  • Upgrade status in the SDDC Manager UI shows the failure as in the screenshot below
esx_upgrade_failure_nvme_sddc.jpg
  • This issue is observed only on ESXi hosts in a vSAN cluster with NVMe controllers
  • Verifying the ESXi host upgrade status from vCenter Server shows that the upgrade completed successfully; however, the host does not exit Maintenance Mode automatically
  • Manually exiting the ESXi host from Maintenance Mode works fine
  • The LCM log on SDDC Manager '/var/log/vmware/vcf/lcm/lcm.log' shows errors similar to the snippet below:
2022-02-15T22:11:23.436+0000 ERROR [vcf_lcm,45b1c7f7bedaf3e3,cf7f,upgradeId=4c8d5640-441a-4928-b196-e69c69e4790a,resourceType=ESX_HOST,resourceId=46a48c28-c51f-4d5c-92cd-308c393b8c9d,bundleElementId=3742fe8e-ea75-4ed4-a3f5-6085597be6b5]
[c.v.e.s.l.p.i.e.EsxVumUpdateStageRunnerImpl,Async-10] Failed to install update due to unexpected error: VUM Remediate task failed: (vim.TaskInfo) {
   key = task-149196,
   task = ManagedObjectReference: type = Task, value = task-149196, serverGuid = 77c78dd1-f345-4987-867b-15fd0fea5634,
   descriptionId = com.vmware.vcIntegrity.RemediateTask,
   entity = ManagedObjectReference: type = HostSystem, value = host-16, serverGuid = 77c78dd1-f345-4987-867b-15fd0fea5634,
   entityName = <fqdn>,
   state = error,
   error = (vim.fault.ExtendedFault) {
      faultTypeId = com.vmware.vcIntegrity.HostPatchVsanHealthCheckFailureBeforeExitMM,
      data = (vim.KeyValue) [
         },
         (vim.KeyValue) {
            key = faultMessage,
            value =
         },
         (vim.KeyValue) {
            key = healthCheckStatus,
            value = vSAN cluster is not healthy because vSAN health check(s): com.vmware.vsan.health.test.nvmeonhcl failed
         }
      ]
   },
  • The VMware Update Manager (VUM) log file '/var/log/vmware/vmware-updatemgr/vum-server/vmware-vum-server.log' on vCenter Server shows errors similar to the snippet below (a quick way to search for this signature is sketched after the snippet):
2022-02-15T22:11:21.724Z info vmware-vum-server[09046] [Originator@6876 sub=VciRemediateTask.RemediateTask{281}] [vciTaskBase 1372] SerializeToVimFault fault:
--> (integrity.fault.HostPatchVsanHealthCheckFailureBeforeExitMM) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    healthCheckStatus = "vSAN cluster is not healthy because vSAN health check(s): com.vmware.vsan.health.test.nvmeonhcl failed"
-->    msg = ""
--> }
--> Converted fault:
--> (vim.fault.ExtendedFault) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>,
-->    faultTypeId = "com.vmware.vcIntegrity.HostPatchVsanHealthCheckFailureBeforeExitMM",
-->    data = (vim.KeyValue) [
-->       (vim.KeyValue) {
-->          key = "faultCause",
-->          value = ""
-->       },
-->       (vim.KeyValue) {
-->          key = "faultMessage",
-->          value = ""
-->       },
-->       (vim.KeyValue) {
-->          key = "healthCheckStatus",
-->          value = "vSAN cluster is not healthy because vSAN health check(s): com.vmware.vsan.health.test.nvmeonhcl failed"
-->       }
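
To confirm quickly whether a given host failed on this specific health check, the fault identifiers shown above can be searched for directly in the logs. A minimal sketch, assuming the default log locations referenced in the symptoms:

# On SDDC Manager: find the vSAN health-check fault raised before exiting Maintenance Mode
grep -i "HostPatchVsanHealthCheckFailureBeforeExitMM" /var/log/vmware/vcf/lcm/lcm.log

# On the vCenter Server Appliance: find the failing NVMe-on-HCL check in the VUM log
grep -i "nvmeonhcl" /var/log/vmware/vmware-updatemgr/vum-server/vmware-vum-server.log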


Environment

VMware Cloud Foundation 4.4

Cause

This issue is caused by an error in handling the NVMe controller status, as described in vSAN Health Service - vSAN HCL Health – NVMe device can be identified.

Resolution

This is a known issue affecting VMware Cloud Foundation 4.4 environments.

Workaround:
To work around the issue, follow any one of the methods below:
  1. Disable the NVMe vSAN health check during the upgrade and re-enable it post upgrade (recommended), OR
  2. Select the appropriate NVMe controller in the vSAN VCG, OR
  3. Manually exit the host from Maintenance Mode

 

Method#1: Disable the vSAN health check for the NVMe controller during the upgrade and re-enable it post upgrade

Follow the steps below (using either the vSphere Client UI or the RVC CLI) to temporarily disable the vSAN health check for NVMe controllers during the upgrade and re-enable it once the upgrade is complete.

IMPORTANT NOTES:
  1. Because the NVMe health check is disabled during the upgrade, manually verify that the NVMe drive is listed in the vSAN HCL.
  2. This health check must be re-enabled without fail after completing the upgrade of all ESXi hosts; otherwise, vSAN Health will not report NVMe issues, which could lead to catastrophic issues in the vSAN environment.
 
Using vSphere Client - UI
  • Log in to vCenter Server using the vSphere Client (https://<vcenter_fqdn>/ui)
  • Select the vSAN cluster and go to Monitor -> vSAN -> Skyline Health
  • Select the health check "NVMe device is VMware Certified" and click "SILENCE ALERT"
NVMe_Alert.png
  • Retry the update from the SDDC Manager UI
  • IMPORTANT STEP - Once all the hosts are upgraded, re-enable the health check by selecting "NVMe device is VMware Certified" and clicking "RESTORE ALERT"
NVMe_Alert_Enable.jpg

Using RVC CLI 
  • Log in to the vCenter Server Appliance using SSH
  • Change the shell to Bash using the command 'shell' if the appliance is configured with the Appliance Shell
bash_shell_7u3c.png
  • Connect to RVC using the command below and enter the SSO administrator credentials
rvc localhost
  • Run the command "vsan.health.silent_health_check_configure -a 'nvmeonhcl' <clusterPath>" to disable the health check for the NVMe controller
<clusterPath> - Replace <clusterPath> with the full path to the cluster. In the example below, the cluster 'VSAN-Cluster' in the datacenter 'VSAN-DC' is used.

root@<hostname> [ ~ ]# rvc localhost
Warning: Permanently added 'localhost' (vim) to the list of known hosts
Using default username "administrator@vsphere.local".
password:
Welcome to RVC. Try the 'help' command.
0 /
1 localhost/
vsan.health.silent_health_check_configure -a 'nvmeonhcl' localhost/VSAN-DC/computers/VSAN-Cluster/
Successfully update silent health check list for VSAN-Cluster
  • Retry the update from the SDDC Manager UI
  • Re-enable the NVMe health check using the command below after completing the upgrade of all ESXi hosts in the VCF environment (a verification sketch follows this section):
vsan.health.silent_health_check_configure -r 'nvmeonhcl' <clusterPath>

<clusterPath> - Replace <clusterPath> with the full path to the vSAN cluster

Example :
vsan.health.silent_health_check_configure -r 'nvmeonhcl' localhost/VSAN-DC/computers/VSAN-Cluster/

For more information about silencing vSAN health checks, refer to the KB article Silencing a vSAN health check.
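
To verify which health checks are currently silenced for the cluster (for example, to confirm the check was re-enabled), RVC also offers a status command. A minimal sketch using the same cluster path as above; confirm the command is available in your RVC version:

vsan.health.silent_health_check_status localhost/VSAN-DC/computers/VSAN-Cluster/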
 

Method#2: Select the appropriate NVMe Controller in vSAN VCG

Select the correct controller with the matching PCI ID by following the steps below (a command-line sketch for finding the controller's PCI IDs follows the note):
Retry_update.png

Note: You might have to perform the above procedure for each host if the alert is raised for a host after the upgrade.
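
To find the PCI identifiers (vendor/device and sub-vendor/sub-device IDs) of the NVMe controller so it can be matched against the correct vSAN VCG entry, the device list can be pulled directly from the host. A minimal sketch run on the ESXi host; exact field names and class strings can vary by release:

# Dump the full PCI device list and review the NVMe controller entry
esxcli hardware pci list
# Optionally narrow the output to the lines around entries mentioning NVMe
esxcli hardware pci list | grep -i -B 20 nvme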

 

Method#3: Manually Exit the host from Maintenance Mode

Manually exit the ESXi host from Maintenance Mode by following the steps below (a command-line alternative is sketched after the list):
  • Log in to vCenter Server using the vSphere Client (https://<vcenter_fqdn>/ui)
  • Select the host for which the upgrade failed, according to the upgrade status in the SDDC Manager UI
  • Verify the build number of the ESXi host from the Summary page -> Hypervisor field; it will show 7.0.3, 19193900 if the host was successfully upgraded to 7.0 U3c
  • Right-click the host and click Maintenance Mode -> Exit Maintenance Mode
  • Retry the update from the SDDC Manager UI as shown in the screenshot below
Retry_update.png
  • Perform the same steps for each host where the upgrade fails with this issue and the host is stuck in Maintenance Mode
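
As an alternative to the UI steps, the build check and the exit from Maintenance Mode can also be performed from an SSH session on the affected host. A minimal sketch using standard ESXi commands; validate in your environment before use:

# Confirm the installed ESXi version and build (7.0 U3c reports build-19193900)
vmware -vl

# Check whether the host is still in maintenance mode
esxcli system maintenanceMode get

# Take the host out of maintenance mode
esxcli system maintenanceMode set --enable false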