SDDC Manager vSphere upgrade fails with the error: COMPLETED_WITH_FAILURE

Article ID: 339519


Products

VMware Cloud Foundation

Issue/Introduction

Symptoms:
  • In the workflow sub-tasks, you see entries similar to:

    Aug 10, 2017 5:13:00 PM : Successfully ran upgrade stage ESX_HOST_UPGRADE_STAGE_ENTER_MAINTENANCE_MODE,
    Aug 10, 2017 5:13:00 PM : Upgrade element resourceType: ESX_HOST resourceId: 482452b7-c5b2-4d27-9cf6-97cc27182b42 recorded stage ESX_HOST_UPGRADE_STAGE_INSTALL_UPDATE,
    Aug 10, 2017 5:13:39 PM : Successfully ran upgrade stage ESX_HOST_UPGRADE_STAGE_INSTALL_UPDATE,
    Aug 10, 2017 5:13:39 PM : Upgrade element resourceType: ESX_HOST resourceId: 482452b7-c5b2-4d27-9cf6-97cc27182b42 recorded stage ESX_HOST_UPGRADE_STAGE_REBOOT,
    Aug 10, 2017 5:38:01 PM : Upgrade element resourceType: ESX_HOST resourceId: 482452b7-c5b2-4d27-9cf6-97cc27182b42 status changed to COMPLETED_WITH_FAILURE,
    Aug 10, 2017 5:38:10 PM : Upgrade element resourceType: ESX_HOST resourceId: 213251df-2b60-4265-8da6-7f1de7bdbed8 status changed to SKIPPED,
    Aug 10, 2017 5:38:10 PM : Upgrade element resourceType: ESX_HOST resourceId: 2d09479b-9063-4198-9027-c4c53e155536 status changed to SKIPPED,
    Aug 10, 2017 5:38:10 PM : Upgrade element resourceType: ESX_HOST resourceId: f92712bd-eb76-4dca-a1f3-abd9d4bee914 status changed to SKIPPED,
    Aug 10, 2017 5:38:10 PM : Upgrade status changed to COMPLETED_WITH_FAILURE,

     
  • In the /home/vrack/lcm/logs/lcm.log file on the VRM machine, you see entries similar to:

    2017-08-11 13:32:42.143 [https-jsse-nio-192.168.100.106-9443-exec-5] WARN [com.vmware.evo.sddc.lcm.services.impl.InventoryUpgradeServiceImpl] Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0 Failed domain type: IAAS id: 7411b914-6a8a-43be-bdb5-9430afdec8e5 failed items: UpgradeItem [id=482452b7-c5b2-4d27-9cf6-97cc27182b42, type=ESX_HOST, parentId=1e9cf6dd-4035-4d62-b84e-7a52cc2a2553, parentType=VCENTER]

    "failedDomains": [

    "failedItems": [

    2017-08-11 13:34:35.470 [pool-15-thread-10] INFO [com.vmware.evo.sddc.lcm.adapter.inventory.impl.InventoryClientImpl] Adding esxi 482452b7-c5b2-4d27-9cf6-97cc27182b42 to the failed resources

    com.vmware.evo.sddc.lcm.primitive.common.connection.BasicConnection$BasicConnectionException: failed to connect: HTTP transport error:
    java.net.NoRouteToHostException: No route to host (Host unreachable) : No route to host (Host unreachable)

    com.vmware.evo.sddc.lcm.primitive.common.connection.helpers.BaseHelper$HelperException: com.vmware.evo.sddc.lcm.primitive.common.connection.BasicConnection$BasicConnectionException: failed to connect: HTTP transport error:
    java.net.NoRouteToHostException: No route to host (Host unreachable) : No route to host (Host unreachable)

    Caused by: com.vmware.evo.sddc.lcm.primitive.common.connection.BasicConnection$BasicConnectionException: failed to connect: HTTP transport error:
    java.net.NoRouteToHostException: No route to host (Host unreachable) : No route to host (Host unreachable)
     
  • The patch/VIB was applied, but the update failed at a later stage of the workflow, i.e. somewhere around Rebooting Host, Entering Maintenance Mode, or Exiting Maintenance Mode (the note below shows how to confirm that the patch itself was applied).
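
    Note: To confirm that the patch/VIB was applied on the host despite the failure, you can check from the ESXi Shell (a verification sketch, assuming shell or SSH access to the host is enabled):

    esxcli software vib list (lists the installed VIBs with their versions and install dates)
    esxcli software profile get (shows the image profile currently applied to the host)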


Environment

VMware Cloud Foundation 2.0.x
VMware Cloud Foundation 2.1.x

Resolution

To work around this issue, reboot the ESXi host in question, or manually enter it into or exit it from maintenance mode, depending upon the stage at which the update failed.
 
This issue can occur for various reasons. A common contributing factor is a break in communication between the VRM VM and the ESXi host being upgraded: the VRM can no longer send instructions to the ESXi host, so the upgrade fails.

Depending upon the stage at which the upgrade fails, do one of the following (if the vSphere Client is unavailable, a command-line alternative is shown after this list):

  • Failed at Entering Maintenance Mode:
     
    1. In the vSphere Client inventory, locate the host in question.
    2. Right-click the host and select Enter Maintenance Mode.
    3. Restart the workflow.
       
  • Failed at Exiting Maintenance Mode:
     
    1. In the vSphere Client inventory, locate the host in question.
    2. Right-click the host and select Exit Maintenance Mode.
    3. Restart the workflow.
       
  • Failed at Rebooting Host:
     
    1. In the vSphere Client inventory, locate the host in question.
    2. Right-click the host and select Reboot.
    3. Restart the workflow.
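
The same operations can also be performed directly on the host from the ESXi Shell or over SSH (a minimal sketch, assuming shell or SSH access to the host is enabled):

    esxcli system maintenanceMode set --enable true (enters maintenance mode)
    esxcli system maintenanceMode set --enable false (exits maintenance mode)
    reboot (reboots the host)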

After this, ensure that the VRM VM can ping the ESXi host. If it is unable to ping the host, do the following.
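
For example, from the VRM VM (substitute the management IP address or FQDN of the failed ESXi host):

    ping -c 4 <ESXi_host_management_IP>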

Ensure that traffic from the ToR switch is being passed through. If it is not, reset the ports connecting the ESXi host to the VRM VM, or check the switch/port configuration.

If data packets are not being dropped at the switch and the underlying network layer is fine, check whether the NICs on the host are down. If they are down, reset them using these commands (the link state can be verified with the command shown after them):

  • To change the link state of the physical interface to down:

    esxcli network nic down -n vmnicX (where X is the vmnic number)
     
  • To change the link state of the physical interface to up:

    esxcli network nic up -n vmnicX (where X is the vmnic number)
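
To verify the link state before and after resetting, list the NICs on the host:

    esxcli network nic list (shows each vmnic with its link status, speed, duplex, and driver)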

If the NIC still does not come up, physically unplug it and plug it back in.

Once the NICs come back up and connectivity is re-established, restart the workflow.

Note: Check the firmware/driver version of the NICs on the ESXi host. If it is not the latest supported version per the VMware Compatibility Guide, update it to the most recent one.
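
To find the driver and firmware versions to compare against the VMware Compatibility Guide, query the NIC on the ESXi host (where X is the vmnic number):

    esxcli network nic get -n vmnicX (shows the driver name, driver version, and firmware version for the NIC)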

